Performance issues with transpose and insert large, variable column data files into SQL Server
I’m currently working on a project where we have a large data warehouse which imports several GB of data on a daily basis from a number of different sources. We have a lot of files with different formats and structures all being imported into a couple of base tables which we then transpose/pivot through stored procs. This part works fine. The initial import however, is awfully slow.
We can’t use SSIS File Connection Managers as the columns can be totally different from file to file so we have a custom object model in C# which transposes rows and columns of data into two base tables; one for column names, and another for the actual data in each cell, which is related to a record in the attribute table.
Example – Data Files:
alt text http://i50.tinypic.com/2ypkgf9.jpg
Example – DB tables:
alt text http://i45.tinypic.com/2iqhkoy.jpg
The SQL insert is performed currently by looping through all the data rows and appending the values to a SQL string. This constructs a large dynamic string which is then executed at the end via SqlCommand.
The problem is, even running in a 1MB file takes about a minute, so when it comes to large files (200MB etc) it takes hours to process a single file. I’m looking for suggestions as to other ways to approach the insert that will improve performance and speed up the process.
There are a few things I can do with the structure of the loop to cut down on the string size and number of SQL commands present in the string but ideally I’m looking for a cleaner, more robust approach. Apologies if I haven’t explained myself well, I’ll try and provide more detail if required.
Any ideas on how to speed up this process?
3 Solutions collect form web for “Performance issues with transpose and insert large, variable column data files into SQL Server”
The dynamic string is going to be SLOW. Each SQLCommand is a separate call to the database. You are much better off streaming the output as a bulk insertion operation.
I understand that all your files are different formats, so you are having to parse and unpivot in code to get it into your EAV database form.
However, because the output is in a consistent schema you would be better off either using separate connection managers and the built-in unpivot operator, or in a script task adding multiple rows to the data flow in the common output (just like you are currently doing in building your SQL INSERT…INSERT…INSERT for each input row) and then letting it all stream into a destination.
i.e. Read your data and in the script source, assign the FileID, RowId, AttributeName and Value to multiple rows (so this is doing the unpivot in code, but instead of generating a varying number of inserts, you are just inserting a varying number of rows into the dataflow based on the input row).
Then pass that through a lookup to get from AttributeName to AttributeID (erroring the rows with invalid attributes).
Stream straight into an OLEDB destination, and it should be a lot quicker.
One thought – are you repeatedly going back to the database to find the appropriate attribute value? If so, switching the repeated queries to a query against a recordset that you keep at the clientside will speed things up enormously.
This is something I have done before – 4 reference tables involved. Creating a local recordset and filtering that as appropriate caused a speed up of a process from 2.5 hours to about 3 minutes.
Why not store whatever reference tables are needed within each database and perform all lookups on the database end? Or it may even be better to pass a table type into each database where keys are needed, store all reference data in one central database and then perform your lookups there.