How can I delete the columns in DataFlow Task in SSIS?
I am using SQL Server 2016 and I have a very busy Data Flow Task. In my Data Flow Task, I use a Multicast component. After creating a new flow in my Data Flow, I need to delete some of the columns in the new flow because they are useless. For context: I have more than 200 columns in my flow and I need fewer than 10 of them.
How can I delete the columns in a Data Flow Task in SSIS?
3 Solutions
You can add an extra component of some sort. However, this will never reduce complexity or improve performance. Just thinking about it, logically, you are adding an additional interface that needs to be maintained. Performance-wise, anything that will eliminate columns means copying one set of rows from one buffer to a whole other buffer. This is called an asynchronous transformation, and it is better described here and here. You can imagine that copying rows is less efficient than updating them in place.
Here are some recommendations for reducing complexity, which will, in turn, improve performance:
- Reduce the columns at the source. If you are selecting columns that are not subsequently used in any way, remove them from the query or uncheck them in the source component. Removing columns this way removes them from the buffer, which will occupy less memory.
- Reduce the number of components in the dataflow. Very long dataflows are easy to create, a pain to test and even harder to maintain. Dataflows are meant for a unit of work, i.e. a data stream from here to there with a few things in the middle. This is where dataflows shine; in fact, they protect themselves from complexity with memory limitations and a maximum number of threads. It is better to divide the work into separate dataflows or stored procedures. You could, for example, stage the data into a table and read it twice, rather than use a Multicast.
- Use the database. SSIS is as much an orchestration tool as it is a data-moving tool. I have often found that using simple dataflows to stage the data, followed by calls to stored procedures to process the data, out-performs an all-in-one dataflow.
- Increase the number of times you write the data. This is completely counter-intuitive, but if you process data in smaller sets of operations, it runs faster and is easier to test. Given a clean slate, I will often design an ETL to write data from the source to a staging table, perform a cleansing step from the staging table to another, optionally add a conforming step to combine data from different sources into yet another table and, finally, a last step to load the target table. Note that each source is pushed to its own target table and later combined, leveraging the database. The first and last steps are set up to run fast and to avoid locking or blocking at either end.
- Bulk load. The prior step works well when you ensure that bulk loading is actually happening. This can be tricky, but generally you can get there by using “fast load” in the OLE DB Destination and by never using the OLE DB Command transformation. Removing indexes and re-adding them afterwards is faster than loading in place (with few exceptions).
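As a minimal sketch of the first recommendation, the source query itself can name only the columns the flow needs, so the unwanted 200+ columns never enter the buffer. The table and column names below are hypothetical examples, not from the question:

```sql
-- Hypothetical query for an OLE DB Source: select only the
-- handful of columns the dataflow actually uses, instead of
-- SELECT * followed by unchecking columns in the designer.
SELECT CustomerID,
       OrderDate,
       TotalAmount
FROM dbo.SourceOrders;
```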
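The stage-and-read-twice alternative to a Multicast mentioned above could be sketched like this; again, the table and column names are hypothetical:

```sql
-- After one dataflow fast-loads the source into dbo.StageOrders,
-- each downstream consumer queries only what it needs, instead of
-- both paths sharing one wide multicast buffer.

-- Path 1: the narrow consumer reads fewer than 10 columns.
SELECT CustomerID, OrderDate, TotalAmount
FROM dbo.StageOrders;

-- Path 2: the consumer that genuinely needs all columns.
SELECT *
FROM dbo.StageOrders;
```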
These guidelines will get you headed in the general direction, but do post more questions for tuning specific performance problems.
I believe that you can pass a single data flow path to a Union All component to remove columns from that path. Take the data flow path that you would like to remove columns from and connect it to a Union All component. Then open the Union All editor, right-click on the column(s) you would like to remove from that path, and select Delete.
Usually I think the source of the data should be altered to not send the unwanted columns out, but your case is special: one path out of the Multicast needs all of the columns from the source, while the other does not.
First of all, I don’t think that what you are asking will give better performance, because the data is loaded from the source and then multiplied by the Multicast before it ever reaches the component that reduces the column count.
That said, you can do this in multiple ways:
- If you can create another Data Flow Task with a reduced-columns source (e.g. an OLE DB Source that selects only the specific columns), that is better.
- You can add a Script Component with an asynchronous output (as shown in the image below), add the specified columns to the output, and map them using a VB.NET or C# script, something like this:
Output0Buffer.AddRow()
Output0Buffer.OutColumn = Row.InColumn
- Add a Union All component and select only the columns you need.
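For the Script Component option, the per-row logic is typically just a few lines. This is a hedged C# sketch of the designer-generated `ProcessInputRow` override; `Input0Buffer`, `Output0Buffer`, and the column names are placeholders for the types SSIS generates in your own package:

```csharp
// Sketch only: this method lives inside the SSIS-generated
// ScriptMain class and will not compile outside a Script
// Component project.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Asynchronous output: add a new row to the output buffer,
    // then copy across only the columns you kept.
    Output0Buffer.AddRow();
    Output0Buffer.OutColumn = Row.InColumn;
}
```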
Side note: it is good to test the performance of each scenario and choose the best one.