Pivot stage too slow

sarathchandrakt · Post by **sarathchandrakt** » Mon Apr 20, 2015 8:49 am

My job structure is something like this,

DataSource>Remove Duplicates>Pivot Stage>Transformer>TargetDB

I have 3million records to process. My pivot stage is taking 90mins to do that. I have 8 input columns to Pivot stage out of which 2 need to be pivoted vertically.

In the beginning I grouped by all other 6 columns. Then I reduced it to only 1 column. Still there is no improvment in performance. I used Hash and Roundrobin partitioning. Hash actually decreased the performance where as Roundrobin brought doen the procesing time to 80Mins.

Can you suggest any other steps the improve the performance of Pivot stage?

Thank You.

ShaneMuir · Post by **ShaneMuir** » Mon Apr 20, 2015 9:39 am

Not sure its actually the pivot stage but more likely the sorting that is being done.

Before a remove duplicates and a pivot stage I am pretty sure that there is an sort inserted (as the data needs to be sorted prior to performing these activities).

Is the duplicate key the same as the pivot key? (or is the pivot key at least a subset of the duplicate key fields?)

Try and sort and partition your data as early as possible, either in your data source or using a sort stage, and where possible avoid repartitioning by setting the input links to the remove duplicates and pivot stages to have a partitioning of SAME.

sarathchandrakt · Post by **sarathchandrakt** » Mon Apr 20, 2015 11:36 am

Thank You Shane. I added a Sort between Pivot and DataSource and changed "allow duplicates" to false in Sort stage, so I don't need another remove duplicates. I partitioned the data in sort stage and kept it as same in Pivot. It worked like charm. Thank You again.