My job structure is something like this,
DataSource>Remove Duplicates>Pivot Stage>Transformer>TargetDB
I have 3million records to process. My pivot stage is taking 90mins to do that. I have 8 input columns to Pivot stage out of which 2 need to be pivoted vertically.
In the beginning I grouped by all other 6 columns. Then I reduced it to only 1 column. Still there is no improvment in performance. I used Hash and Roundrobin partitioning. Hash actually decreased the performance where as Roundrobin brought doen the procesing time to 80Mins.
Can you suggest any other steps the improve the performance of Pivot stage?
Thank You.
Pivot stage too slow
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 50
- Joined: Fri Aug 29, 2014 1:32 pm
- Location: Mumbai
Not sure its actually the pivot stage but more likely the sorting that is being done.
Before a remove duplicates and a pivot stage I am pretty sure that there is an sort inserted (as the data needs to be sorted prior to performing these activities).
Is the duplicate key the same as the pivot key? (or is the pivot key at least a subset of the duplicate key fields?)
Try and sort and partition your data as early as possible, either in your data source or using a sort stage, and where possible avoid repartitioning by setting the input links to the remove duplicates and pivot stages to have a partitioning of SAME.
Before a remove duplicates and a pivot stage I am pretty sure that there is an sort inserted (as the data needs to be sorted prior to performing these activities).
Is the duplicate key the same as the pivot key? (or is the pivot key at least a subset of the duplicate key fields?)
Try and sort and partition your data as early as possible, either in your data source or using a sort stage, and where possible avoid repartitioning by setting the input links to the remove duplicates and pivot stages to have a partitioning of SAME.
-
- Participant
- Posts: 50
- Joined: Fri Aug 29, 2014 1:32 pm
- Location: Mumbai