
Pivot stage giving very poor performance

Posted: Thu Jun 21, 2007 3:51 pm
by sidhartha_saha
I have about 5 million rows coming in as input. Each row has 28 value columns in addition to the key columns, and I need to generate 28 rows from each input row. So the output, which is fed to a Teradata Enterprise stage, is 5 million * 28 = 140 million rows. I am using the Pivot stage to achieve this.

But I'm getting very poor performance out of the Pivot stage. Any idea how we can improve the performance, or use some other stage to achieve the same result?

Posted: Thu Jun 21, 2007 3:57 pm
by ray.wurlod
Welcome aboard. :D

Think about what the Pivot stage is doing. It takes a lot of resources, particularly memory, to process that many rows. There is nothing you can do about that.

You can create your own Build stage to pivot your data according to your specific requirements, which removes the compromises a general-purpose Pivot stage necessarily makes.

Posted: Thu Jun 21, 2007 4:01 pm
by sidhartha_saha
Can you please elaborate on how we can do this ?

Posted: Thu Jun 21, 2007 4:05 pm
by ray.wurlod
Select the Stage Types branch of your Repository and choose New Parallel Stage. Fill in your C++ code on the appropriate tabs, then use the stage in a job design.
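
For what it's worth, here is a rough, standalone C++ sketch of what the per-record logic of such a stage has to do. The record shapes and the printf standing in for the write-record call are illustrative assumptions only, not the Build stage's actual column accessors or macros.

#include <cstdio>

// Hypothetical record shapes standing in for the stage's input and output links.
struct InRec  { long key_id; double vals[28]; };
struct OutRec { long key_id; int col_num; double value; };

// Emit one output record per value column of the input record. In a real
// Build stage the loop body would assign to the output link's columns and
// invoke the stage's write-record call rather than printf.
void pivotRecord(const InRec &in)
{
    for (int i = 0; i < 28; ++i) {
        OutRec out;
        out.key_id  = in.key_id;    // carry the key through unchanged
        out.col_num = i + 1;        // which source column this value came from
        out.value   = in.vals[i];   // the pivoted value itself
        std::printf("%ld|%d|%g\n", out.key_id, out.col_num, out.value);
    }
}

int main()
{
    InRec sample = { 42, { 1.5, 2.5 } };   // remaining columns default to 0.0
    pivotRecord(sample);                   // one input row becomes 28 output rows
    return 0;
}

The loop is trivial; the point is that a purpose-built stage does nothing but this, with none of the generality the Pivot stage carries.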

Posted: Thu Jun 21, 2007 7:13 pm
by vijayrc
Yes... Pivot performance is considerably poor. You can build BuildOp stages in C++ to do whatever you currently use the Pivot stage for.

Posted: Fri Jun 22, 2007 6:31 am
by chulett
There's a poster by the name of Larry.Griffith who has written pivot BuildOps. One of the posts where he mentions that is in here. It might be worth contacting him to see if he can help, if he doesn't stop by and post something here.

Posted: Fri Jun 22, 2007 9:25 am
by Larry.Griffith
I have written wrapped stages that pivot both tall and flat, and they run in parallel mode.

I wonder what you consider good performance. By wrapping a simple C routine I can pivot a flat fixed-width file of 1.65MM records with 24 columns into 14.9MM rows (blank columns dropped), each carrying key_id, column number, and value, on an AIX box in 40 seconds. We are running 3 nodes.

Changing the options of the wrapped stage to write out blank columns as well (for comparison purposes, because the DataStage Pivot has no option to suppress them), the same job took 1:25. It would be faster if the data were cleaner and the file were delimited instead of fixed-width.
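
To give a feel for it, here is a minimal sketch (not my actual routine; the pipe-delimited layout and the blank-column handling are assumptions for illustration) of the kind of filter a Wrapped stage can drive:

#include <iostream>
#include <sstream>
#include <string>

// Minimal pivot filter of the sort a Wrapped stage can invoke: reads
// pipe-delimited rows of the form key|v1|v2|...|vN on stdin and writes
// one key|column_number|value row per non-blank value on stdout.
int main()
{
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream fields(line);
        std::string key, value;
        if (!std::getline(fields, key, '|'))
            continue;                        // skip empty lines
        int colNum = 0;
        while (std::getline(fields, value, '|')) {
            ++colNum;
            if (value.empty())
                continue;                    // drop blank columns
            std::cout << key << '|' << colNum << '|' << value << '\n';
        }
    }
    return 0;
}

The Wrapped stage simply runs the executable on each partition, feeding it the input link on stdin and reading the output link from stdout, which is why it parallelises so cheaply.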

So that you can benchmark against the DataStage Pivot, I ran the same file through it and it took 19:31.


What kind of performance are you getting/looking for?

Posted: Fri Jun 22, 2007 11:55 am
by sidhartha_saha
Thanks all for your suggestions.

We have a table with 28 value columns in addition to the key columns, and I need to convert every such row into 28 rows. There are about 5 million rows in the input. We have written a BuildOp stage as per the suggestions, but that doesn't seem to make much of a difference.

Posted: Sat Jun 23, 2007 7:07 am
by ray.wurlod
That, along with the coding exercise itself, should tell you something about the amount of work that is involved.

It's time to manage expectations better.