Transpose a row Question in Parallel jobs

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pavankvk
Participant
Posts: 202
Joined: Thu Dec 04, 2003 7:54 am

Transpose a row Question in Parallel jobs

Post by pavankvk »

Hi

I have the following requirement.

I have 2 rows:
A,P1
A,P2

A is the key, and I want the output row to be

A,P1,P2

I did this once using awk, but it's too slow for huge volumes of data. So I want this to be done in DataStage. Is it possible?

Thanks in advance,
pavan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What you're trying to achieve is called a vertical pivot - you can search the Forum for techniques. For a small, finite number of rows the stage variables approach is indicated, but you need to ensure that all values for each pivot key occur on the same processing node. Do this by partitioning and sorting on the key. Sort additionally on the non-key column.
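Outside DataStage, the equivalent preparation step is just a sort on the key and then on the non-key column before any pivot logic runs. A minimal sketch, with the file name and comma delimiter assumed:

Code: Select all

# Sort on the key (field 1), then on the non-key column (field 2), so that
# all rows for each key arrive together and in a stable order before pivoting.
sort -t, -k1,1 -k2,2 input.csv > input_sorted.csv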
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pavankvk
Participant
Posts: 202
Joined: Thu Dec 04, 2003 7:54 am

Post by pavankvk »

ray.wurlod wrote:What you're trying to achieve is called a vertical pivot - you can search the Forum for techniques. For a small, finite number of rows the stage variables approach is indicated, but you need to ensure that all values for each pivot key occur on the same processing node. Do this by partitioning and sorting on the key. Sort additionally on the non-key column.
Ray

Thanks for replying. I searched for vertical pivot and ended up at many posts asking the same question, with the same reply from you :-)), but I couldn't find the exact design suggestion. I tried the Aggregator, but my data is 50 million+ rows and I had problems like the Aggregator crashing, so I was looking for some other approach. I know it's not fair to ask, but can you please post it here if it's a simple design? I am using Parallel Extender.
ameyvaidya
Charter Member
Posts: 166
Joined: Wed Mar 16, 2005 6:52 am
Location: Mumbai, India

Post by ameyvaidya »

Try this one: viewtopic.php?t=85702&postdays=0&postor ... 0

It details a server solution, but it will work in a parallel Transformer too.
Amey Vaidya
I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.
- Douglas Adams
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I wrote a solution to the specific problem at hand using awk, and it is very fast. On my system it processed 2.5 million records at 712 bytes/record (a 1,780,000,000-byte file), with A changing every 10 values, in 140 seconds. It processed 25 million records at 10 bytes/record in 59 seconds. It does slow down when each value of A spans many rows: with 144 bytes/record, the 2.5 million record file with A changing every 1000 values took 400 seconds.

Note: You will want to have your file sorted by field/column 1.

I tested a normal pivot: it ran in 85 minutes using the Pivot stage, while awk took only 4 minutes. So I have a hard time believing that DataStage can outperform awk on the vertical pivot.

Code: Select all

#!/bin/ksh

#################################################################################################
##
## Author: Ryan Putnam
##
## Revise: 01/03/2005   Ryan Putnam     Created
##
#################################################################################################

. ~/.dsadm

# Reads comma-delimited rows (key in field 1) from stdin, sorted on the key,
# and concatenates the field-2 values of each key into a single output row.
cat - \
| awk 'BEGIN{FS=",";OFS=",";FOUND=1;FIRST=0;}
{

if (FIRST != 0)
{
   # Key change: emit the accumulated record and flag that a new one must start.
   if ($1 != PVAL)
   {
      print STR;
      FOUND=0;
   }

   if (FOUND == 0)
   {
      # Start a new record with the new key and its first value.
      STR=$1 OFS $2;
      FOUND=1;
   }
   else
   {
      # Same key: append the non-key value to the current record.
      STR=STR OFS $2;
   }

   PVAL=$1;
}
else
{
   # First input record: initialise the accumulator.
   STR=$1 OFS $2;
   FIRST=1;
   PVAL=$1;
   FOUND=1;
}

}
END{if (FIRST == 1) {print STR;}}'   # flush the last record, if any input was read

exit 0
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I forgot that you can set ORS to an empty string. This version is much faster and does not slow down when a key spans many rows; it just keeps streaming the input out until the key changes.

Code: Select all

#!/bin/ksh

#################################################################################################
##
## Author: Ryan Putnam
##
## Revise: 01/03/2005   Ryan Putnam     Created
##
#################################################################################################

. ~/.dsadm

# Reads comma-delimited rows (key in field 1) from stdin, sorted on the key.
# With ORS set to an empty string, values are streamed straight out; a newline
# is written only when the key changes, which terminates the previous record.
cat - \
| awk 'BEGIN{FS=",";OFS=",";ORS="";FIRST=1;}
{

if ($1 != PVAL)
{
   if (FIRST == 0)
   {
      # New key: end the previous record, then start the next one.
      print "\n"$1 OFS $2;
   }
   else
   {
      # Very first record: nothing to terminate yet.
      FIRST=0;
      print $1 OFS $2;
   }
}
else
{
   # Same key: append the non-key value to the current record.
   print OFS $2;
}

PVAL=$1;
}
END{if (FIRST == 0) {print "\n";}}'   # terminate the final record with a newline

exit 0
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I don't recall anyone claiming that the Pivot stage was faster than awk.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I believe Pavan originally stated that an awk script he wrote was too slow and that he wanted to code this in DataStage because it is faster?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

With parallel processing, DataStage should be faster than an awk script.

If you have a static vertical pivot, where you know how many rows belong to each key, then it is easy to do in a parallel job. If you have a dynamic number of rows for each key, it becomes quite difficult. In the example ameyvaidya linked, a particular value indicated the end of a set of keys and provided the output trigger, so that was a predictable dynamic vertical pivot.

A transformer vertical pivot works by outputting a row after all rows for a key have been collected; you only know you have them all when you find a row with a new key. This means you actually output the previous rows' data, not the current row's data.

When you get to the end of the input you need a dummy row so you can output the last key combination; in DataStage there is no easy way to determine that you are at the end of the data, so the dummy row becomes necessary. In a parallel job, however, a dummy row would only be delivered to one of the transformer partitions and the others would miss out.

Now if you have a static number of rows per key, you can simply use a counter to tell you when to output a row. Let's assume you are trying to vertically pivot 2 rows at a time; this means you can output every second row. The stage variable derivations would be:

svKeyCount: if input.custid = svPreviousCustId then svKeyCount + 1 else 1
svPreviousCustId: input.custid
svCustValues: if svKeyCount = 1 then input.custcode else svCustValues : "," : NullToValue(input.custcode, ' ')

Your constraint is to only output a row when svKeyCount = 2. You output input.custid and svCustValues. svCustValues collects your data across the two rows.
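Outside DataStage, the same counter logic can be sketched in awk for comparison. This is an illustration only: it assumes comma-delimited input sorted by the key in the first field and exactly two rows per key, mirroring the svKeyCount/svCustValues derivations above.

Code: Select all

#!/bin/ksh

# Illustrative equivalent of the two-rows-per-key counter approach above.
# Assumes comma-delimited input on stdin, sorted by the key in field 1,
# with exactly two rows per key.
awk 'BEGIN{FS=",";OFS=",";}
{
   if ($1 == PVAL)
      CNT=CNT+1;        # same key as the previous row
   else
      CNT=1;            # new key: reset the counter

   if (CNT == 1)
      STR=$1 OFS $2;    # first row of the pair: start the output record
   else
      STR=STR OFS $2;   # second row: append the non-key value

   if (CNT == 2)
      print STR;        # output every second row

   PVAL=$1;
}'

exit 0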
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

DataStage might be faster on the vertical pivot, but on the horizontal pivot (even in parallel) awk is still 20 times faster. I have a case open with Ascential about the terrible performance of that stage. But that is a different issue: horizontal vs. vertical. Or maybe it isn't; since the stage is called just Pivot, it should support both flavors.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Ultramundane wrote:I believe Pavan originally stated that an awk script he wrote was too slow and that he wanted to code this in DataStage because it is faster?
No, I've re-read the original post. He only asked whether it is possible in DataStage. He did claim that awk was too slow, but made no claim about DataStage being faster.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I think one can reasonably deduce that the statement below did carry an implication that DataStage would be faster.

"I did this once using awk,but its too slow for huge volumes of data. So i want this to be done in datastage."
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Tie-breaker comment: I read it the same way as Ultramundane. Hopefully a user wouldn't be looking for a slower solution using DataStage :lol: therefore, DataStage must be the faster solution.

The proof is in the pudding, but I would hazard a guess that PX with parallel capabilities should tackle the pivot faster than awk. I might even go as far as betting a box of donuts on it, but just the glazed ones, not the chocolate with the sprinkles on top.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Ultramundane wrote:DataStage might be faster on the vertical pivot, but on the horizontal pivot (even in parallel) awk is still 20 times faster. I have a case open with Ascential about the terrible performance of that stage. But that is a different issue: horizontal vs. vertical. Or maybe it isn't; since the stage is called just Pivot, it should support both flavors.
Thanks for the tip. Vertical pivoting in the transformer stage has a small overhead in using the compiled stage. It is extremely efficient in that it only collects and holds a small amount of information in memory and refreshes that information with each key change. The pre-sorting does the hard work and a parallel sort is quite fast. It can deliver the rows to subsequent stages as it pivots which may give it the edge over awk.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

vmcburney wrote:Thanks for the tip. Vertical pivoting in the transformer stage has a small overhead in using the compiled stage. It is extremely efficient in that it only collects and holds a small amount of information in memory and refreshes that information with each key change. The pre-sorting does the hard work and a parallel sort is quite fast. It can deliver the rows to subsequent stages as it pivots which may give it the edge over awk.
I tried the method for the vertical pivot and the performance is about equal on my system. Awk was actually slightly faster, but that is probably because my system only has two CPUs. Others would likely see better performance with more nodes and CPUs.

I also wanted to come up with a solution for horizontal pivoting in DS without the Pivot stage. (I am sure there are many, but I am no expert.)

I found the solution below to perform about as well as awk, and about 20 times faster than the built-in Pivot stage, on my system.

Use one Source stage per column to be pivoted, each reading the key and that column. Send the output of each Source to its own intermediate Transformer to remap the column names and change data types if necessary. Send all output from the Transformers to an intermediate Funnel, then to an intermediate Sort, and finally to the Target.

Source -> Transformer --\
                          => Funnel -> Sort -> Target
Source -> Transformer --/
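For comparison, the same horizontal pivot can be sketched in awk: each input row (key in the first field, the remaining fields to be pivoted) produces one output row per non-key field. The comma delimiter and layout are assumptions for illustration.

Code: Select all

#!/bin/ksh

# Illustrative horizontal pivot: for each input row, emit one output row per
# non-key field, paired with the key in field 1. Assumes comma-delimited input.
awk 'BEGIN{FS=",";OFS=",";}
{
   for (i = 2; i <= NF; i++)
      print $1, $i;
}'

exit 0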