Transpose a row Question in Parallel jobs

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pavankvk
Participant
Posts: 202
Joined: Thu Dec 04, 2003 7:54 am

Transpose a row Question in Parallel jobs

Post by pavankvk »

Hi

I have the following requirement.

I have 2 rows:
A,P1
A,P2

A is the key, and I want the output row to be

A,P1,P2

I did this once using awk, but it's too slow for huge volumes of data. So I want this to be done in DataStage. Is it possible?

Thanks in advance,
pavan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What you're trying to achieve is called a vertical pivot - you can search the Forum for techniques. For a small, finite number of rows the stage variables approach is indicated, but you need to ensure that all values for each pivot key occur on the same processing node. Do this by partitioning and sorting on the key. Sort additionally on the non-key column.
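Outside DataStage, the equivalent preparation step is just a sort on the key and then on the non-key column before any pivot logic runs. A minimal sketch, with the file name and comma delimiter assumed:

Code: Select all

# Sort on the key (field 1), then on the non-key column (field 2), so that
# all rows for each key arrive together and in a stable order before pivoting.
sort -t, -k1,1 -k2,2 input.csv > input_sorted.csv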
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pavankvk
Participant
Posts: 202
Joined: Thu Dec 04, 2003 7:54 am

Post by pavankvk »

ray.wurlod wrote:What you're trying to achieve is called a vertical pivot - you can search the Forum for techniques. For a small, finite number of rows the stage variables approach is indicated, but you need to ensure that all values for each pivot key occur on the same processing node. Do this by partitioning and sorting on the key. Sort additionally on the non-key column.
Ray

Thanks for replying. I searched for vertical pivot and ended up at many posts asking the same question, with the same reply from you :-)), but I couldn't find the exact design suggestion. I tried the Aggregator, but my data is 50 million+ rows and I had problems like the Aggregator crashing, so I was looking for some other approach. I know it's not fair to ask, but can you please post it here if it's a simple design? I am using Parallel Extender.
ameyvaidya
Charter Member
Posts: 166
Joined: Wed Mar 16, 2005 6:52 am
Location: Mumbai, India

Post by ameyvaidya »

Try this one: viewtopic.php?t=85702&postdays=0&postor ... 0

It details a server solution, but it will work in a parallel Transformer too.
Amey Vaidya
I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.
- Douglas Adams
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I wrote a solution to the specific problem at hand using awk, and it is very fast. On my system it processed 2.5 million records at 712 bytes/record (a 1,780,000,000-byte file), with A changing every 10 values, in 140 seconds. It processed 25 million records at 10 bytes/record in 59 seconds. It does slow down when each value of A spans many rows: with 144 bytes/record, the 2.5 million record file with A changing every 1000 values took 400 seconds.

Note: You will want to have your file sorted by field/column 1.

I tested a normal pivot: it ran in 85 minutes using the Pivot stage, while awk took only 4 minutes. So I have a hard time believing that DataStage can outperform awk on the vertical pivot.

Code: Select all

#!/bin/ksh

#################################################################################################
##
## Author: Ryan Putnam
##
## Revise: 01/03/2005   Ryan Putnam     Created
##
#################################################################################################

. ~/.dsadm

# Reads comma-delimited rows (key in field 1) from stdin, sorted on the key,
# and concatenates the field-2 values of each key into a single output row.
cat - \
| awk 'BEGIN{FS=",";OFS=",";FOUND=1;FIRST=0;}
{

if (FIRST != 0)
{
   # Key change: emit the accumulated record and flag that a new one must start.
   if ($1 != PVAL)
   {
      print STR;
      FOUND=0;
   }

   if (FOUND == 0)
   {
      # Start a new record with the new key and its first value.
      STR=$1 OFS $2;
      FOUND=1;
   }
   else
   {
      # Same key: append the non-key value to the current record.
      STR=STR OFS $2;
   }

   PVAL=$1;
}
else
{
   # First input record: initialise the accumulator.
   STR=$1 OFS $2;
   FIRST=1;
   PVAL=$1;
   FOUND=1;
}

}
END{if (FIRST == 1) {print STR;}}'   # flush the last record, if any input was read

exit 0
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I forgot that you can set ORS to an empty string. This version is much faster and does not slow down when a key spans many rows; it just keeps streaming the input out until the key changes.

Code: Select all

#!/bin/ksh

#################################################################################################
##
## Author: Ryan Putnam
##
## Revise: 01/03/2005   Ryan Putnam     Created
##
#################################################################################################

. ~/.dsadm

# Reads comma-delimited rows (key in field 1) from stdin, sorted on the key.
# With ORS set to an empty string, values are streamed straight out; a newline
# is written only when the key changes, which terminates the previous record.
cat - \
| awk 'BEGIN{FS=",";OFS=",";ORS="";FIRST=1;}
{

if ($1 != PVAL)
{
   if (FIRST == 0)
   {
      # New key: end the previous record, then start the next one.
      print "\n"$1 OFS $2;
   }
   else
   {
      # Very first record: nothing to terminate yet.
      FIRST=0;
      print $1 OFS $2;
   }
}
else
{
   # Same key: append the non-key value to the current record.
   print OFS $2;
}

PVAL=$1;
}
END{if (FIRST == 0) {print "\n";}}'   # terminate the final record with a newline

exit 0
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I don't recall anyone claiming that the Pivot stage was faster than awk.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I believe Pavan originally stated that an awk script he wrote was too slow and that he wanted to code this in DataStage because it is faster?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

With parallel processing, DataStage should be faster than an awk script.

If you have a static vertical pivot, where you know how many rows belong to each key, then it is easy to do in a parallel job. If you have a dynamic number of rows for each key, it becomes quite difficult. In the example ameyvaidya linked, a particular value indicated the end of a set of keys and provided the output trigger, so that was a predictable dynamic vertical pivot.

A transformer vertical pivot works by outputting a row after all rows for a key have been collected; you only know you have them all when you find a row with a new key. This means you actually output the previous rows' data, not the current row's data.

When you get to the end of the input you need a dummy row so you can output the last key combination; in DataStage there is no easy way to determine that you are at the end of the data, so the dummy row becomes necessary. In a parallel job, however, a dummy row would only be delivered to one of the transformer partitions and the others would miss out.

Now if you have a static number of rows per key, you can simply use a counter to tell you when to output a row. Let's assume you are trying to vertically pivot 2 rows at a time; this means you can output every second row. The stage variable derivations would be:

svKeyCount: if input.custid = svPreviousCustId then svKeyCount + 1 else 1
svPreviousCustId: input.custid
svCustValues: if svKeyCount = 1 then input.custcode else svCustValues : "," : NullToValue(input.custcode, ' ')

Your constraint is to only output a row when svKeyCount = 2. You output input.custid and svCustValues. svCustValues collects your data across the two rows.
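Outside DataStage, the same counter logic can be sketched in awk for comparison. This is an illustration only: it assumes comma-delimited input sorted by the key in the first field and exactly two rows per key, mirroring the svKeyCount/svCustValues derivations above.

Code: Select all

#!/bin/ksh

# Illustrative equivalent of the two-rows-per-key counter approach above.
# Assumes comma-delimited input on stdin, sorted by the key in field 1,
# with exactly two rows per key.
awk 'BEGIN{FS=",";OFS=",";}
{
   if ($1 == PVAL)
      CNT=CNT+1;        # same key as the previous row
   else
      CNT=1;            # new key: reset the counter

   if (CNT == 1)
      STR=$1 OFS $2;    # first row of the pair: start the output record
   else
      STR=STR OFS $2;   # second row: append the non-key value

   if (CNT == 2)
      print STR;        # output every second row

   PVAL=$1;
}'

exit 0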
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

DataStage might be faster on the vertical pivot, but on the horizontal pivot (even in parallel) awk is still 20 times faster. I have a case open with Ascential about the terrible performance of that stage. But that is a different issue: horizontal vs. vertical. Or maybe it isn't; since the stage is called just Pivot, it should support both flavors.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Ultramundane wrote:I believe Pavan originally stated that an awk script he wrote was too slow and that he wanted to code this in DataStage because it is faster?
No, I've re-read the original post. He only asked whether it is possible in DataStage. He did claim that awk was too slow, but made no claim about DataStage being faster.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I think one can reasonably deduce that the statement below did carry an implication that DataStage would be faster.

"I did this once using awk,but its too slow for huge volumes of data. So i want this to be done in datastage."
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Tie-breaker comment: I read it the same way as Ultramundane. Hopefully a user wouldn't be looking for a slower solution using DataStage :lol: therefore, DataStage must be the faster solution.

The proof is in the pudding, but I would hazard a guess that PX with parallel capabilities should tackle the pivot faster than awk. I might even go as far as betting a box of donuts on it, but just the glazed ones, not the chocolate with the sprinkles on top.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Ultramundane wrote:DataStage might be faster on the vertical pivot, but on the horizontal pivot (even in parallel) awk is still 20 times faster. I have a case open with Ascential about the terrible performance of that stage. But that is a different issue: horizontal vs. vertical. Or maybe it isn't; since the stage is called just Pivot, it should support both flavors.
Thanks for the tip. Vertical pivoting in the transformer stage has a small overhead in using the compiled stage. It is extremely efficient in that it only collects and holds a small amount of information in memory and refreshes that information with each key change. The pre-sorting does the hard work and a parallel sort is quite fast. It can deliver the rows to subsequent stages as it pivots which may give it the edge over awk.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

vmcburney wrote:Thanks for the tip. Vertical pivoting in the transformer stage has a small overhead in using the compiled stage. It is extremely efficient in that it only collects and holds a small amount of information in memory and refreshes that information with each key change. The pre-sorting does the hard work and a parallel sort is quite fast. It can deliver the rows to subsequent stages as it pivots which may give it the edge over awk.
I tried the method for the vertical pivot and the performance is about equal on my system. Awk was actually slightly faster, but that is probably because my system only has two CPUs. Others would likely see better performance with more nodes and CPUs.

I also wanted to come up with a solution for horizontal pivoting in DS without the Pivot stage. (I am sure there are many, but I am no expert.)

I found the solution below to perform about as well as awk, and about 20 times faster than the built-in Pivot stage, on my system.

Use one Source stage per column to be pivoted, each reading the key and that column. Send the output of each Source to its own intermediate Transformer to remap the column names and change data types if necessary. Send all output from the Transformers to an intermediate Funnel, then to an intermediate Sort, and finally to the Target.

Source -> Transformer --\
                          => Funnel -> Sort -> Target
Source -> Transformer --/
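For comparison, the same horizontal pivot can be sketched in awk: each input row (key in the first field, the remaining fields to be pivoted) produces one output row per non-key field. The comma delimiter and layout are assumptions for illustration.

Code: Select all

#!/bin/ksh

# Illustrative horizontal pivot: for each input row, emit one output row per
# non-key field, paired with the key in field 1. Assumes comma-delimited input.
awk 'BEGIN{FS=",";OFS=",";}
{
   for (i = 2; i <= NF; i++)
      print $1, $i;
}'

exit 0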