Page 1 of 1

Using Stage variables to compare 2 records

Posted: Sat Jun 26, 2010 10:05 am
by vskr72
I have a data from src like this:

Code: Select all

ColA, ColB 
 100, ABC
 100, ABC
 100, DEF
 100, DEF
I want to tag a new column like below:

Code: Select all

ColA, ColB, ColC
 100, ABC,   1
 100, ABC,   0
 100, DEF,   1
 100, DEF,   0
I know that this can be acheived through Sort stage with the CreateKeyChange option. There are more then 50 Mil records in Src, so, I was trying to avoid Sort stage. But, I would like to use the transformer stage to acheive this. Pls let me know if this is possible. Thanks,

Satish

Posted: Sat Jun 26, 2010 2:04 pm
by ray.wurlod
You can not avoid sorting data unless the data are already sorted, because you must first have the keys grouped. Unless you want to compare "this record with the previous record" irrespective of keys.

In either case, set up two stage variables. One compares the current row with the previous row, and the other (later in the grid) stores the value of the current row which becomes the "previous" row when the next row is being processed. You also need to handle the special case of row #1, which is always different from the "previous" row because the latter does not exist.

Posted: Sat Jun 26, 2010 2:18 pm
by Sainath.Srinivasan
How many unique values do you expect ?

You can achieve the result directly from your database (if it comes from one).

There are other ways if it is from non-db type.

Posted: Sat Jun 26, 2010 2:38 pm
by chulett
Still, based on your example it will need to be sorted by something, be it the Sort stage or your source (if a database) or something on the command line. If you do it in your source database, then you can set the stage to "Don't sort, already sorted" but still use it to create the column you need.

Posted: Sat Jun 26, 2010 3:06 pm
by Sainath.Srinivasan
What I meant was, if the db can handle, we can obtain the whole result from one query. Nothing to be done in DataStage.

Posted: Sat Jun 26, 2010 4:08 pm
by chulett
True, if the proper analytical functions available, but it's simple enough to do it in the tool as well.

Posted: Sat Jun 26, 2010 4:17 pm
by vskr72
The reason, I do not want to handle the comparision using DB is the volume is huge. Day 0 load will be close to 80 mil and every 5 mil incrementally. But, I will sort it using DB. Will try using Stage variabls and see how it goes.

Posted: Sun Jun 27, 2010 3:37 am
by ray.wurlod
If the column(s) by which the data are to be sorted or indexed using a B-tree index, then sorting is at no cost because a B-tree index is stored in sorted order.