comparison between 2 rows

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Bilwakunj
Participant
Posts: 59
Joined: Fri Sep 10, 2004 7:00 am

comparison between 2 rows

Post by Bilwakunj »

Hi,
I've a data like this:
case 11
Date1=11/7/2004
Date2 = 11/19/2004
case 11
Date1=11/17/2004
Date2 = 11/29/2004
case 11
Date1=12/7/2004
Date2 = 10/19/2004
case 22
Date1=02/7/2003
Date2 = 10/19/2004

This is how the data is present in the dataset. Now I need to compare the Date1 of each record with the Date1 of the remaining records to find the earliest one, but only among records with the same case number; when the case number changes, a fresh comparison should start among the records of that new case.
I'm thinking of first dumping this data into an Oracle table and comparing the rows in an array-like manner with the PL/SQL INDEX BY table method. But then I pay the penalty of the I/O operation, and I also lose the parallel processing. Is there a better approach?
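To make the requirement concrete, here is a plain in-memory sketch (Python, not DataStage) of the result being asked for: the earliest Date1 within each case number. The field names and the MM/DD/YYYY date format are assumptions taken from the sample data above.

```python
from datetime import datetime

# Sample rows modelled on the data in the post; "case" and "date1"
# are assumed field names.
rows = [
    {"case": 11, "date1": "11/7/2004"},
    {"case": 11, "date1": "11/17/2004"},
    {"case": 11, "date1": "12/7/2004"},
    {"case": 22, "date1": "02/7/2003"},
]

earliest = {}
for row in rows:
    d = datetime.strptime(row["date1"], "%m/%d/%Y")
    c = row["case"]
    # Keep the smallest Date1 seen so far for this case number.
    if c not in earliest or d < earliest[c]:
        earliest[c] = d
```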

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

At a quick glance this is something where the Aggregator stage might be the one you need. Read the help or manual for Aggregator stage and see whether you agree.
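As a rough sketch of what an Aggregator stage would do here: group the rows by case and take Min(Date1) as the aggregate derivation. The data and column names below are illustrative, not from a real job.

```python
from datetime import datetime
from itertools import groupby

rows = [
    (11, datetime(2004, 11, 7)),
    (11, datetime(2004, 11, 17)),
    (11, datetime(2004, 12, 7)),
    (22, datetime(2003, 2, 7)),
]

rows.sort(key=lambda r: r[0])          # grouping needs key-sorted input
# One output row per case, carrying the minimum (earliest) date.
earliest = {case: min(d for _, d in grp)
            for case, grp in groupby(rows, key=lambda r: r[0])}
```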
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

Sounds like an interesting challenge. You will most likely need to split the process into two separate streams, obtain the necessary information on the alternative stream, and then match the resulting information back with the main stream.

Ray's suggestion of the Aggregator stage is my suggestion too.
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

Whoops. Duplicate posts. :oops:
Last edited by T42 on Sun Mar 27, 2005 12:14 pm, edited 1 time in total.
Inquisitive
Charter Member
Charter Member
Posts: 88
Joined: Tue Jan 13, 2004 3:07 pm

Post by Inquisitive »

It is quite interesting. My solution would be:

If your file is going to be uniform, as you mentioned, with the same structure all along, use the sed command to organise it as follows:

Case 11 Date1 = <value> Date2 = <value>

i.e. concatenate each group of 3 rows into a single row.

Then read it as a sequential file and use the Aggregator stage to find the Min (earliest) date for each case, with a group-by clause.

This is along the same lines as Ray's suggestion.
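A hypothetical sketch of the reshaping step that the sed command would perform: collapse each group of three input lines (case / Date1 / Date2) into one record, so a flat-file read sees one row per case entry. The sample lines are taken from the original post.

```python
lines = [
    "case 11", "Date1=11/7/2004", "Date2 = 11/19/2004",
    "case 22", "Date1=02/7/2003", "Date2 = 10/19/2004",
]

# Join every run of three lines into a single space-separated record.
records = [" ".join(lines[i:i + 3]) for i in range(0, len(lines), 3)]
```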
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

One other solution:

Use:

[Sort] -> [Transformer]

Sort the key in order of case (ascending), and date (ascending).

Use a Transformer stage variable to retain the current key. Initialize it to 0 (or some value that will never occur), and compare it with the case. If it's different, store the new value AND reset the date value to the current value. Otherwise, don't change the date value.

Meanwhile, pass on the date as the 'lowest' date to some field on the output. After all, the first record for every case has the lowest date you wanted (thanks to the sorting you did).

Since the Aggregator stage requires sorted input (and WILL sort it for you unless you use a Sort stage and tell it NOT to sort), this solution will most likely be faster.
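The stage-variable logic above can be sketched in Python, assuming the input is already sorted by case (ascending) and Date1 (ascending). Here `prev_case` plays the role of the stage variable: a record carries the earliest date of its case exactly when its case differs from `prev_case`.

```python
from datetime import datetime

# Input already sorted by (case, date1); values are illustrative.
sorted_rows = [
    (11, datetime(2004, 11, 7)),
    (11, datetime(2004, 11, 17)),
    (22, datetime(2003, 2, 7)),
]

prev_case = None        # stage variable, initialised to an impossible value
flagged = []
for case, date1 in sorted_rows:
    is_earliest = (case != prev_case)   # first row of a new case group
    flagged.append((case, date1, is_earliest))
    prev_case = case                    # retain the current key
```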
Bilwakunj
Participant
Posts: 59
Joined: Fri Sep 10, 2004 7:00 am

Post by Bilwakunj »

Thanks guys for your valuable help!!!
But I couldn't follow this solution very clearly. Can you please explain it with some minimal code?

T42 wrote:One other solution:

Use:

[Sort] -> [Transformer]

Sort the key in order of case (ascending), and date (ascending).

Use a Transformer stage variable to retain the current key. Initialize it to 0 (or some value that will never occur), and compare it with the case. If it's different, store the new value AND reset the date value to the current value. Otherwise, don't change the date value.

Meanwhile, pass on the date as the 'lowest' date to some field on the output. After all, the first record for every case has the lowest date you wanted (thanks to the sorting you did).

Since the Aggregator stage requires sorted input (and WILL sort it for you unless you use a Sort stage and tell it NOT to sort), this solution will most likely be faster.
davidnemirovsky
Participant
Posts: 85
Joined: Fri Jun 04, 2004 2:30 am
Location: Melbourne, Australia
Contact:

Post by davidnemirovsky »

I would go with T42's sort solution. The conceptual design makes sense to me, and T42's description should be all you need to develop this solution. If T42's solution doesn't make sense, then I would suggest some training (ask Ascential Software).
Cheers,
Dave Nemirovsky
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Isn't this a straightforward remove-duplicates problem? Sort the data by case and date, then send it through the Remove Duplicates stage with case as the key, keeping the FIRST record.

The stage-variable approach is good if you need to output all records and set a flag field indicating the earliest record. The Remove Duplicates stage is good if you only want to keep the earliest record.
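A minimal sketch of that sort + Remove Duplicates approach: sort by (case, date) ascending, then keep only the FIRST record seen for each case. Tuples stand in for dataset records here; the field layout is an assumption.

```python
from datetime import datetime

rows = [
    (11, datetime(2004, 12, 7)),
    (11, datetime(2004, 11, 7)),
    (22, datetime(2003, 2, 7)),
]

rows.sort()             # sort by case, then by date within each case
kept = {}
for case, date1 in rows:
    if case not in kept:            # FIRST record per case wins
        kept[case] = date1
```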
Post Reply