comparison between 2 rows

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Bilwakunj
Participant
Posts: 59
Joined: Fri Sep 10, 2004 7:00 am

comparison between 2 rows

Post by Bilwakunj »

Hi,
I've a data like this:
case 11
Date1=11/7/2004
Date2 = 11/19/2004
case 11
Date1=11/17/2004
Date2 = 11/29/2004
case 11
Date1=12/7/2004
Date2 = 10/19/2004
case 22
Date1=02/7/2003
Date2 = 10/19/2004

This is how the data is present in the dataset. Now I need to compare the Date1 of each record with the Date1 of the remaining records to find the earliest one, but only among records with the same case number; when the case number changes, a fresh comparison should start among the records of that new case.
I'm thinking of first dumping this data into an Oracle table and comparing the rows in an array-like manner with the PL/SQL INDEX BY table method. But then I pay the penalty of the I/O operation, and I also lose the parallel processing. Is there a better approach?
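To make the requirement concrete, here is a plain in-memory sketch (Python, not DataStage) of the result being asked for: the earliest Date1 within each case number. The field names and the MM/DD/YYYY date format are assumptions taken from the sample data above.

```python
from datetime import datetime

# Sample rows modelled on the data in the post; "case" and "date1"
# are assumed field names.
rows = [
    {"case": 11, "date1": "11/7/2004"},
    {"case": 11, "date1": "11/17/2004"},
    {"case": 11, "date1": "12/7/2004"},
    {"case": 22, "date1": "02/7/2003"},
]

earliest = {}
for row in rows:
    d = datetime.strptime(row["date1"], "%m/%d/%Y")
    c = row["case"]
    # Keep the smallest Date1 seen so far for this case number.
    if c not in earliest or d < earliest[c]:
        earliest[c] = d
```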

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

At a quick glance this is something where the Aggregator stage might be the one you need. Read the help or manual for Aggregator stage and see whether you agree.
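As a rough sketch of what an Aggregator stage would do here: group the rows by case and take Min(Date1) as the aggregate derivation. The data and column names below are illustrative, not from a real job.

```python
from datetime import datetime
from itertools import groupby

rows = [
    (11, datetime(2004, 11, 7)),
    (11, datetime(2004, 11, 17)),
    (11, datetime(2004, 12, 7)),
    (22, datetime(2003, 2, 7)),
]

rows.sort(key=lambda r: r[0])          # grouping needs key-sorted input
# One output row per case, carrying the minimum (earliest) date.
earliest = {case: min(d for _, d in grp)
            for case, grp in groupby(rows, key=lambda r: r[0])}
```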
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

Sounds like an interesting challenge. You will most likely need to split the process into two separate streams, obtain the necessary information on the alternative stream, and then match the resulting information back with the main stream.

Ray's suggestion of the Aggregator stage is my suggestion too.
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

Whoops. Duplicate posts. :oops:
Last edited by T42 on Sun Mar 27, 2005 12:14 pm, edited 1 time in total.
Inquisitive
Charter Member
Charter Member
Posts: 88
Joined: Tue Jan 13, 2004 3:07 pm

Post by Inquisitive »

It is quite interesting. My solution would be:

If your file is going to be uniform, as you mentioned, with the same structure all along, use the sed command to organise it as follows:

Case 11 Date1 = <value> Date2 = <value>

i.e. concatenate each group of 3 rows into a single row.

Then read it as a sequential file and use the Aggregator stage to find the Min (earliest) date for each case, with a group-by clause.

This is along the same lines as Ray's suggestion.
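A hypothetical sketch of the reshaping step that the sed command would perform: collapse each group of three input lines (case / Date1 / Date2) into one record, so a flat-file read sees one row per case entry. The sample lines are taken from the original post.

```python
lines = [
    "case 11", "Date1=11/7/2004", "Date2 = 11/19/2004",
    "case 22", "Date1=02/7/2003", "Date2 = 10/19/2004",
]

# Join every run of three lines into a single space-separated record.
records = [" ".join(lines[i:i + 3]) for i in range(0, len(lines), 3)]
```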
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

One other solution:

Use:

[Sort] -> [Transformer]

Sort the key in order of case (ascending), and date (ascending).

Use a Transformer stage variable to retain the current key. Initialize it to 0 (or some value that will never occur), and compare it with the case. If it's different, store the new value AND reset the date value to the current value. Otherwise, don't change the date value.

Meanwhile, pass on the date as the 'lowest' date to some field on the output. After all, the first record for every case has the lowest date you wanted (thanks to the sorting you did).

Since the Aggregator stage requires sorted input (and WILL sort it for you unless you use a Sort stage and tell it NOT to sort), this solution will most likely be faster.
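The stage-variable logic above can be sketched in Python, assuming the input is already sorted by case (ascending) and Date1 (ascending). Here `prev_case` plays the role of the stage variable: a record carries the earliest date of its case exactly when its case differs from `prev_case`.

```python
from datetime import datetime

# Input already sorted by (case, date1); values are illustrative.
sorted_rows = [
    (11, datetime(2004, 11, 7)),
    (11, datetime(2004, 11, 17)),
    (22, datetime(2003, 2, 7)),
]

prev_case = None        # stage variable, initialised to an impossible value
flagged = []
for case, date1 in sorted_rows:
    is_earliest = (case != prev_case)   # first row of a new case group
    flagged.append((case, date1, is_earliest))
    prev_case = case                    # retain the current key
```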
Bilwakunj
Participant
Posts: 59
Joined: Fri Sep 10, 2004 7:00 am

Post by Bilwakunj »

Thanks guys for your valuable help!!!
But I couldn't follow this solution very clearly. Can you please explain it with some minimal code?

T42 wrote:One other solution:

Use:

[Sort] -> [Transformer]

Sort the key in order of case (ascending), and date (ascending).

Use a Transformer stage variable to retain the current key. Initialize it to 0 (or some value that will never occur), and compare it with the case. If it's different, store the new value AND reset the date value to the current value. Otherwise, don't change the date value.

Meanwhile, pass on the date as the 'lowest' date to some field on the output. After all, the first record for every case has the lowest date you wanted (thanks to the sorting you did).

Since the Aggregator stage requires sorted input (and WILL sort it for you unless you use a Sort stage and tell it NOT to sort), this solution will most likely be faster.
davidnemirovsky
Participant
Posts: 85
Joined: Fri Jun 04, 2004 2:30 am
Location: Melbourne, Australia
Contact:

Post by davidnemirovsky »

I would go with T42's sort solution. The conceptual design makes sense to me, and T42's description should be all you need to develop this solution. If T42's solution doesn't make sense, then I would suggest some training (ask Ascential Software).
Cheers,
Dave Nemirovsky
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Isn't this a straightforward remove-duplicates problem? Sort the data by case and date, then send it through the Remove Duplicates stage with case as the key, keeping the FIRST record.

The stage-variable approach is good if you need to output all records and set a flag field indicating the earliest record. The Remove Duplicates stage is good if you only want to keep the earliest record.
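A minimal sketch of that sort + Remove Duplicates approach: sort by (case, date) ascending, then keep only the FIRST record seen for each case. Tuples stand in for dataset records here; the field layout is an assumption.

```python
from datetime import datetime

rows = [
    (11, datetime(2004, 12, 7)),
    (11, datetime(2004, 11, 7)),
    (22, datetime(2003, 2, 7)),
]

rows.sort()             # sort by case, then by date within each case
kept = {}
for case, date1 in rows:
    if case not in kept:            # FIRST record per case wins
        kept[case] = date1
```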
Post Reply