comparison between 2 rows
Hi,
I have data like this:
case 11
Date1=11/7/2004
Date2 = 11/19/2004
case 11
Date1=11/17/2004
Date2 = 11/29/2004
case 11
Date1=12/7/2004
Date2 = 10/19/2004
case 22
Date1=02/7/2003
Date2 = 10/19/2004
This is how the data appears in the dataset. Now I need to compare the Date1 of each record with the Date1 of the remaining records to find the earliest one, but only among records that share the same case number; when the case number changes, a fresh comparison should start among the records of that new case.
I'm thinking of first dumping this data into an Oracle table and comparing the rows in an array-like manner with a PL/SQL INDEX BY table. But that approach carries the penalty of extra I/O, and I also lose the "parallel processing". Is there a better approach?
Thanks in advance.
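For concreteness, the earliest-Date1-per-case requirement can be sketched from the command line with awk, assuming exactly the file layout shown above (the file name is made up):

```shell
# Sample data in the layout described above (hypothetical file name).
cat > /tmp/cases.txt <<'EOF'
case 11
Date1=11/7/2004
Date2 = 11/19/2004
case 11
Date1=11/17/2004
Date2 = 11/29/2004
case 11
Date1=12/7/2004
Date2 = 10/19/2004
case 22
Date1=02/7/2003
Date2 = 10/19/2004
EOF

awk -F'[= ]+' '
  /^case/  { c = $2 }                        # remember the current case number
  /^Date1/ {                                 # build a sortable yyyymmdd key from m/d/yyyy
    split($2, d, "/")
    key = d[3] * 10000 + d[1] * 100 + d[2]
    if (!(c in min) || key < min[c]) { min[c] = key; earliest[c] = $2 }
  }
  END { for (k in min) print "case " k ": earliest Date1 = " earliest[k] }
' /tmp/cases.txt | sort
# case 11: earliest Date1 = 11/7/2004
# case 22: earliest Date1 = 02/7/2003
```

This is only a sketch of the logic, not a substitute for doing it inside the ETL tool; the trailing sort just makes the per-case output order deterministic.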
At a quick glance this is something where the Aggregator stage might be the one you need. Read the help or manual for Aggregator stage and see whether you agree.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sounds like an interesting challenge. You will most likely need to split the process into two separate streams, derive the information you need on the secondary stream, and then match the result back against the main stream.
Ray's suggestion of the Aggregator stage is my suggestion too.
It is quite interesting. My solution would be:
If your file is going to be uniform as you mentioned, with the same structure all along, use the sed command to organise it as follows:
Case 11 Date1 = <value> Date2 = <value>
i.e. concatenate each group of 3 rows into a single row.
Then read it as a sequential file and use the Aggregator stage to find the earliest (Min) date for each case, grouping by case.
Along the same lines as Ray's suggestion.
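The sed reshaping described above can be sketched like this (file name is hypothetical); 'N;N' pulls each group of three lines into the pattern space, and the embedded newlines are then replaced with spaces:

```shell
# Hypothetical input file in the three-lines-per-record layout.
cat > /tmp/cases.txt <<'EOF'
case 11
Date1=11/7/2004
Date2 = 11/19/2004
case 22
Date1=02/7/2003
Date2 = 10/19/2004
EOF

# Join every group of three lines into one row.
sed 'N;N;s/\n/ /g' /tmp/cases.txt
# case 11 Date1=11/7/2004 Date2 = 11/19/2004
# case 22 Date1=02/7/2003 Date2 = 10/19/2004
```

The flattened output can then be read as a single-row-per-record sequential file.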
One other solution:
Use:
[Sort] -> [Transformer]
Sort by case (ascending), then date (ascending).
Use a Transformer stage variable to retain the current key. Initialize it to 0 (or some value that will never occur) and compare it with the case. If it differs, store the new value AND reset the date value to the current row's date; otherwise, leave the date value unchanged.
Meanwhile, pass that date on as the 'lowest' date in some output field. After all, the first record for every case has the lowest date you wanted (thanks to the sorting you did).
Since the Aggregator stage requires sorting (and WILL do it for you unless you use a Sort stage and tell it NOT to sort), this solution will most likely be faster.
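The sorted-input, first-row-per-key idea can be sketched outside DataStage with sort and awk, assuming a flattened case,date layout with the dates already in a sortable yyyy-mm-dd form (file name and layout are assumptions, standing in for the Sort stage's output):

```shell
# Hypothetical flattened input: case,date1 with sortable dates.
cat > /tmp/flat.csv <<'EOF'
11,2004-11-17
11,2004-11-07
11,2004-12-07
22,2003-02-07
EOF

# Sort by case then date; the first row seen for each case is the earliest.
sort -t, -k1,1n -k2,2 /tmp/flat.csv |
awk -F, 'prev != $1 { print; prev = $1 }'
# 11,2004-11-07
# 22,2003-02-07
```

The awk variable `prev` plays the role of the Transformer stage variable: it only lets a row through when the case number changes.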
Thanks guys for your valuable help!!!
But I couldn't follow this solution very clearly.. can you please explain it with some minimal code?
T42 wrote:One other solution:
Use:
[Sort] -> [Transformer] ...
Isn't this a straightforward remove-duplicates problem? Sort the data by case and date, then send it through the Remove Duplicates stage with case as the key, keeping the FIRST record.
The stage variable approach is good if you need to output all records and set a flag field indicating the earliest record. The remove duplicates stage is good if you only want to keep the earliest record.
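The flag-field variant can likewise be sketched with sort and awk, again assuming a hypothetical flattened case,date layout with dates in a sortable form:

```shell
# Hypothetical flattened input: case,date1 with sortable dates.
cat > /tmp/flat.csv <<'EOF'
11,2004-11-17
11,2004-11-07
11,2004-12-07
22,2003-02-07
EOF

# Keep every record but flag the earliest row per case; the flag
# flips to Y whenever the case number changes, like the stage variable.
sort -t, -k1,1n -k2,2 /tmp/flat.csv |
awk -F, '{ print $0 "," (($1 != prev) ? "Y" : "N"); prev = $1 }'
# 11,2004-11-07,Y
# 11,2004-11-17,N
# 11,2004-12-07,N
# 22,2003-02-07,Y
```

Dropping the unflagged rows instead of printing them gives the Remove Duplicates behaviour.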
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn