Get records with MAXIMUM value in a timestamp column
Posted: Tue Oct 30, 2012 1:09 am
Hi there ,
I am having a set of data with three columns :
1. MEMBER (varchar)
2. PHONE (varchar)
3. UPDATE_TIME (timestamp)
The data looks something like this below :
M1 P1 T1
M1 P1 T7
M1 P1 T1
M1 P1 T1
M2 P2 T8
M2 P2 T2
M2 P2 T6
M2 P2 T2
where M's stand for MEMBER, P's stand for PHONE and T's stand for UPDATE_TIME and T1>T2>T6>T7>T8
What is expected is to create a Datastage job(no other tool like PL/SQL etc. will be accepted by business) that will implement the following logic:
1. For EACH MEMBER-PHONE combination, it will get ALL Records which has UPDATE_TIME=MAX of UPDATE_TIME for that MEMBER-PHONE combination, and insert these records in one table
2. Any other records thereafter, will be considered as DUPLICATE's , and for each of these records there will be another column called DUPCHECK which will be set to "DUPLICATE"
The outcome for the above mentioned data will be something like
Good records:
M1 P1 T1
M1 P1 T1
M1 P1 T1
M2 P2 T2
M2 P2 T2
Duplicate records:
M1 P1 T7 DUPLICATE
M2 P2 T6 DUPLICATE
M2 P2 T8 DUPLICATE
Please note, we needs to capture all records, none of them could be dropped/rejected, and I have already used the Remove Duplicate stage with Join stage to implement this logic,
but it is taking a hell lot of time.
1.Cant this logic be acheived using SORT stage only
2.There will be millions of data , so a very optimized approach will be highly appreciated.
Thanks
Kumarjit.
I am having a set of data with three columns :
1. MEMBER (varchar)
2. PHONE (varchar)
3. UPDATE_TIME (timestamp)
The data looks something like this below :
M1 P1 T1
M1 P1 T7
M1 P1 T1
M1 P1 T1
M2 P2 T8
M2 P2 T2
M2 P2 T6
M2 P2 T2
where M's stand for MEMBER, P's stand for PHONE and T's stand for UPDATE_TIME and T1>T2>T6>T7>T8
What is expected is to create a Datastage job(no other tool like PL/SQL etc. will be accepted by business) that will implement the following logic:
1. For EACH MEMBER-PHONE combination, it will get ALL Records which has UPDATE_TIME=MAX of UPDATE_TIME for that MEMBER-PHONE combination, and insert these records in one table
2. Any other records thereafter, will be considered as DUPLICATE's , and for each of these records there will be another column called DUPCHECK which will be set to "DUPLICATE"
The outcome for the above mentioned data will be something like
Good records:
M1 P1 T1
M1 P1 T1
M1 P1 T1
M2 P2 T2
M2 P2 T2
Duplicate records:
M1 P1 T7 DUPLICATE
M2 P2 T6 DUPLICATE
M2 P2 T8 DUPLICATE
Please note, we needs to capture all records, none of them could be dropped/rejected, and I have already used the Remove Duplicate stage with Join stage to implement this logic,
but it is taking a hell lot of time.
1.Cant this logic be acheived using SORT stage only
2.There will be millions of data , so a very optimized approach will be highly appreciated.
Thanks
Kumarjit.