Maintain sort order in datastage
Moderators: chulett, rschirm, roy
Maintain sort order in datastage
Hi,
I have a question regarding the order of records in a file and how the order is handled by ETL.
I have a ETL job which uses a file as source and the job has to remove duplicate based on the account number column and retain the last record. Records which have the same account number carry different information for other columns, so it is important to retain last record and store it in tables.
Is there a way that I can make sure that always the last record (as present in the file) is selected? If I sort the input records first and apply a hash partition using account number as key will it take care of preserving the order or will the ETL retain any record of its choice? Can this be controlled by the ETL code? Is this correct?
Thanks in advance,
Vij
I have a question regarding the order of records in a file and how the order is handled by ETL.
I have a ETL job which uses a file as source and the job has to remove duplicate based on the account number column and retain the last record. Records which have the same account number carry different information for other columns, so it is important to retain last record and store it in tables.
Is there a way that I can make sure that always the last record (as present in the file) is selected? If I sort the input records first and apply a hash partition using account number as key will it take care of preserving the order or will the ETL retain any record of its choice? Can this be controlled by the ETL code? Is this correct?
Thanks in advance,
Vij
-
- Premium Member
- Posts: 36
- Joined: Wed Dec 27, 2006 3:50 am
- Location: Phoenix
yes you can easily control the sorting order in an ETL job.
You can use sort stage for sorting as well as retaining the last record. But before that you need to know which record comes in the last. Consider and example:
EMPLOYEE_NO DEPT_ID
117514 123
117514 456
Now you have to see which record you need to consider, Employee with DEPT_ID 123 or 456. So you need another sorting key which will decide the last record. Suppose you want the employee with 456 to be in the output then in the sort stage put EMPLOYEE_ID---SORT(Ascending) and DEPT_ID-SORT(descending) and HASH partition on Only EMPLOYEE_ID. This will ensure same employee in the same partition and EMPLOYEE with the highest value of DEPT_ID in the first row. In the sort stage you can specify Allow Diplicates to tru and Stable Sort to True. this will ensure out of the duplicates on the EMPLOYEE_ID first record is retained.
Just apply this situation with yours and check out.
Thanks
Anand..
You can use sort stage for sorting as well as retaining the last record. But before that you need to know which record comes in the last. Consider and example:
EMPLOYEE_NO DEPT_ID
117514 123
117514 456
Now you have to see which record you need to consider, Employee with DEPT_ID 123 or 456. So you need another sorting key which will decide the last record. Suppose you want the employee with 456 to be in the output then in the sort stage put EMPLOYEE_ID---SORT(Ascending) and DEPT_ID-SORT(descending) and HASH partition on Only EMPLOYEE_ID. This will ensure same employee in the same partition and EMPLOYEE with the highest value of DEPT_ID in the first row. In the sort stage you can specify Allow Diplicates to tru and Stable Sort to True. this will ensure out of the duplicates on the EMPLOYEE_ID first record is retained.
Just apply this situation with yours and check out.
Thanks
Anand..
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
No.
Either you need a timestamp to findout which is the last record or do the following.
Sequential read (no multiple node and no multiple readers) then a sequential transformer which should assign a count to each record. after that you are free to sort it on account number and the derived count column.
Either you need a timestamp to findout which is the last record or do the following.
Sequential read (no multiple node and no multiple readers) then a sequential transformer which should assign a count to each record. after that you are free to sort it on account number and the derived count column.
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
Genius may have its limitations, but stupidity is not thus handicapped.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Beware of doing that. DataStage will insert a tsort operator that may thwart your best-laid plans. Always use a Sort stage to specify "don't sort, previously sorted" if that's what you need to do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact: