Maintain sort order in datastage

Posted: Tue Sep 08, 2009 9:36 pm
by vij
Hi,

I have a question regarding the order of records in a file and how the order is handled by ETL.

I have an ETL job which uses a file as its source. The job has to remove duplicates based on the account number column and retain the last record. Records with the same account number carry different information in the other columns, so it is important to retain the last record and store it in the tables.

Is there a way I can make sure that the last record (as it appears in the file) is always the one selected? If I sort the input records first and apply a hash partition using account number as the key, will that preserve the order, or will the ETL retain any record of its choice? Can this be controlled in the ETL code?

Thanks in advance,
Vij

Posted: Tue Sep 08, 2009 10:23 pm
by kiran259
Hash partitioning changes the sort order based on the hash algorithm. Same partitioning maintains the sort order into the next stage.

Posted: Tue Sep 08, 2009 10:25 pm
by anand_dafaria
Yes, you can easily control the sort order in an ETL job.
You can use a Sort stage both for sorting and for retaining the last record. But before that you need to know which record comes last. Consider an example:
EMPLOYEE_NO DEPT_ID
117514 123
117514 456


Now you have to decide which record to keep: the employee row with DEPT_ID 123 or the one with 456. So you need another sort key which will decide the last record. Suppose you want the employee with 456 in the output. Then in the Sort stage put EMPLOYEE_NO---Sort (Ascending) and DEPT_ID---Sort (Descending), and hash partition on EMPLOYEE_NO only. This will ensure the same employee lands in the same partition, with the highest value of DEPT_ID in the first row. In the Sort stage you can set Allow Duplicates to True and Stable Sort to True; this will ensure that, out of the duplicates on EMPLOYEE_NO, the first record is retained.
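To make the idea above concrete, here is a minimal sketch in Python (an illustration only, not DataStage code): sort ascending on EMPLOYEE_NO and descending on DEPT_ID, then keep the first row per EMPLOYEE_NO, which is the one with the highest DEPT_ID. Column names are taken from the example above.

```python
# Illustration only (Python, not DataStage). Mimics: sort by
# EMPLOYEE_NO ascending and DEPT_ID descending, then retain the
# first record per EMPLOYEE_NO (the one with the highest DEPT_ID).
rows = [
    {"EMPLOYEE_NO": 117514, "DEPT_ID": 123},
    {"EMPLOYEE_NO": 117514, "DEPT_ID": 456},
]

# Two-key sort: ascending employee number, descending department id.
rows.sort(key=lambda r: (r["EMPLOYEE_NO"], -r["DEPT_ID"]))

# Keep only the first row seen for each EMPLOYEE_NO.
deduped = []
seen = set()
for r in rows:
    if r["EMPLOYEE_NO"] not in seen:
        seen.add(r["EMPLOYEE_NO"])
        deduped.append(r)

print(deduped)  # [{'EMPLOYEE_NO': 117514, 'DEPT_ID': 456}]
```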

Just apply this to your situation and check it out.

Thanks
Anand..

Posted: Tue Sep 08, 2009 11:08 pm
by vij
hi,

thanks for the quick response, I appreciate it.

In your example there are two columns, EMPLOYEE_NO and DEPT_ID. But in my case I have only the account number column, and I need to maintain the order in which the records arrive from the UNIX file.

Is it possible?

Thanks again,
vij

Posted: Wed Sep 09, 2009 2:34 am
by Sainath.Srinivasan
Stable sort !!??

Posted: Wed Sep 09, 2009 3:56 am
by priyadarshikunal
No.

Either you need a timestamp to find out which is the last record, or do the following:

Read the file sequentially (no multiple nodes and no multiple readers), then use a Transformer running in sequential mode to assign a count to each record. After that you are free to sort on the account number and the derived count column.
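The two steps above can be sketched in Python (an illustration only, not DataStage code; column names are invented for the example): a sequential pass tags each record with its position, then sorting on account number plus that count makes "last record in the file" well defined.

```python
# Illustration only (Python, not DataStage): tag each record with its
# original position while reading sequentially, then dedupe by account
# number keeping the record with the highest sequence number.
records = [
    {"account": "A1", "amount": 10},
    {"account": "A2", "amount": 20},
    {"account": "A1", "amount": 30},  # last A1 record in file order: must win
]

# Step 1: the sequential pass assigns a row count (the Transformer step).
for seq, rec in enumerate(records):
    rec["seq"] = seq

# Step 2: sort on account number and the derived count, then retain
# the last record per account.
records.sort(key=lambda r: (r["account"], r["seq"]))
last_per_account = {}
for rec in records:
    last_per_account[rec["account"]] = rec  # later seq overwrites earlier

print(sorted(last_per_account.values(), key=lambda r: r["account"]))
```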

Posted: Wed Sep 09, 2009 5:26 am
by chulett
You can't "sort" if you need to maintain the original file order. Are your account numbers already sorted properly, with all like values contiguous? If so, I'd think the Remove Duplicates stage should do the trick.
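For what "contiguous like values" buys you, here is a minimal Python sketch (an illustration, not DataStage code; column names are invented): when equal keys are already adjacent, a single pass can keep the last record of each run, which is the effect of a Remove Duplicates stage set to retain the last record.

```python
# Illustration only (Python): when like key values are already
# contiguous, one pass keeps the last record of each run of equal
# account numbers -- no full sort needed.
from itertools import groupby

records = [
    {"account": "A1", "amount": 10},
    {"account": "A1", "amount": 30},
    {"account": "A2", "amount": 20},
]

# groupby only groups adjacent equal keys, which is exactly why the
# input must already have like values contiguous.
kept = [list(group)[-1] for _, group in groupby(records, key=lambda r: r["account"])]
print(kept)
```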

Posted: Wed Sep 09, 2009 8:29 pm
by ray.wurlod
Beware of doing that. DataStage will insert a tsort operator that may thwart your best-laid plans. Always use a Sort stage to specify "don't sort, previously sorted" if that's what you need to do.

Posted: Wed Sep 09, 2009 9:04 pm
by chulett
... or add the APT variable that shuts down sneaky tsort insertions, yes?

Posted: Wed Sep 09, 2009 9:05 pm
by ray.wurlod
Yes.

But you're not allowed to lie. DataStage checks at run time!

Posted: Wed Sep 09, 2009 9:06 pm
by chulett
True. :wink: