Maintain sort order in datastage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Maintain sort order in datastage

Post by vij »

Hi,

I have a question regarding the order of records in a file and how the order is handled by ETL.

I have a ETL job which uses a file as source and the job has to remove duplicate based on the account number column and retain the last record. Records which have the same account number carry different information for other columns, so it is important to retain last record and store it in tables.

Is there a way that I can make sure that always the last record (as present in the file) is selected? If I sort the input records first and apply a hash partition using account number as key will it take care of preserving the order or will the ETL retain any record of its choice? Can this be controlled by the ETL code? Is this correct?

Thanks in advance,
Vij
kiran259
Participant
Posts: 48
Joined: Thu Aug 16, 2007 11:17 pm
Location: United States
Contact:

Post by kiran259 »

Hash partition changes the sort order basing on the hash algorithm.Same partitioning maintains the sort order to the next stage.
Kiran Vaduguri

As soon as the fear approaches near, attack and destroy it.
anand_dafaria
Premium Member
Premium Member
Posts: 36
Joined: Wed Dec 27, 2006 3:50 am
Location: Phoenix

Post by anand_dafaria »

yes you can easily control the sorting order in an ETL job.
You can use sort stage for sorting as well as retaining the last record. But before that you need to know which record comes in the last. Consider and example:
EMPLOYEE_NO DEPT_ID
117514 123
117514 456


Now you have to see which record you need to consider, Employee with DEPT_ID 123 or 456. So you need another sorting key which will decide the last record. Suppose you want the employee with 456 to be in the output then in the sort stage put EMPLOYEE_ID---SORT(Ascending) and DEPT_ID-SORT(descending) and HASH partition on Only EMPLOYEE_ID. This will ensure same employee in the same partition and EMPLOYEE with the highest value of DEPT_ID in the first row. In the sort stage you can specify Allow Diplicates to tru and Stable Sort to True. this will ensure out of the duplicates on the EMPLOYEE_ID first record is retained.

Just apply this situation with yours and check out.

Thanks
Anand..
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

hi,

thanks for the quick response, I appreciate it.

As per your example, there are 2 columns - EMPLOYEE_NO and DEPT_ID. But in my case, I have only the account number column and need to maintain the sort order as I receive from the UNIX file.

Is it possible?

Thanks again,
vij
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Stable sort !!??
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

No.

Either you need a timestamp to findout which is the last record or do the following.

Sequential read (no multiple node and no multiple readers) then a sequential transformer which should assign a count to each record. after that you are free to sort it on account number and the derived count column.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You can't "sort" if you need to maintain the original file order. Are your account numbers already sorted properly, all like values contiguous? If so, I'd think the Remove Duplicates stage should do the trick.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Beware of doing that. DataStage will insert a tsort operator that may thwart your best-laid plans. Always use a Sort stage to specify "don't sort, previously sorted" if that's what you need to do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

... or add the APT variable that shuts down sneaky tsort insertions, yes?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes.

But you're not allowed to lie. DataStage checks at run time!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

True. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply