Is Sort stage before Remove-Duplicate stage mandatory?

Amit Jaiswal
Premium Member
Posts: 38
Joined: Fri Apr 22, 2005 6:07 am

Post by Amit Jaiswal »

Hi All,
We have tested the Remove-Duplicate stage without a Sort stage before it. It works fine and gives the expected results. However, the DS help specifically states that the data should be presorted and that a Sort stage should be used before the Remove-Duplicate stage. During development we are using only one node, and the partition type is Auto. Will running the Remove-Duplicate stage on data that is not presorted cause any issues with more nodes or with other partition types?
Thanks in advance.
-Amit
cmmurari
Participant
Posts: 34
Joined: Sun Jan 02, 2005 9:55 am
Location: Singapore

Post by cmmurari »

The basic rule is that the input stream should be sorted. Please go through the Parallel Job Developer's Guide.
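
To illustrate why the guide asks for sorted input, here is a minimal Python sketch. It assumes the stage drops a row only when its key matches the key of the row immediately before it; the function name and the (key, payload) row layout are hypothetical and are not DataStage APIs.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, str]  # (key, payload); a hypothetical row layout


def remove_adjacent_duplicates(rows: Iterable[Row]) -> Iterator[Row]:
    """Keep the first row of each run of consecutive rows sharing a key."""
    previous_key = object()  # sentinel that equals no real key
    for row in rows:
        if row[0] != previous_key:
            yield row
        previous_key = row[0]


unsorted_rows = [(1, "a"), (2, "b"), (1, "c"), (2, "d")]
sorted_rows = sorted(unsorted_rows, key=lambda r: r[0])  # stable sort on the key

# Unsorted: equal keys are never adjacent, so nothing is removed.
print(list(remove_adjacent_duplicates(unsorted_rows)))
# Sorted: equal keys are adjacent, so one row per key survives.
print(list(remove_adjacent_duplicates(sorted_rows)))
```

Under that assumption, a test on a small or already ordered sample can look correct even though unsorted input would let duplicates through.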


cheers,
krish
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
If you know the input stream is already sorted (for example, a DB stage using ORDER BY), then you might not need to sort the data again.
The same goes for SELECT statements whose results come back sorted naturally, as Oracle sometimes gives them.
If you're not sure, then you need the Sort stage; using the RD stage's link option for sorting the data won't do!

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
Maybe the reason is that you sorted the data in some earlier stage or stream.



BTW:
"using the RD stage's link option for sorting the data won't do"
May I know why this happens? If so, what is the difference between an explicit Sort stage operation and the presort?

regards
kumar
aartlett
Charter Member
Posts: 152
Joined: Fri Apr 23, 2004 6:44 pm
Location: Australia

Post by aartlett »

For reasonable amounts of data (< 2 GB) I've always been partial to a sort -u before the job :)
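
A rough Python equivalent of the sort -u pre-step, for illustration only. It assumes whole-line comparison and plain code-point ordering, which may differ from what sort does under your locale; unique_sorted_lines is a hypothetical helper name.

```python
from typing import Iterable, List


def unique_sorted_lines(lines: Iterable[str]) -> List[str]:
    """Return the distinct lines in ascending order, like a simple `sort -u`."""
    # Strip trailing newlines so "a\n" and "a" count as the same line.
    return sorted({line.rstrip("\n") for line in lines})


if __name__ == "__main__":
    sample = ["beta\n", "alpha\n", "beta\n"]
    print(unique_sorted_lines(sample))
```

Like sort -u on whole lines, this holds the full set of distinct lines in memory, which is why it suits only modest data volumes.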

However, as people probably have noticed by now, I'm not a datastage purist, I think there are other ways of doing things.

Andrew the Heretic
Andrew

Think outside the Datastage you work in.

There is no True Way, but there are true ways.
ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

Post by ag_ram »

hi

Sorting before the Remove Duplicates stage is necessary. In your case it works because the Auto partitioning is taking care of the sorting: it does an inline sort on the keys on which the duplicates are being removed.

Thanks,
Ram.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's not partitioning. Look at the score. Note that DataStage has inserted some tsort operators (and probably some buffer operators also) on the inputs. So, if you don't specify sorting, DataStage will insert sorting. You might prefer a Sort stage so you can tell it "don't sort (previously sorted)" explicitly.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Set APT_NO_SORT_INSERTION to True in your job and run it to see the difference.
Joshy George
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

For Remove Duplicates, use Hash partitioning on the key columns.
I'm not sure, but Auto partitioning will distribute the records in a round-robin manner, so you may still get duplicates in the output if you are using more than one node.
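
The concern can be sketched in Python. This is a toy model and not DataStage code: the two partitioners and the per-partition dedup are hypothetical stand-ins, and it assumes each node removes duplicates only within its own partition.

```python
from typing import List


def round_robin_partition(keys: List[int], nodes: int) -> List[List[int]]:
    """Deal keys out to the nodes in turn, ignoring their values."""
    parts: List[List[int]] = [[] for _ in range(nodes)]
    for index, key in enumerate(keys):
        parts[index % nodes].append(key)
    return parts


def hash_partition(keys: List[int], nodes: int) -> List[List[int]]:
    """Send equal keys to the same node (modulus used as a toy hash)."""
    parts: List[List[int]] = [[] for _ in range(nodes)]
    for key in keys:
        parts[key % nodes].append(key)
    return parts


def dedup_each_partition(parts: List[List[int]]) -> List[int]:
    """Sort and dedup inside each partition, then concatenate the results."""
    output: List[int] = []
    for part in parts:
        output.extend(sorted(set(part)))
    return output


keys = [7, 7, 4, 4, 7]
# Round robin splits equal keys across nodes, so duplicates survive.
print(dedup_each_partition(round_robin_partition(keys, 2)))
# Hash partitioning keeps equal keys together, so each key appears once.
print(dedup_each_partition(hash_partition(keys, 2)))
```

Whether Auto actually behaves like round robin in front of this stage is the open question in this post; the sketch only shows what would happen if it did.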
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

You don't have to do an explicit 'Hash partitioning'! Look at the score and you can see why. If you include

$APT_NO_PART_INSERTION = True
$APT_NO_SORT_INSERTION = True

and run the job, you can see the difference in the score.

DataStage inserts what is required on the inputs even if you have not specified it.
Joshy George