Page 1 of 1

Is Sort stage before Remove-Duplicate stage mandatory?

Posted: Thu Sep 29, 2005 2:48 am
by Amit Jaiswal
Hi All,
We have tested Remove-Duplicate stage without using any Sort Stage before it. It is working fine and gives expected results. However, in DS help, it is specifically given that data should be presorted and Sort Stage should be used before using Remove-Duplicate stage. During development we are using only one node and partition type is Auto. Will Remove-Duplicate stage without presorted data, cause any issue with more number of nodes and with other type of partitions?
Thanks in advance.
-Amit

Posted: Thu Sep 29, 2005 3:14 am
by cmmurari
Basic rule is input stream should be sorted. pls go through parallel jobs developer's guide.


cheers,
krish

Posted: Thu Sep 29, 2005 9:30 am
by roy
Hi,
If you know the input stream is already sorted (like a db stage using order by) then you might not need to sort the data again.
The same for select statements that comes sorted naturally as Oracle some times give.
In case your not sure then you need the sort stage, using the RD stage's link option for sorting the data won't do!

IHTH,

Posted: Sat Oct 01, 2005 2:30 am
by kumar_s
Hi,
May be the reason that you might have sorted the data in some earliar stage or streams.



BTW:
using the RD stage's link option for sorting the data won't do
may i know why this happens, if so what is the difference between a explicit sort stage operation and the presort....

regards
kumar

Posted: Mon Oct 03, 2005 12:17 am
by aartlett
For reasonable amounts of data (< 2gb) I've always been partial to a sort -u before the job :)

However, as people probably have noticed by now, I'm not a datastage purist, I think there are other ways of doing things.

Andrew the Heretic

Posted: Sun May 27, 2007 8:12 am
by ag_ram
hi

The sorting before the remove duplicates is necessary. In your case it is working because the auto partitioning is taking care of the sorting. it is doing an inline sort on the keys based on which it is doing the remove duplicates.

Thanks,
Ram.

Posted: Sun May 27, 2007 7:23 pm
by ray.wurlod
It's not partitioning. Look at the score. Note that DataStage has inserted some tsort operators (and probably some buffer operators also) on the inputs. So, if you don't specify sorting, DataStage will insert sorting. You might prefer a Sort stage so you can tell it "don't sort (previously sorted)" explicitly.

Posted: Sun May 27, 2007 10:29 pm
by JoshGeorge
Set APT_NO_SORT_INSERTION to True in your job and run to see the difference.

Posted: Mon May 28, 2007 2:51 am
by keshav0307
For remove dulpicate, use Hash partitioning on the Key columns.
not sure, but Auto partition will distribute the records in round robin manner, so , you may still get duplicate output, if you are using more then one 1 node.

Posted: Mon May 28, 2007 3:23 am
by JoshGeorge
You don't have to do an explicit 'Hash partitioning' ! See the score and you can see why. If you include

$APT_NO_PART_INSERTION = True
$APT_NO_SORT_INSERTION = True

and run the job and see the score. You can see the difference.

Datastage inserts what is required on the inputs even if you have not specified.