Is Sort stage before Remove-Duplicate stage mandatory?

Amit Jaiswal · Post by **Amit Jaiswal** » Thu Sep 29, 2005 2:48 am

Hi All,
We have tested Remove-Duplicate stage without using any Sort Stage before it. It is working fine and gives expected results. However, in DS help, it is specifically given that data should be presorted and Sort Stage should be used before using Remove-Duplicate stage. During development we are using only one node and partition type is Auto. Will Remove-Duplicate stage without presorted data, cause any issue with more number of nodes and with other type of partitions?
Thanks in advance.
-Amit

cmmurari · Post by **cmmurari** » Thu Sep 29, 2005 3:14 am

Basic rule is input stream should be sorted. pls go through parallel jobs developer's guide.

cheers,
krish

roy · Post by **roy** » Thu Sep 29, 2005 9:30 am

Hi,
If you know the input stream is already sorted (like a db stage using order by) then you might not need to sort the data again.
The same for select statements that comes sorted naturally as Oracle some times give.
In case your not sure then you need the sort stage, using the RD stage's link option for sorting the data won't do!

IHTH,

kumar_s · Post by **kumar_s** » Sat Oct 01, 2005 2:30 am

Hi,
May be the reason that you might have sorted the data in some earliar stage or streams.

BTW:

using the RD stage's link option for sorting the data won't do

may i know why this happens, if so what is the difference between a explicit sort stage operation and the presort....

regards
kumar

aartlett · Post by **aartlett** » Mon Oct 03, 2005 12:17 am

For reasonable amounts of data (< 2gb) I've always been partial to a sort -u before the job

However, as people probably have noticed by now, I'm not a datastage purist, I think there are other ways of doing things.

Andrew the Heretic

ag_ram · Post by **ag_ram** » Sun May 27, 2007 8:12 am

hi

The sorting before the remove duplicates is necessary. In your case it is working because the auto partitioning is taking care of the sorting. it is doing an inline sort on the keys based on which it is doing the remove duplicates.

Thanks,
Ram.

ray.wurlod · Post by **ray.wurlod** » Sun May 27, 2007 7:23 pm

It's not partitioning. Look at the score. Note that DataStage has inserted some tsort operators (and probably some buffer operators also) on the inputs. So, if you don't specify sorting, DataStage will insert sorting. You might prefer a Sort stage so you can tell it "don't sort (previously sorted)" explicitly.

JoshGeorge · Post by **JoshGeorge** » Sun May 27, 2007 10:29 pm

Set APT_NO_SORT_INSERTION to True in your job and run to see the difference.

keshav0307 · Post by **keshav0307** » Mon May 28, 2007 2:51 am

For remove dulpicate, use Hash partitioning on the Key columns.
not sure, but Auto partition will distribute the records in round robin manner, so , you may still get duplicate output, if you are using more then one 1 node.

JoshGeorge · Post by **JoshGeorge** » Mon May 28, 2007 3:23 am

You don't have to do an explicit 'Hash partitioning' ! See the score and you can see why. If you include

$APT_NO_PART_INSERTION = True
$APT_NO_SORT_INSERTION = True

and run the job and see the score. You can see the difference.

Datastage inserts what is required on the inputs even if you have not specified.