Page 1 of 1

Query on Sort / Remove Duplicates

Posted: Thu May 17, 2007 2:42 am
by ag_ram
Hello All

I have a parallel job , the input with a set of 5 Keys and 2 value columns. There is a Parallel Sort Stage and a Remove Duplicate follows. The query is i sort the data on all 5 keys and then on value columns , and i remove duplicates on the first 3 Keys .

The expected behavior is to get "least value columns" , and works fine with a few 1000 records , the behavior is haphazard when data volume increases (improper data appearing in the input)

2 Queries :

a) Is the Job Design right ? (Sort on 5 keys and RD on only 3)

b) What can be a possible solution if Q1 answer is wrong ?

Posted: Thu May 17, 2007 3:43 am
by Maveric
Answer to the first question is no. You might be getting a warning regarding this in the job log. You will have to use the same keys for both sort and remove duplicate stage and in the same order. To get the least value can set the sort order to ascending and the duplicates to retain to first. You also need to hash partition the data on the same keys.

Posted: Thu May 17, 2007 4:12 am
by JoshGeorge
Use Sort stage to Sort (Asc)/Partition on the 3 keys you want to remove duplicate and specify remove duplicates to True in sort stage itself.

re:how to removing duplicates or sorting jobs in server jobs

Posted: Thu May 17, 2007 6:17 am
by seshikumar
pla can any body help me
how do removing duplicates in server jobs[/b]

Posted: Thu May 17, 2007 11:47 am
by Maveric
Seshikumar you are asking the question in the wrong forum. Post the query in server forum for better results. Anyway here is one link for your answer. viewtopic.php?t=102106&highlight=remove+duplicates