Remove duplicate

eyabmo_rbc · Post by **eyabmo_rbc** » Fri Jan 18, 2008 2:13 pm

It's funny how this component acts on data. I am passing data with a four-column key to this component and, guess what, it doesn't catch the duplicates!

Example: we have an input stream with keys col1, col2, col3, col4.

Inside the Remove Duplicates component, I sort the incoming stream on those keys, and I have defined those four keys as my uniqueness key.

Is this component not composite-key friendly?

Thanks

ray.wurlod · Post by **ray.wurlod** » Fri Jan 18, 2008 3:27 pm

Composite keys are fine. Is your data partitioned, as well as sorted, on these key fields?

Not being partitioned on the keys would seem to manifest as "missing (some) duplicates" if the duplicates were on different partitions as a result, say, of Round Robin partitioning.
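The effect described above can be reproduced outside DataStage. The following is a minimal Python sketch, not DataStage code: the sample rows, the node count, and all function names are assumptions made for illustration. Each "node" removes duplicates only among the rows it holds, which is why duplicates split across partitions survive.

```python
from collections import defaultdict

# Hypothetical sample rows; each row is the composite key (col1..col4).
ROWS = [
    ("a", 1, "x", 10),
    ("b", 2, "y", 20),
    ("a", 1, "x", 10),  # duplicate of the first row
    ("c", 3, "z", 30),
    ("b", 2, "y", 20),  # duplicate of the second row
    ("c", 3, "z", 30),  # duplicate of the fourth row
]
NODES = 2  # assumed degree of parallelism


def round_robin(rows, nodes):
    """Deal rows out to nodes in turn, ignoring their content."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def hash_on_key(rows, nodes):
    """Send every row with the same key to the same node."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row) % nodes].append(row)
    return parts


def dedupe_per_partition(parts):
    """Remove duplicates within each node; nodes never see each other's rows."""
    out = []
    for node in sorted(parts):
        seen = set()
        for row in parts[node]:
            if row not in seen:
                seen.add(row)
                out.append(row)
    return out


print(len(dedupe_per_partition(round_robin(ROWS, NODES))))  # 4: one duplicate survives
print(len(dedupe_per_partition(hash_on_key(ROWS, NODES))))  # 3: all duplicates removed
```

With round robin, the two `("b", 2, "y", 20)` rows land on different nodes, so neither node recognises them as duplicates.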

eyabmo_rbc · Post by **eyabmo_rbc** » Fri Jan 18, 2008 3:45 pm

Hi,

Thanks for the response. Yes, I did partition the data, and hash-sorted the records on the same key. I guess I am now seeing the data differently.
It's correct now.
Thanks

kumar_s · Post by **kumar_s** » Fri Jan 18, 2008 5:28 pm

Auto partitioning could lead to Round Robin partitioning in any stage, and the same could be propagated. Thus the records could have been left on different nodes during the duplicate removal process.

just4u_sharath · Post by **just4u_sharath** » Fri Jan 18, 2008 10:12 pm

kumar_s wrote: Auto partitioning could lead to Round Robin partitioning in any stage, and the same could be propagated. Thus the records could have been left on different nodes during the duplicate removal process.

Does Auto partitioning always lead to Round Robin in any stage?

ray.wurlod · Post by **ray.wurlod** » Sat Jan 19, 2008 1:08 am

No.

(Auto) leads to Round Robin except:

on reference input to Lookup stage - Entire

on inputs to Join and Merge stages - Hash on join key(s)

on DB2/UDB Enterprise stages - DB2

on other parallel to parallel with same degree of parallelism - Same
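The list above can be read as a small decision table. The sketch below encodes only the rules stated in this post; the function name, its parameters, and the order in which the exceptions are checked are assumptions made for illustration, not a DataStage API.

```python
def auto_partitioning(stage, link="input", same_parallelism=False):
    """Return what (Auto) resolves to, following only the rules listed above.

    stage            -- stage type name, e.g. "Lookup", "Join", "Remove Duplicates"
    link             -- "input" or "reference" (hypothetical labels)
    same_parallelism -- True for a parallel-to-parallel link with the same
                        degree of parallelism on both sides
    """
    if stage == "Lookup" and link == "reference":
        return "Entire"
    if stage in ("Join", "Merge"):
        return "Hash on join key(s)"
    if stage == "DB2/UDB Enterprise":
        return "DB2"
    if same_parallelism:
        return "Same"
    return "Round Robin"


# Remove Duplicates is not in the list of exceptions, so (Auto) gives no
# guarantee that rows with equal keys end up on the same node.
print(auto_partitioning("Remove Duplicates"))  # Round Robin
```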

eyabmo_rbc · Post by **eyabmo_rbc** » Mon Jan 21, 2008 3:53 pm

So, do you recommend partitioning the data (Hash) on the key before we sort it, and then removing the duplicates?

ray.wurlod · Post by **ray.wurlod** » Mon Jan 21, 2008 8:16 pm

Whether I recommend it or not is irrelevant. It's what you have to do.
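The required order (hash partition on the key, sort within each partition, then remove duplicates) can be sketched in Python as follows. This is an illustration under assumptions, not DataStage code: the sample rows, the payload column, and the function names are hypothetical, and the duplicate removal is modelled as comparing each row only with its neighbour, which is why the sort matters.

```python
from itertools import groupby

# Hypothetical rows: col1..col4 form the key, the fifth field is a payload.
SAMPLE = [
    ("a", 1, "x", 10, "first"),
    ("b", 2, "y", 20, "only"),
    ("a", 1, "x", 10, "second"),  # same key as the first row
]


def key_of(row):
    """The composite uniqueness key: col1, col2, col3, col4."""
    return row[:4]


def drop_adjacent_duplicates(rows, key):
    """Keep the first row of each run of consecutive rows sharing a key."""
    return [next(group) for _, group in groupby(rows, key=key)]


def hash_sort_dedupe(rows, key, nodes):
    """Hash partition on the key, sort each partition, then dedupe it."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(key(row)) % nodes].append(row)
    result = []
    for part in parts:
        part.sort(key=key)  # stable sort: the earliest row of each key stays first
        result.extend(drop_adjacent_duplicates(part, key))
    return result


# Without the sort, the two "a" rows are not adjacent and both survive.
print(len(drop_adjacent_duplicates(SAMPLE, key_of)))  # 3
print(len(hash_sort_dedupe(SAMPLE, key_of, nodes=2)))  # 2
```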