Remove duplicate

Posted: Fri Jan 18, 2008 2:13 pm
by eyabmo_rbc
It's funny how this component acts on data. I am passing data with a four-column key to this component, and guess what: it doesn't catch the duplicates!

Example: we have an input stream with keys col1, col2, col3, col4.

Inside the Remove Duplicates component, I sort the incoming stream on those keys, and I have defined those four keys as my uniqueness key.

Is this component not composite-key friendly?

thanks

Posted: Fri Jan 18, 2008 3:27 pm
by ray.wurlod
Composite keys are fine. Is your data partitioned, as well as sorted, on these key fields?

Not being partitioned on the keys would seem to manifest as "missing (some) duplicates" if the duplicates were on different partitions as a result, say, of Round Robin partitioning.
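
The effect can be reproduced outside DataStage with a small Python sketch. This is only an illustration of the partitioning behaviour described above, not DataStage code; the function names (`round_robin`, `hash_on_key`, `dedupe_each_partition`) and the sample rows are made up for the example, and each row is assumed to consist of just the four key columns.

```python
def round_robin(rows, n_parts):
    """Deal rows out to partitions in turn, ignoring their key values."""
    parts = [[] for _ in range(n_parts)]
    for i, row in enumerate(rows):
        parts[i % n_parts].append(row)
    return parts


def hash_on_key(rows, n_parts):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row) % n_parts].append(row)
    return parts


def dedupe_each_partition(parts):
    """Sort each partition and drop adjacent duplicates, one partition at a time.

    No partition ever sees the rows held by another partition.
    """
    out = []
    for part in parts:
        prev = object()  # sentinel that never equals a real row
        for row in sorted(part):
            if row != prev:
                out.append(row)
            prev = row
    return out


# Hypothetical rows of (col1, col2, col3, col4); the last repeats the first.
rows = [
    ("a", 1, "x", 10),
    ("b", 2, "y", 20),
    ("c", 3, "z", 30),
    ("a", 1, "x", 10),
]

rr_result = dedupe_each_partition(round_robin(rows, 2))
hash_result = dedupe_each_partition(hash_on_key(rows, 2))
print(len(rr_result), len(hash_result))
```

With two partitions, Round Robin places the two copies of the repeated row on different partitions, so the duplicate survives; hashing on the key keeps them together, so it is removed.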

Posted: Fri Jan 18, 2008 3:45 pm
by eyabmo_rbc
Hi,

Thanks for the response. Yes, I partitioned the data and hash-sorted the records on the same key. I guess I am now seeing the data differently.
It's correct now.
thanks

Posted: Fri Jan 18, 2008 5:28 pm
by kumar_s
Auto partitioning could lead to Round Robin partitioning in any stage, and the same partitioning could be propagated. Thus the duplicate records could have been left on different nodes during the duplicate removal process.

Posted: Fri Jan 18, 2008 10:12 pm
by just4u_sharath
kumar_s wrote:Auto partitioning could lead to Round Robin partitioning in any stage, and the same partitioning could be propagated. Thus the duplicate records could have been left on different nodes during the duplicate removal process.
Does Auto partitioning always lead to Round Robin in any stage?

Posted: Sat Jan 19, 2008 1:08 am
by ray.wurlod
No.

(Auto) leads to Round Robin except:
  • on reference input to Lookup stage - Entire
  • on inputs to Join and Merge stages - Hash on join key(s)
  • on DB2/UDB Enterprise stages - DB2
  • on other parallel to parallel with same degree of parallelism - Same

Posted: Mon Jan 21, 2008 3:53 pm
by eyabmo_rbc
So, do you recommend partitioning the data (Hash) on the key before we sort it, and then removing the duplicates?

Posted: Mon Jan 21, 2008 8:16 pm
by ray.wurlod
Whether I recommend it or not is irrelevant. It's what you have to do.
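
The required order (hash partition on the key, sort within each partition, then remove duplicates) can be sketched in Python as follows. This is a rough model under assumptions, not how the stage is implemented: `remove_duplicates`, `KEYS`, the `payload` column and the sample rows are all hypothetical, and the sketch keeps the first record of each key group.

```python
from itertools import groupby
from operator import itemgetter

# Assumed key columns, named after the example in the first post.
KEYS = ("col1", "col2", "col3", "col4")


def remove_duplicates(rows, keys=KEYS, n_parts=4):
    """Hash partition on the keys, sort each partition, keep one row per key."""
    key_of = itemgetter(*keys)

    # Step 1: hash partition, so equal keys always share a partition.
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(key_of(row)) % n_parts].append(row)

    result = []
    for part in parts:
        # Step 2: sort within the partition so equal keys become adjacent.
        part.sort(key=key_of)
        # Step 3: keep the first row of each run of equal keys.
        for _, run in groupby(part, key=key_of):
            result.append(next(run))
    return result


# Hypothetical input: the first and third rows share the same composite key.
sample = [
    {"col1": "a", "col2": 1, "col3": "x", "col4": 10, "payload": "first"},
    {"col1": "b", "col2": 2, "col3": "y", "col4": 20, "payload": "only"},
    {"col1": "a", "col2": 1, "col3": "x", "col4": 10, "payload": "second"},
]

unique_rows = remove_duplicates(sample, n_parts=2)
print(sorted(r["payload"] for r in unique_rows))
```

Because Python's sort is stable, the row kept for each key is the one that arrived first in the input.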