
duplicate issue

Posted: Wed Oct 05, 2011 2:26 pm
by India2000
When there are duplicate records with the same key value but different field values, how does DS pick which values to use? Which record does it take for processing? Is there any order for selecting a record from the duplicates?

Posted: Wed Oct 05, 2011 3:14 pm
by jwiles
Many of the stages don't care whether records are duplicates or not and will happily process all of the records.

Certain stages can identify and optionally remove duplicates (Sort, Remove Duplicates). Others may handle duplicates differently depending upon the input link and/or options chosen (Join, Lookup, Merge). Still others may depend upon the operation of an outside system (database stages, for instance) to handle duplicates.

To learn the specifics about how the various stages can handle duplicates, read the Information Server Parallel Job Developer's Guide, available for download from <a href="https://www-304.ibm.com/support/docview ... 0">here</a>.

Stages such as transformers, custom operators and buildops can work with other stages to identify and handle duplicates as required.

You can design your job to meet the requirements you have with regards to duplicate handling. Therefore, the answer is ultimately: "What do you need it to do?"

Regards,

Posted: Wed Oct 05, 2011 3:30 pm
by ray.wurlod
You might also contemplate what you want to do if there are nulls in the data.

Posted: Fri Oct 07, 2011 12:14 am
by India2000
I have a job where the reference is a .ds file and one of its two fields is the key field. I see around 15 records with the same key value but different values for the other field. This data set is used as the reference for a lookup. While doing the lookup, on what criteria will a record be selected and loaded into the target? How does DataStage handle these duplicates at the Lookup stage and select one record?

Posted: Fri Oct 07, 2011 12:23 am
by ray.wurlod
Lookup returns the first one found, but still checks for others, unless you enable multiple row return from that reference input link.
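
As a rough illustration of that behaviour, here is a small sketch in plain Python (not DataStage code). The names `reference_rows`, `lookup_first` and `lookup_all` are made up for this example, and the data is invented; it only models "first match wins" versus "return every match".

```python
# Hypothetical reference data: (key, value) pairs with a duplicated key "A".
reference_rows = [("A", 10), ("A", 20), ("B", 30), ("A", 40)]


def lookup_first(rows, key):
    """Model the default case: return the value of the first matching row."""
    for k, v in rows:
        if k == key:
            return v
    return None  # no match on the reference data


def lookup_all(rows, key):
    """Model multiple row return: every matching value, in arrival order."""
    return [v for k, v in rows if k == key]
```

In this model, which row counts as "first" depends purely on the order in which the reference rows arrive, so reordering `reference_rows` changes what `lookup_first` returns.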

Posted: Fri Oct 07, 2011 12:29 am
by suse_dk
There is no criteria for which duplicate is chosen - it is just the first one encountered that the match will be performed on.

So, unless you want multiple rows in the output, you should remove duplicates in either a Sort or a Remove Duplicates stage, where you can define the criteria.
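
To make the idea concrete, here is a minimal sketch in plain Python (again, not DataStage code) of "sort by a criterion you choose, then keep the first row per key". The function name `dedupe_keep_first` and its parameters are hypothetical, chosen only for this illustration.

```python
def dedupe_keep_first(rows, key_index=0, order_by=None):
    """Keep one row per key: the first one after an optional sort.

    rows      -- iterable of tuples
    key_index -- position of the key field within each tuple
    order_by  -- optional sort-key function that defines which duplicate wins
    """
    ordered = sorted(rows, key=order_by) if order_by else list(rows)
    kept = {}
    for row in ordered:
        # setdefault stores the row only if this key has not been seen yet.
        kept.setdefault(row[key_index], row)
    return list(kept.values())


# Invented sample data with a duplicated key "A".
rows = [("A", 10), ("A", 20), ("B", 30)]

# Criterion chosen for this example: keep the highest value per key.
highest = dedupe_keep_first(rows, order_by=lambda r: (r[0], -r[1]))
```

Without `order_by`, the row that survives is simply whichever duplicate arrives first, which is the situation described above.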

Posted: Fri Oct 07, 2011 6:36 am
by chulett
suse_dk wrote: There is no criteria for which duplicate is chosen - it is just the first one encountered that the match will be performed on.
That sounds like criteria to me. :wink:

Posted: Fri Oct 07, 2011 11:07 am
by suse_dk
:roll:

Posted: Fri Oct 07, 2011 2:40 pm
by chulett
Seriously? Sorry but first you said there's no criteria and then you stated the criteria it uses. Wasn't trying to bust anyone's chops, it just tickled my funny bone a bit. And FWIW, that behaviour matches what an Informatica lookup does when its selection criteria is set to "First".