Page 1 of 2

Pick a particular duplicate record

Posted: Mon Aug 02, 2010 2:06 pm
by dsusersaj
Is there any way we can pick a particular duplicate record from two duplicate records?.

For eg: if there are two duplicate records on Field1, and my requirment is to pick the record with field2 as "some particular value"..else pick the first out of the duplicates.

eg:

filed1 field2
123 abc
123 def

I should pick 123 abc. else the first record out of the group.

Thanks

Posted: Mon Aug 02, 2010 2:17 pm
by anbu
Use remove duplicate stage and set Duplicate to retain as First.

Also set execution mode to sequential in remove duplicate stage

Posted: Mon Aug 02, 2010 2:22 pm
by dsusersaj
anbu wrote:Use remove duplicate stage and set Duplicate to retain as First.

Also set execution mode to sequential in remove duplicate stage
Nope. I need to conditionally select a record from duplicate if the duplicate contains a particular value for a non-key field. Else retain the first record.

Posted: Mon Aug 02, 2010 2:23 pm
by dsusersaj
dsusersaj wrote:
anbu wrote:Use remove duplicate stage and set Duplicate to retain as First.

Also set execution mode to sequential in remove duplicate stage
Nope. I need to conditionally select a record from duplicate if the duplicate contains a particular value for a non-key field. Else retain the first record.
Is that level of programming possible in datastage?.

Posted: Mon Aug 02, 2010 2:42 pm
by anbu
What is the requirement to pick the field2?

Posted: Mon Aug 02, 2010 2:47 pm
by dsusersaj
anbu wrote:What is the requirement to pick the field2?
Requirement is that if out of the duplicates, field 2 is "0001" then exclude this record and take the other record. Else if it is a normal duplicate take any.

Posted: Mon Aug 02, 2010 2:52 pm
by chulett
If '0001' is your lowest possible value for that field, sort ascending on it and always pick the first duplicate as noted earlier.

Posted: Mon Aug 02, 2010 2:55 pm
by dsusersaj
chulett wrote:If '0001' is your lowest possible value for that field, sort ascending on it and always pick the first duplicate as noted earlier. ...
0001 is not my lowest possible value..This was just an example. My actual values are 07108 and 07107. There are values lower than and higher than these values in the list.

Posted: Mon Aug 02, 2010 2:55 pm
by anbu

Code: Select all

Source --> Transformer --> Remove Duplicate --> Target
Set the constraint in transformer to

Code: Select all

Field2 <> "0001"
Use Remove duplicates as mentioned above.

Posted: Mon Aug 02, 2010 2:58 pm
by dsusersaj
anbu wrote:

Code: Select all

Source --> Transformer --> Remove Duplicate --> Target
Set the constraint in transformer to

Code: Select all

Field2 <> "0001"
Use Remove duplicates as mentioned above.
I want to remove '0001' only when it is coming as duplicate.

In your design, there is a possibility that '0001' gets retained from the first part while the other records gets removed.

And the second part removes '0001' . SO from that group there is nothing that gets retained.

Posted: Mon Aug 02, 2010 3:03 pm
by kris007
anbu wrote:

Code: Select all

Source --> Transformer --> Remove Duplicate --> Target
Set the constraint in transformer to

Code: Select all

Field2 <> "0001"
Use Remove duplicates as mentioned above.
If "0001' is not always a constant value, you can always create a dummy column something like a flag you can set to based upon the input values within the transformer stage in the design above (from anbu) and then use the Remove Duplicates stage (by sorting the new dummy column along with the keys).

Hope that helps.

Posted: Tue Aug 03, 2010 8:36 am
by dsusersaj
kris007 wrote:
anbu wrote:

Code: Select all

Source --> Transformer --> Remove Duplicate --> Target
Set the constraint in transformer to

Code: Select all

Field2 <> "0001"
Use Remove duplicates as mentioned above.
If "0001' is not always a constant value, you can always create a dummy column something like a flag you can set to based upon the input values within the transformer stage in the design above (from anbu) and then use the Remove Duplicates stage (by sorting the new dummy column along with the keys).

Hope that helps.

This will also remove '0001' when it is not coming as a dupe.

Posted: Tue Aug 03, 2010 8:47 am
by kris007
Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates. Something like

Code: Select all

If Field = '0001' Then 1 Else 2
In the Remove Duplicates, while you hash partition and sort the data on the keys, you need to just sort the data on the Dummy column (and not partition it). You can do this by right clicking on the Dummy Column within the Input >> Partitioning tab once you drag the keys onto the right hand side. As long as you sort your data on the Dummy column, you will retain "0001" when it is not coming in as a dupe.

Hope that helps.

Posted: Tue Aug 03, 2010 8:52 am
by dsusersaj
kris007 wrote:Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates. Something like

Code: Select all

If Field = '0001' Then 1 Else 2
In the Remove Duplicates, while you hash partition and sort the data on the keys, you need to just sort the data on the Dummy column (and not partition it). You can do this by right clicking on the Dummy Column within the Input >> Partitioning tab once you drag the keys onto the right hand side. As long as you sort your data on the Dummy column, you will retain "0001" when it is not coming in as a dupe.

Hope that helps.
But the next step is to filter out 0001 right?.. Filed1<>'0001'. Then these wil be filtered out even if it is coming as a non-dupe.

Posted: Tue Aug 03, 2010 9:00 am
by SriKara
kris007 wrote:Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates.
I guess the tricky part here is to define that dummy column with a value when there are duplicates on '0001' . May be you can pull out an extra link from source where the non-key column has '0001' and group by on the 'key' column to find the 'counts'. Count more than 1 would tell you that key has duplicates on '0001'. Use the resultant set as a lookup.

Or is there an easier way!!!