Pick a particular duplicate record

dsusersaj · Post by **dsusersaj** » Mon Aug 02, 2010 2:06 pm

Is there any way we can pick a particular duplicate record from two duplicate records?.

For eg: if there are two duplicate records on Field1, and my requirment is to pick the record with field2 as "some particular value"..else pick the first out of the duplicates.

eg:

filed1 field2
123 abc
123 def

I should pick 123 abc. else the first record out of the group.

Thanks

anbu · Post by **anbu** » Mon Aug 02, 2010 2:17 pm

Use remove duplicate stage and set Duplicate to retain as First.

Also set execution mode to sequential in remove duplicate stage

dsusersaj · Post by **dsusersaj** » Mon Aug 02, 2010 2:22 pm

anbu wrote:Use remove duplicate stage and set Duplicate to retain as First.

Also set execution mode to sequential in remove duplicate stage

Nope. I need to conditionally select a record from duplicate if the duplicate contains a particular value for a non-key field. Else retain the first record.

dsusersaj · Post by **dsusersaj** » Mon Aug 02, 2010 2:23 pm

dsusersaj wrote:
anbu wrote:Use remove duplicate stage and set Duplicate to retain as First.

Also set execution mode to sequential in remove duplicate stage
Nope. I need to conditionally select a record from duplicate if the duplicate contains a particular value for a non-key field. Else retain the first record.

Is that level of programming possible in datastage?.

anbu · Post by **anbu** » Mon Aug 02, 2010 2:42 pm

What is the requirement to pick the field2?

dsusersaj · Post by **dsusersaj** » Mon Aug 02, 2010 2:47 pm

anbu wrote:What is the requirement to pick the field2?

Requirement is that if out of the duplicates, field 2 is "0001" then exclude this record and take the other record. Else if it is a normal duplicate take any.

chulett · Post by **chulett** » Mon Aug 02, 2010 2:52 pm

If '0001' is your lowest possible value for that field, sort ascending on it and always pick the first duplicate as noted earlier.

dsusersaj · Post by **dsusersaj** » Mon Aug 02, 2010 2:55 pm

chulett wrote:If '0001' is your lowest possible value for that field, sort ascending on it and always pick the first duplicate as noted earlier. ...

0001 is not my lowest possible value..This was just an example. My actual values are 07108 and 07107. There are values lower than and higher than these values in the list.

anbu · Post by **anbu** » Mon Aug 02, 2010 2:55 pm

Code: Select all

Source --> Transformer --> Remove Duplicate --> Target

Set the constraint in transformer to

Code: Select all

Field2 <> "0001"

Use Remove duplicates as mentioned above.

dsusersaj · Post by **dsusersaj** » Mon Aug 02, 2010 2:58 pm

anbu wrote:
Code: Select all
Source --> Transformer --> Remove Duplicate --> Target
Set the constraint in transformer to
Code: Select all
Field2 <> "0001"
Use Remove duplicates as mentioned above.

I want to remove '0001' only when it is coming as duplicate.

In your design, there is a possibility that '0001' gets retained from the first part while the other records gets removed.

And the second part removes '0001' . SO from that group there is nothing that gets retained.

kris007 · Post by **kris007** » Mon Aug 02, 2010 3:03 pm

anbu wrote:
Code: Select all
Source --> Transformer --> Remove Duplicate --> Target
Set the constraint in transformer to
Code: Select all
Field2 <> "0001"
Use Remove duplicates as mentioned above.

If "0001' is not always a constant value, you can always create a dummy column something like a flag you can set to based upon the input values within the transformer stage in the design above (from anbu) and then use the Remove Duplicates stage (by sorting the new dummy column along with the keys).

Hope that helps.

dsusersaj · Post by **dsusersaj** » Tue Aug 03, 2010 8:36 am

kris007 wrote:
anbu wrote:
Code: Select all
Source --> Transformer --> Remove Duplicate --> Target
Set the constraint in transformer to
Code: Select all
Field2 <> "0001"
Use Remove duplicates as mentioned above.
If "0001' is not always a constant value, you can always create a dummy column something like a flag you can set to based upon the input values within the transformer stage in the design above (from anbu) and then use the Remove Duplicates stage (by sorting the new dummy column along with the keys).

Hope that helps.

This will also remove '0001' when it is not coming as a dupe.

kris007 · Post by **kris007** » Tue Aug 03, 2010 8:47 am

Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates. Something like

Code: Select all

If Field = '0001' Then 1 Else 2

In the Remove Duplicates, while you hash partition and sort the data on the keys, you need to just sort the data on the Dummy column (and not partition it). You can do this by right clicking on the Dummy Column within the Input >> Partitioning tab once you drag the keys onto the right hand side. As long as you sort your data on the Dummy column, you will retain "0001" when it is not coming in as a dupe.

Hope that helps.

dsusersaj · Post by **dsusersaj** » Tue Aug 03, 2010 8:52 am

kris007 wrote:Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates. Something like
Code: Select all
If Field = '0001' Then 1 Else 2
In the Remove Duplicates, while you hash partition and sort the data on the keys, you need to just sort the data on the Dummy column (and not partition it). You can do this by right clicking on the Dummy Column within the Input >> Partitioning tab once you drag the keys onto the right hand side. As long as you sort your data on the Dummy column, you will retain "0001" when it is not coming in as a dupe.

Hope that helps.

But the next step is to filter out 0001 right?.. Filed1<>'0001'. Then these wil be filtered out even if it is coming as a non-dupe.

SriKara · Post by **SriKara** » Tue Aug 03, 2010 9:00 am

kris007 wrote:Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates.

I guess the tricky part here is to define that dummy column with a value when there are duplicates on '0001' . May be you can pull out an extra link from source where the non-key column has '0001' and group by on the 'key' column to find the 'counts'. Count more than 1 would tell you that key has duplicates on '0001'. Use the resultant set as a lookup.

Or is there an easier way!!!