Pick a particular duplicate record

dsusersaj · Post by **dsusersaj** » Tue Aug 03, 2010 9:02 am

SriKara wrote:
kris007 wrote:Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates.
I guess the tricky part here is to define that dummy column with a value when there are duplicates on '0001' . May be you can pull out an extra link from source where the non-key column has '0001' and group by on the 'key' column to find the 'counts'. Count more than 1 would tell you that key has duplicates on '0001'. Use the resultant set as a lookup.

Or is there an easier way!!!

I was thinking on the same lines too..To use an aggregator to get the duplicate set and use as lookup. Wanted to check if there is a simpler way as this requirement is very simple..

kris007 · Post by **kris007** » Tue Aug 03, 2010 9:07 am

In my mind this is still a simple design unless I am not understanding your requirement completely. I looked at your first post and I don't think the example you provided correctly reflects the issue you have. Can you be more specific with your example, lets say with about 4-5 sample records and how exactly you want your output to look like?

dsusersaj · Post by **dsusersaj** » Tue Aug 03, 2010 9:21 am

kris007 wrote:In my mind this is still a simple design unless I am not understanding your requirement completely. I looked at your first post and I don't think the example you provided correctly reflects the issue you have. Can you be more specific with your example, lets say with about 4-5 sample records and how exactly you want your output to look like?

Sample data with the dummy column defined..Whenver there is 0001 in Field2, Flag=1 Else Flag=0

Filed1(Key) Field2 Flag
ABC 0001 1
ABC 0002 0
DEF 0003 0
DEF 0004 0
EFG 0001 1

My requirement is to remove the 0001 when it is coming as a dupe on key field Field1.
So output shud be

Filed1(Key) Field2
ABC 0002
DEF 0003
EFG 0001

The desing which I am thinking is

Source---->Transforemer

From transformer, two links.One to aggregator to find the duplicate Field1 list(set1).Second to remove duplicate based on Field1 and Flag (set2).

Then lookup set2 on set1 on Field1.If the lookp matches and Flag=1 thenremove that record..Else pass it on.

Here i will be using aggregator,remove duplicate,lookup,and filter too which I thought is way too much processing for this simple logic.

I dont know how to create psudocode here to make it more clear.

[/img]

kris007 · Post by **kris007** » Tue Aug 03, 2010 9:44 am

Well, that's exactly the design I mentioned would do. In the Transformer Stage create the Dummy column similarly as mentioned earlier.

Code: Select all

If Field2 = '0001' Then 1 Else 0

In the Remove Duplicates Stage, mention Field 1 as the Key on the Stage properties tab. But in the Input >> Partitioning tab after selecting hash partitioning and selecting Perform Sort option select Field 1 and Dummy Column as keys, then on the right hand side, right click on the Dummy Column and select Sort direction as Ascending and usage as only Sorting. Also on the Stage properties tab, make sure you set Duplicate to Retain = First. Now, when you run your job, based upon the examples provided, for Field 1 = ABC you will pick up Field 2 = 0002 because you are sorting your flag in an ascending order and the flag is 0 for this record. For Field 1 = EFG because there is only one record available you will just pass it through and you will get your required output.

This is a very simple design and I have done it in the past multiple times. Let us know how you do.

SriKara · Post by **SriKara** » Tue Aug 03, 2010 9:51 am

'0001' was just an example. In his case, there are values greater than and less than that the constant number. So sorting might not actually work for him.

dsusersaj · Post by **dsusersaj** » Tue Aug 03, 2010 9:51 am

kris007 wrote:Well, that's exactly the design I mentioned would do. In the Transformer Stage create the Dummy column similarly as mentioned earlier.
Code: Select all
If Field2 = '0001' Then 1 Else 0
In the Remove Duplicates Stage, mention Field 1 as the Key on the Stage properties tab. But in the Input >> Partitioning tab after selecting hash partitioning and selecting Perform Sort option select Field 1 and Dummy Column as keys, then on the right hand side, right click on the Dummy Column and select Sort direction as Ascending and usage as only Sorting. Also on the Stage properties tab, make sure you set Duplicate to Retain = First. Now, when you run your job, based upon the examples provided, for Field 1 = ABC you will pick up Field 2 = 0002 because you are sorting your flag in an ascending order and the flag is 0 for this record. For Field 1 = EFG because there is only one record available you will just pass it through and you will get your required output.

This is a very simple design and I have done it in the past multiple times. Let us know how you do.

Yes. This should work!. Thanks.

I need to capture the duplicate rows too. SO I think I should use sort stage with keychange.

dsusersaj · Post by **dsusersaj** » Tue Aug 03, 2010 9:54 am

SriKara wrote:'0001' was just an example. In his case, there are values greater than and less than that the constant number. So sorting might not actually work for him.

I think this is taken care by defining the flag column. So in any case the duplicate record with '0001' will be flagged as 1 and will appear as the last row of the group which wil be always eliminated if its a duplicate.

SriKara · Post by **SriKara** » Tue Aug 03, 2010 9:56 am

dsusersaj wrote:
SriKara wrote:'0001' was just an example. In his case, there are values greater than and less than that the constant number. So sorting might not actually work for him.
I think this is taken care by defining the flag column. So in any case the duplicate record with '0001' will be flagged as 1 and will appear as the last row of the group which wil be always eliminated if its a duplicate.

You are right. Thanks.

kris007 · Post by **kris007** » Tue Aug 03, 2010 10:04 am

dsusersaj wrote:I need to capture the duplicate rows too. SO I think I should use sort stage with keychange.

You didn't mention that in your original post. As you mentioned you can do it using the Sort stage with Create Key Change Column set to True and capturing the required records downstream.

dsusersaj · Post by **dsusersaj** » Tue Aug 03, 2010 10:09 am

kris007 wrote:
dsusersaj wrote:I need to capture the duplicate rows too. SO I think I should use sort stage with keychange.
You didn't mention that in your original post. As you mentioned you can do it using the Sort stage with Create Key Change Column set to True and capturing the required records downstream.

Yeah. I just wanted the logic and I knew that to capture dupes, I can use sort stage with keychange option.

Thank you for your time and effort.