Pick a particular duplicate record

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsusersaj
Premium Member
Premium Member
Posts: 160
Joined: Mon Dec 17, 2007 10:44 am

Post by dsusersaj »

SriKara wrote:
kris007 wrote:Nope. You have to define your dummy column in a way that you can identify the values you need drop when they have duplicates.
I guess the tricky part here is to define that dummy column with a value when there are duplicates on '0001' . May be you can pull out an extra link from source where the non-key column has '0001' and group by on the 'key' column to find the 'counts'. Count more than 1 would tell you that key has duplicates on '0001'. Use the resultant set as a lookup.

Or is there an easier way!!!
I was thinking on the same lines too..To use an aggregator to get the duplicate set and use as lookup. Wanted to check if there is a simpler way as this requirement is very simple..
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

In my mind this is still a simple design unless I am not understanding your requirement completely. I looked at your first post and I don't think the example you provided correctly reflects the issue you have. Can you be more specific with your example, lets say with about 4-5 sample records and how exactly you want your output to look like?
Kris

Where's the "Any" key?-Homer Simpson
dsusersaj
Premium Member
Premium Member
Posts: 160
Joined: Mon Dec 17, 2007 10:44 am

Post by dsusersaj »

kris007 wrote:In my mind this is still a simple design unless I am not understanding your requirement completely. I looked at your first post and I don't think the example you provided correctly reflects the issue you have. Can you be more specific with your example, lets say with about 4-5 sample records and how exactly you want your output to look like?
Sample data with the dummy column defined..Whenver there is 0001 in Field2, Flag=1 Else Flag=0

Filed1(Key) Field2 Flag
ABC 0001 1
ABC 0002 0
DEF 0003 0
DEF 0004 0
EFG 0001 1

My requirement is to remove the 0001 when it is coming as a dupe on key field Field1.
So output shud be

Filed1(Key) Field2
ABC 0002
DEF 0003
EFG 0001


The desing which I am thinking is


Source---->Transforemer


From transformer, two links.One to aggregator to find the duplicate Field1 list(set1).Second to remove duplicate based on Field1 and Flag (set2).

Then lookup set2 on set1 on Field1.If the lookp matches and Flag=1 thenremove that record..Else pass it on.

Here i will be using aggregator,remove duplicate,lookup,and filter too which I thought is way too much processing for this simple logic.

I dont know how to create psudocode here to make it more clear.


[/img]
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

Well, that's exactly the design I mentioned would do. In the Transformer Stage create the Dummy column similarly as mentioned earlier.

Code: Select all

If Field2 = '0001' Then 1 Else 0
In the Remove Duplicates Stage, mention Field 1 as the Key on the Stage properties tab. But in the Input >> Partitioning tab after selecting hash partitioning and selecting Perform Sort option select Field 1 and Dummy Column as keys, then on the right hand side, right click on the Dummy Column and select Sort direction as Ascending and usage as only Sorting. Also on the Stage properties tab, make sure you set Duplicate to Retain = First. Now, when you run your job, based upon the examples provided, for Field 1 = ABC you will pick up Field 2 = 0002 because you are sorting your flag in an ascending order and the flag is 0 for this record. For Field 1 = EFG because there is only one record available you will just pass it through and you will get your required output.

This is a very simple design and I have done it in the past multiple times. Let us know how you do.
Kris

Where's the "Any" key?-Homer Simpson
SriKara
Premium Member
Premium Member
Posts: 30
Joined: Wed Jun 01, 2005 8:40 am
Location: UK

Post by SriKara »

'0001' was just an example. In his case, there are values greater than and less than that the constant number. So sorting might not actually work for him.
Regards,
SriKara
dsusersaj
Premium Member
Premium Member
Posts: 160
Joined: Mon Dec 17, 2007 10:44 am

Post by dsusersaj »

kris007 wrote:Well, that's exactly the design I mentioned would do. In the Transformer Stage create the Dummy column similarly as mentioned earlier.

Code: Select all

If Field2 = '0001' Then 1 Else 0
In the Remove Duplicates Stage, mention Field 1 as the Key on the Stage properties tab. But in the Input >> Partitioning tab after selecting hash partitioning and selecting Perform Sort option select Field 1 and Dummy Column as keys, then on the right hand side, right click on the Dummy Column and select Sort direction as Ascending and usage as only Sorting. Also on the Stage properties tab, make sure you set Duplicate to Retain = First. Now, when you run your job, based upon the examples provided, for Field 1 = ABC you will pick up Field 2 = 0002 because you are sorting your flag in an ascending order and the flag is 0 for this record. For Field 1 = EFG because there is only one record available you will just pass it through and you will get your required output.

This is a very simple design and I have done it in the past multiple times. Let us know how you do.

Yes. This should work!. Thanks.

I need to capture the duplicate rows too. SO I think I should use sort stage with keychange.
dsusersaj
Premium Member
Premium Member
Posts: 160
Joined: Mon Dec 17, 2007 10:44 am

Post by dsusersaj »

SriKara wrote:'0001' was just an example. In his case, there are values greater than and less than that the constant number. So sorting might not actually work for him.
I think this is taken care by defining the flag column. So in any case the duplicate record with '0001' will be flagged as 1 and will appear as the last row of the group which wil be always eliminated if its a duplicate.
SriKara
Premium Member
Premium Member
Posts: 30
Joined: Wed Jun 01, 2005 8:40 am
Location: UK

Post by SriKara »

dsusersaj wrote:
SriKara wrote:'0001' was just an example. In his case, there are values greater than and less than that the constant number. So sorting might not actually work for him.
I think this is taken care by defining the flag column. So in any case the duplicate record with '0001' will be flagged as 1 and will appear as the last row of the group which wil be always eliminated if its a duplicate.
You are right. Thanks.
Regards,
SriKara
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

dsusersaj wrote:I need to capture the duplicate rows too. SO I think I should use sort stage with keychange.
You didn't mention that in your original post. As you mentioned you can do it using the Sort stage with Create Key Change Column set to True and capturing the required records downstream.
Kris

Where's the "Any" key?-Homer Simpson
dsusersaj
Premium Member
Premium Member
Posts: 160
Joined: Mon Dec 17, 2007 10:44 am

Post by dsusersaj »

kris007 wrote:
dsusersaj wrote:I need to capture the duplicate rows too. SO I think I should use sort stage with keychange.
You didn't mention that in your original post. As you mentioned you can do it using the Sort stage with Create Key Change Column set to True and capturing the required records downstream.
Yeah. I just wanted the logic and I knew that to capture dupes, I can use sort stage with keychange option.

Thank you for your time and effort.
Post Reply