Strange Duplicate scenario

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
datastagenewbie
Participant
Posts: 64
Joined: Tue Sep 23, 2008 9:54 am

Strange Duplicate scenario

Post by datastagenewbie »

I have a strange duplicate scenario , below is my scenario.

Code: Select all

key value start      end
1   a     1/1/2010   12/31/2010
1   a     3/1/2010   9/30/2010
In the above case it should duplicate, as there is a overlapping time period

Code: Select all

key value start      end
1   a     1/1/2010   12/31/2010
1   a     1/1/2011   12/31/2011
And in this case it should not

Any ideas of how it can be implemented in datastage??
THanks in advance.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Do a range look up of the same data and mark the looked up records as duplicate and reject it.
In the given example, the second record of the first set will get a match on the first record. And hence duplicate.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

kumar_s wrote:Do a range look up of the same data and mark the looked up records as duplicate and reject it.
In the given example, the second record of the first set will get a match on the first record. And hence duplicate.
That sounds quite complicated. Do you mean, look up both the start and end date against every other pair of values other than itself? That would involve creating n-1 lookups for n rows of data, so if there are 20 rows, you need to create 19 lookups. If there are 100 rows, you need 99 lookups. I wouldn't have a clue how to build that.
Phil Hibbs | Capgemini
Technical Consultant
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:idea: People, don't add the tags that we can use here - bold, italics, code, etc - and then enable / check the Disable BBCode in this post option. Kind of defeats the whole purpose.

To the O/P, as I would say elsewhere: FTFY... fixed that for ya. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

I wonder if the scenario is more complicated than I think it is! Please do let me know if I am overlooking something... :?

My suggestion - Use the "previous row - next row comparison" logic and within that, compare the year for the same key column (in your case, the first two fields). If they belong to the same year, then flag it as duplicate and then use a filter stage to filter off those flagged rows.
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

vivekgadwal wrote:My suggestion - Use the "previous row - next row comparison" logic and within that, compare the year for the same key column (in your case, the first two fields). If they belong to the same year, then flag it as duplicate and then use a filter stage to filter off those flagged rows.
They don't overlap just because the year is the same, just that the example happens to be ending on 12/31 and next one starting on 01/01 but I expect that is not a rule. I don't think this example would be an overlap:

Code: Select all

key value start      end
1   a     1/1/2010   10/31/2010
1   a     11/1/2010  12/31/2011
p.s. can we avoid locale-specific dates? It gets very confusing having the dd/mm the wrong way around... so the above would be:

Code: Select all

key value start      end
1   a     2010-01-01   2010-10-31
1   a     2010-11-01   2011-12-31
So if you sort your data by start and process it sequentially, you could check whether the start of a record is earlier than the end of the previous record. Would that do the trick?
Phil Hibbs | Capgemini
Technical Consultant
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

PhilHibbs wrote:
kumar_s wrote:Do a range look up of the same data and mark the looked up records as duplicate and reject it.
In the given example, the second record of the first set will get a match on the first record. And hence duplicate.
That sounds quite complicated. Do you mean, look up both the start and end date against every other pair of values other than itself? That would involve creating n-1 lookups for n rows of data, so if there are 20 rows, you need to create 19 lookups. If there are 100 rows, you need 99 lookups. I wouldn't have a clue how to build that.
No, just do one lookup with Key and Value as lookup key. And do the range check in Dates. If Input.Start > Lookup.Start And Input.End<Lookup.End.
Those records which satisfies the condition and the lookup are the duplicates.

The range check can be done in seperate Transformer too, if it was a different version other than 8x.

Am I making any sense?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

kumar_s wrote:Am I making any sense?
Not to me. That logic would only detect a range that is entirely within another range:

Code: Select all

Row 1:           <----------------->
Row 2:              <------------>
...not overlaps like this:

Code: Select all

Row 1:           <--------------->
Row 2:              <-------------->
...or this:

Code: Select all

Row 1:           <--------------->
Row 2:           <--------------->
(where start and end date are the same on two different rows)
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

no idea why I'm getting dup posts...
Last edited by PhilHibbs on Thu Aug 12, 2010 9:35 am, edited 1 time in total.
Phil Hibbs | Capgemini
Technical Consultant
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Perhaps Iam missing something somewhere...

May be the Lookup might be the confusion. It should be Join (With cross product) Full outer.

Code: Select all

If Input.Start > Lookup.Start And Input.End<Lookup.End Then 1 Else 0
Wouldn't the above code results to 1 for the first scenario that you had given and 0 for second.
For Third OP has to decide wheather its a duplicate or not.

For a each set of keys, there will be a cross product join.
For two records set for given key 1 and Value a There will be four records to operate.

Eg

Code: Select all

key	value	Inp.Start	Inp.End	Lkp.Start	Lkp.End	Inp.Unique	Lkp.Unique
1	a	1/1/2010	12/31/2010	1/1/2010	12/31/2010	1234	1234
1	a	 1/1/2011	12/31/2011	1/1/2010	12/31/2010	1235	1234
1	a	1/1/2010	12/31/2010	 1/1/2011	12/31/2011	1234	1235
1	a	 1/1/2011	12/31/2011	 1/1/2011	12/31/2011	1235	1235
Let the Unique field be either other available key in the input or self created one. So the records with

Code: Select all

Inp.Unique = Lkp.Unique
has to discarded at first setp.


With the above records set, would we be albe to find the duplicates?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

PhilHibbs wrote:no idea why I'm getting dup posts...
It happens. You can delete the latest post, if you like to.
Not anymore, as there got some replies upon it.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

kumar_s wrote:Perhaps Iam missing something somewhere...

May be the Lookup might be the confusion. It should be Join (With cross product) Full outer.
Ah, that makes more sense - that is, essentially, doing what I said - referencing each row against a different set of data.

Code: Select all

If Input.Start > Lookup.Start And Input.End<Lookup.End Then 1 Else 0
kumar_s wrote:Wouldn't the above code results to 1 for the first scenario that you had given and 0 for second.
I guess it depends on what the OP defines as an overlap.
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

another dup...
Last edited by PhilHibbs on Thu Aug 12, 2010 9:55 am, edited 1 time in total.
Phil Hibbs | Capgemini
Technical Consultant
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

This can be accomplished using Range Lookup too.
With the below Range condition.

Code: Select all

Input.Start > Lookup.Start And Input.End<Lookup.End
Key and Value being the other key.
With the below input,

Code: Select all

key value start      end 
1   a     1/1/2010   12/31/2010 
1   a     3/1/2010   9/30/2010
Would be be getting the following hit?

Code: Select all

key	value	Inp.Start	Inp.End	Lkp.Start	Lkp.End
1	a	1/1/2010	12/31/2010	3/1/2010	9/30/2010
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
varunndschange
Participant
Posts: 4
Joined: Mon Jul 05, 2010 1:21 am

Solution by using transformer stage

Post by varunndschange »

1. Use sort stage, apply hash partitioning on the key value and perform sort on main key and start date.

2. Use transformer stage and create two stage variables svPrevkey and svCurrkey as:
svPrevkey(with its initial value set to -1)=svCurrkey
svCurrkey=key

Again create two new stage variables svPrevEndDt and svCurrEndDt as:
svPrevEndDt(with its initial value set to '01-01-1900')=svCurrEndDt
svCurrEndDt=End Date

Then create another stage variable svResult:
svResult=If svPrevkey<>svCurrkey Then 'U' Else (If svPrevEnd<Start Date Then 'U' Else 'D')

Now.. define two output links and for that put the constraint like as below:
constraint on link1: svResult='U'(means only unique records)
constraint on link2: svResult='D'(means only duplicate records)

Try this out...
I think this can serve your purpose
Varun Negi
Assistant System Enginner
TCS
Post Reply