Strange Duplicate scenario

datastagenewbie · Post by **datastagenewbie** » Wed Aug 11, 2010 8:16 am

I have a strange duplicate scenario , below is my scenario.

key value start      end
1   a     1/1/2010   12/31/2010
1   a     3/1/2010   9/30/2010

In the above case it should duplicate, as there is a overlapping time period

Code: Select all

key value start      end
1   a     1/1/2010   12/31/2010
1   a     1/1/2011   12/31/2011

And in this case it should not

Any ideas of how it can be implemented in datastage??
THanks in advance.

kumar_s · Post by **kumar_s** » Wed Aug 11, 2010 8:24 am

Do a range look up of the same data and mark the looked up records as duplicate and reject it.
In the given example, the second record of the first set will get a match on the first record. And hence duplicate.

PhilHibbs · Post by **PhilHibbs** » Thu Aug 12, 2010 7:22 am

kumar_s wrote:Do a range look up of the same data and mark the looked up records as duplicate and reject it.
In the given example, the second record of the first set will get a match on the first record. And hence duplicate.

That sounds quite complicated. Do you mean, look up both the start and end date against every other pair of values other than itself? That would involve creating n-1 lookups for n rows of data, so if there are 20 rows, you need to create 19 lookups. If there are 100 rows, you need 99 lookups. I wouldn't have a clue how to build that.

chulett · Post by **chulett** » Thu Aug 12, 2010 7:29 am

People, don't add the tags that we can use here - bold, italics, code, etc - and then enable / check the Disable BBCode in this post option. Kind of defeats the whole purpose.

To the O/P, as I would say elsewhere: FTFY... fixed that for ya.

vivekgadwal · Post by **vivekgadwal** » Thu Aug 12, 2010 9:04 am

I wonder if the scenario is more complicated than I think it is! Please do let me know if I am overlooking something...

My suggestion - Use the "previous row - next row comparison" logic and within that, compare the year for the same key column (in your case, the first two fields). If they belong to the same year, then flag it as duplicate and then use a filter stage to filter off those flagged rows.

PhilHibbs · Post by **PhilHibbs** » Thu Aug 12, 2010 9:13 am

vivekgadwal wrote:My suggestion - Use the "previous row - next row comparison" logic and within that, compare the year for the same key column (in your case, the first two fields). If they belong to the same year, then flag it as duplicate and then use a filter stage to filter off those flagged rows.

They don't overlap just because the year is the same, just that the example happens to be ending on 12/31 and next one starting on 01/01 but I expect that is not a rule. I don't think this example would be an overlap:

Code: Select all

key value start      end
1   a     1/1/2010   10/31/2010
1   a     11/1/2010  12/31/2011

p.s. can we avoid locale-specific dates? It gets very confusing having the dd/mm the wrong way around... so the above would be:

Code: Select all

key value start      end
1   a     2010-01-01   2010-10-31
1   a     2010-11-01   2011-12-31

So if you sort your data by start and process it sequentially, you could check whether the start of a record is earlier than the end of the previous record. Would that do the trick?

kumar_s · Post by **kumar_s** » Thu Aug 12, 2010 9:13 am

PhilHibbs wrote:
kumar_s wrote:Do a range look up of the same data and mark the looked up records as duplicate and reject it.
In the given example, the second record of the first set will get a match on the first record. And hence duplicate.
That sounds quite complicated. Do you mean, look up both the start and end date against every other pair of values other than itself? That would involve creating n-1 lookups for n rows of data, so if there are 20 rows, you need to create 19 lookups. If there are 100 rows, you need 99 lookups. I wouldn't have a clue how to build that.

No, just do one lookup with Key and Value as lookup key. And do the range check in Dates. If Input.Start > Lookup.Start And Input.End<Lookup.End.
Those records which satisfies the condition and the lookup are the duplicates.

The range check can be done in seperate Transformer too, if it was a different version other than 8x.

Am I making any sense?

PhilHibbs · Post by **PhilHibbs** » Thu Aug 12, 2010 9:19 am

kumar_s wrote:Am I making any sense?

Not to me. That logic would only detect a range that is entirely within another range:

Code: Select all

Row 1:           <----------------->
Row 2:              <------------>

...not overlaps like this:

Code: Select all

Row 1:           <--------------->
Row 2:              <-------------->

...or this:

Code: Select all

Row 1:           <--------------->
Row 2:           <--------------->

(where start and end date are the same on two different rows)

PhilHibbs · Post by **PhilHibbs** » Thu Aug 12, 2010 9:20 am

no idea why I'm getting dup posts...

kumar_s · Post by **kumar_s** » Thu Aug 12, 2010 9:34 am

Perhaps Iam missing something somewhere...

May be the Lookup might be the confusion. It should be Join (With cross product) Full outer.

Code: Select all

If Input.Start > Lookup.Start And Input.End<Lookup.End Then 1 Else 0

Wouldn't the above code results to 1 for the first scenario that you had given and 0 for second.
For Third OP has to decide wheather its a duplicate or not.

For a each set of keys, there will be a cross product join.
For two records set for given key 1 and Value a There will be four records to operate.

Eg

Code: Select all

key	value	Inp.Start	Inp.End	Lkp.Start	Lkp.End	Inp.Unique	Lkp.Unique
1	a	1/1/2010	12/31/2010	1/1/2010	12/31/2010	1234	1234
1	a	 1/1/2011	12/31/2011	1/1/2010	12/31/2010	1235	1234
1	a	1/1/2010	12/31/2010	 1/1/2011	12/31/2011	1234	1235
1	a	 1/1/2011	12/31/2011	 1/1/2011	12/31/2011	1235	1235

Let the Unique field be either other available key in the input or self created one. So the records with

Code: Select all

Inp.Unique = Lkp.Unique

has to discarded at first setp.

With the above records set, would we be albe to find the duplicates?

kumar_s · Post by **kumar_s** » Thu Aug 12, 2010 9:38 am

PhilHibbs wrote:no idea why I'm getting dup posts...

It happens. You can delete the latest post, if you like to.
Not anymore, as there got some replies upon it.

PhilHibbs · Post by **PhilHibbs** » Thu Aug 12, 2010 9:40 am

kumar_s wrote:Perhaps Iam missing something somewhere...

May be the Lookup might be the confusion. It should be Join (With cross product) Full outer.

Ah, that makes more sense - that is, essentially, doing what I said - referencing each row against a different set of data.

Code: Select all

If Input.Start > Lookup.Start And Input.End<Lookup.End Then 1 Else 0

kumar_s wrote:Wouldn't the above code results to 1 for the first scenario that you had given and 0 for second.

I guess it depends on what the OP defines as an overlap.

PhilHibbs · Post by **PhilHibbs** » Thu Aug 12, 2010 9:42 am

another dup...

kumar_s · Post by **kumar_s** » Thu Aug 12, 2010 9:47 am

This can be accomplished using Range Lookup too.
With the below Range condition.

Code: Select all

Input.Start > Lookup.Start And Input.End<Lookup.End

Key and Value being the other key.
With the below input,

Code: Select all

key value start      end 
1   a     1/1/2010   12/31/2010 
1   a     3/1/2010   9/30/2010

Would be be getting the following hit?

Code: Select all

key	value	Inp.Start	Inp.End	Lkp.Start	Lkp.End
1	a	1/1/2010	12/31/2010	3/1/2010	9/30/2010

varunndschange · Post by **varunndschange** » Sun Sep 12, 2010 8:03 am

1. Use sort stage, apply hash partitioning on the key value and perform sort on main key and start date.

2. Use transformer stage and create two stage variables svPrevkey and svCurrkey as:
svPrevkey(with its initial value set to -1)=svCurrkey
svCurrkey=key

Again create two new stage variables svPrevEndDt and svCurrEndDt as:
svPrevEndDt(with its initial value set to '01-01-1900')=svCurrEndDt
svCurrEndDt=End Date

Then create another stage variable svResult:
svResult=If svPrevkey<>svCurrkey Then 'U' Else (If svPrevEnd<Start Date Then 'U' Else 'D')

Now.. define two output links and for that put the constraint like as below:
constraint on link1: svResult='U'(means only unique records)
constraint on link2: svResult='D'(means only duplicate records)

Try this out...
I think this can serve your purpose

DSXchange

Strange Duplicate scenario

Strange Duplicate scenario

Solution by using transformer stage