Issue on duplicate records

praburaj · Post by **praburaj** » Thu Jul 23, 2009 10:42 am

Hi,

I need a help in doing this logic.
I have a source table as follows.

Id seq type date
101 564 'A' 29/06/2007
101 564 'A' null
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008

I want the output as

Id Seq type date
101 564 'A' 29/06/2007
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008

If all the column values are same for the duplicate rows and the date column is not null then that row has to be populated.

If all the columns are not same and date column is null/ not null then it has to be populated.

How to do this...

Any help is appreciated..

bcarlson · Post by **bcarlson** » Thu Jul 23, 2009 11:48 am

Normally, I would recommend the remdup stage, but I don't know how it would deal with NULLs. You tell the remdup stage to keep either the first or last occurrence of a repeated record, but can it handle NULLs? If so, does it show up as low or high when sorted with dates or other datatypes? You may want to experiment with that and see what remdup can do with NULLs.

If it can't handle nulls, there may be a way to get around it. Convert the NULLs to some non-standard date, like 1/1/2799 or 1/1/1900 and then use the remdup stage. For example, if you use 1/1/1900 as the NULL replacement, you would want to sort on id, seq, type and date in ascending order and keep the last occurence. That would keep a normal date and drop the default if both are present. Reverse the logic if you use 1/1/2799.

Once deduped, convert the default values back to NULLs.

Brad.

jcthornton · Post by **jcthornton** » Thu Jul 23, 2009 11:51 pm

You don't specify what needs to happen if all the fields are the same except for date, and two records (or more) have the date populated with different values. The answer to that will significantly impact the solution.

However, this can be solved using a Sort followed by a Dedupe. The Sort can be setup to sort NULLs last or first. Then, in the dedupe stage, have it keep the opposite (first record or last record).

If you can have more than one record with the same composite key, you will need to use a transformer with constraints rather than a dedupe. But that does start to get a bit more complicated. No point working out the details of that if it isn't needed.

praburaj · Post by **praburaj** » Fri Jul 24, 2009 1:12 am

"
You don't specify what needs to happen if all the fields are the same except for date, and two records (or more) have the date populated with different values."

Yeah.. you are right. If all the fileds are same except for date then those two records need to be populated.

Source

Id seq type date
101 564 'A' 29/06/2007
101 564 'A' null
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008

I want the output as

Id Seq type date
101 564 'A' 29/06/2007
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008

This is what I want.
Plz give some idea. Any help is appreciated.

jcthornton · Post by **jcthornton** » Fri Jul 24, 2009 10:24 am

That does complicate it a little bit.

Option 1:
Input --- Filter === (2 outputs A,B)

Filter output A (Date <> NULL) --- Copy === (Funnel/Lookup)
Filter output B (Date = NULL) --- Dedupe (on all fields) --- Lookup

Lookup outputs === Stream output to copy stage, Reject to Funnel

Funnel --- Output

In English, it would be:

Split the input based on if the Date is populated or not.

If the date is populated, copy it to both a funnel and as a reference link to a lookup.

If the date is not populated, remove any duplicates and feed it into the lookup.

From the lookup, any records that match (on all fields except date) can be passed through to a copy stage with no further processing. Records that do not match can be rejected. Those rejected records move on for further processing.

Further processing consists of the funnel (from the date found copy stage) joining the 'rejected' NULL date records with the good records that have the date populated. The output from this funnel continues on for any additional processing required.

------------------------------------------------------------------------------------
Option 2

Input ----- Sort ----- Transform ---- Output

Sort everything by all [key] fields including Date. Ensure that the sort options are set to put NULLs last.

Feed the output of the sort into a transformer. Use stage variables to track when the composite key changes (all fields from the sort except Date). In your constraints on the output, check if the key is new or if the Date is populated.

I like the second way better, but I had already written up the first way when the second method came to mind.

reddysrc · Post by **reddysrc** » Fri Jul 24, 2009 10:28 am

use the sort stage with the 3 key cols. allow duplicates "False".

if should work.

Thx
Ram

jcthornton · Post by **jcthornton** » Fri Jul 24, 2009 10:47 am

Ram - that does not quite meet the requirement.

The requirement is that if the Date is populated, the record is kept. If the date is not populated (Date = NULL) the record is only kept if there is not another record with the same keys.

Or, to phrase it another way - delete a record with NULL date if there is another record with the same keys.

Just putting in a Remove Duplicates stage will not give what is desired (look at the first 2 output records):

I want the output as

Id Seq type date
101 564 'A' 29/06/2007
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008

ShaneMuir · Post by **ShaneMuir** » Mon Jul 27, 2009 4:51 am

You could also use the key change column functionality of the sort stage.

On a sort stage, on the input link set the key columns to be ID, Seq, Type and set the Create Key Change Column to yes. On the sorting keys tab add all the columns and set the sort type to previously sorted. Sort the date column with nulls last.

If the output keyChange column is 1 then you want to keep it. If the KeyChange column =0 then you would drop records which have a null value in the date column. (This check could be done in a filter or transform stage)

DSXchange

Issue on duplicate records

Issue on duplicate records

Re: Issue on duplicate records