Hi,
I need a help in doing this logic.
I have a source table as follows.
Id seq type date
101 564 'A' 29/06/2007
101 564 'A' null
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008
I want the output as
Id Seq type date
101 564 'A' 29/06/2007
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008
If all the column values are same for the duplicate rows and the date column is not null then that row has to be populated.
If all the columns are not same and date column is null/ not null then it has to be populated.
How to do this...
Any help is appreciated..
Issue on duplicate records
Moderators: chulett, rschirm, roy
Issue on duplicate records
prabakaran.v
Normally, I would recommend the remdup stage, but I don't know how it would deal with NULLs. You tell the remdup stage to keep either the first or last occurrence of a repeated record, but can it handle NULLs? If so, does it show up as low or high when sorted with dates or other datatypes? You may want to experiment with that and see what remdup can do with NULLs.
If it can't handle nulls, there may be a way to get around it. Convert the NULLs to some non-standard date, like 1/1/2799 or 1/1/1900 and then use the remdup stage. For example, if you use 1/1/1900 as the NULL replacement, you would want to sort on id, seq, type and date in ascending order and keep the last occurence. That would keep a normal date and drop the default if both are present. Reverse the logic if you use 1/1/2799.
Once deduped, convert the default values back to NULLs.
Brad.
If it can't handle nulls, there may be a way to get around it. Convert the NULLs to some non-standard date, like 1/1/2799 or 1/1/1900 and then use the remdup stage. For example, if you use 1/1/1900 as the NULL replacement, you would want to sort on id, seq, type and date in ascending order and keep the last occurence. That would keep a normal date and drop the default if both are present. Reverse the logic if you use 1/1/2799.
Once deduped, convert the default values back to NULLs.
Brad.
It is not that I am addicted to coffee, it's just that I need it to survive.
-
- Premium Member
- Posts: 79
- Joined: Thu Mar 22, 2007 4:58 pm
- Location: USA
You don't specify what needs to happen if all the fields are the same except for date, and two records (or more) have the date populated with different values. The answer to that will significantly impact the solution.
However, this can be solved using a Sort followed by a Dedupe. The Sort can be setup to sort NULLs last or first. Then, in the dedupe stage, have it keep the opposite (first record or last record).
If you can have more than one record with the same composite key, you will need to use a transformer with constraints rather than a dedupe. But that does start to get a bit more complicated. No point working out the details of that if it isn't needed.
However, this can be solved using a Sort followed by a Dedupe. The Sort can be setup to sort NULLs last or first. Then, in the dedupe stage, have it keep the opposite (first record or last record).
If you can have more than one record with the same composite key, you will need to use a transformer with constraints rather than a dedupe. But that does start to get a bit more complicated. No point working out the details of that if it isn't needed.
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
"
You don't specify what needs to happen if all the fields are the same except for date, and two records (or more) have the date populated with different values."
Yeah.. you are right. If all the fileds are same except for date then those two records need to be populated.
Source
Id seq type date
101 564 'A' 29/06/2007
101 564 'A' null
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008
I want the output as
Id Seq type date
101 564 'A' 29/06/2007
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008
This is what I want.
Plz give some idea. Any help is appreciated.
You don't specify what needs to happen if all the fields are the same except for date, and two records (or more) have the date populated with different values."
Yeah.. you are right. If all the fileds are same except for date then those two records need to be populated.
Source
Id seq type date
101 564 'A' 29/06/2007
101 564 'A' null
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008
I want the output as
Id Seq type date
101 564 'A' 29/06/2007
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008
This is what I want.
Plz give some idea. Any help is appreciated.
prabakaran.v
-
- Premium Member
- Posts: 79
- Joined: Thu Mar 22, 2007 4:58 pm
- Location: USA
That does complicate it a little bit.
Option 1:
Input --- Filter === (2 outputs A,B)
Filter output A (Date <> NULL) --- Copy === (Funnel/Lookup)
Filter output B (Date = NULL) --- Dedupe (on all fields) --- Lookup
Lookup outputs === Stream output to copy stage, Reject to Funnel
Funnel --- Output
In English, it would be:
Split the input based on if the Date is populated or not.
If the date is populated, copy it to both a funnel and as a reference link to a lookup.
If the date is not populated, remove any duplicates and feed it into the lookup.
From the lookup, any records that match (on all fields except date) can be passed through to a copy stage with no further processing. Records that do not match can be rejected. Those rejected records move on for further processing.
Further processing consists of the funnel (from the date found copy stage) joining the 'rejected' NULL date records with the good records that have the date populated. The output from this funnel continues on for any additional processing required.
------------------------------------------------------------------------------------
Option 2
Input ----- Sort ----- Transform ---- Output
Sort everything by all [key] fields including Date. Ensure that the sort options are set to put NULLs last.
Feed the output of the sort into a transformer. Use stage variables to track when the composite key changes (all fields from the sort except Date). In your constraints on the output, check if the key is new or if the Date is populated.
I like the second way better, but I had already written up the first way when the second method came to mind.
Option 1:
Input --- Filter === (2 outputs A,B)
Filter output A (Date <> NULL) --- Copy === (Funnel/Lookup)
Filter output B (Date = NULL) --- Dedupe (on all fields) --- Lookup
Lookup outputs === Stream output to copy stage, Reject to Funnel
Funnel --- Output
In English, it would be:
Split the input based on if the Date is populated or not.
If the date is populated, copy it to both a funnel and as a reference link to a lookup.
If the date is not populated, remove any duplicates and feed it into the lookup.
From the lookup, any records that match (on all fields except date) can be passed through to a copy stage with no further processing. Records that do not match can be rejected. Those rejected records move on for further processing.
Further processing consists of the funnel (from the date found copy stage) joining the 'rejected' NULL date records with the good records that have the date populated. The output from this funnel continues on for any additional processing required.
------------------------------------------------------------------------------------
Option 2
Input ----- Sort ----- Transform ---- Output
Sort everything by all [key] fields including Date. Ensure that the sort options are set to put NULLs last.
Feed the output of the sort into a transformer. Use stage variables to track when the composite key changes (all fields from the sort except Date). In your constraints on the output, check if the key is new or if the Date is populated.
I like the second way better, but I had already written up the first way when the second method came to mind.
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
Re: Issue on duplicate records
use the sort stage with the 3 key cols. allow duplicates "False".
if should work.
Thx
Ram
if should work.
Thx
Ram
-
- Premium Member
- Posts: 79
- Joined: Thu Mar 22, 2007 4:58 pm
- Location: USA
Ram - that does not quite meet the requirement.
The requirement is that if the Date is populated, the record is kept. If the date is not populated (Date = NULL) the record is only kept if there is not another record with the same keys.
Or, to phrase it another way - delete a record with NULL date if there is another record with the same keys.
Just putting in a Remove Duplicates stage will not give what is desired (look at the first 2 output records):
The requirement is that if the Date is populated, the record is kept. If the date is not populated (Date = NULL) the record is only kept if there is not another record with the same keys.
Or, to phrase it another way - delete a record with NULL date if there is another record with the same keys.
Just putting in a Remove Duplicates stage will not give what is desired (look at the first 2 output records):
I want the output as
Id Seq type date
101 564 'A' 29/06/2007
101 564 'A' 13/09/2008
102 432 'D' null
102 784 'D' null
102 894 'C' 07/04/2008
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
You could also use the key change column functionality of the sort stage.
On a sort stage, on the input link set the key columns to be ID, Seq, Type and set the Create Key Change Column to yes. On the sorting keys tab add all the columns and set the sort type to previously sorted. Sort the date column with nulls last.
If the output keyChange column is 1 then you want to keep it. If the KeyChange column =0 then you would drop records which have a null value in the date column. (This check could be done in a filter or transform stage)
On a sort stage, on the input link set the key columns to be ID, Seq, Type and set the Create Key Change Column to yes. On the sorting keys tab add all the columns and set the sort type to previously sorted. Sort the date column with nulls last.
If the output keyChange column is 1 then you want to keep it. If the KeyChange column =0 then you would drop records which have a null value in the date column. (This check could be done in a filter or transform stage)