Need to retain unique records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Need to retain unique records

Post by pdntsap »

Hello,

I have a requirement where I need to retain unique records (records that are present only once in the source) and discard all records that appear two or more times in the source. If a record appears more than once, I need to discard every occurrence of it, not just remove the duplicates of that particular record. Each record has n columns, and I need to group the input by m of those columns and, based on the grouped columns, retain only the records that appear once in the output.
Any suggestions will be highly appreciated.

Thanks.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Apply a group by with a count and keep only the keys whose count = 1. Use that as a lookup against your source, passing through only the matched records.
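Outside of DataStage, the same group/count/lookup idea looks roughly like this. This is only a pandas sketch with made-up column names (key1 and key2 stand in for your m grouping columns), not your actual job.

Code:

# Sketch only: count rows per key, keep the keys that occur once,
# then join back to the source so only those rows survive.
import pandas as pd

src = pd.DataFrame({
    "key1": [1, 1, 2, 3],
    "key2": ["A", "A", "B", "C"],
    "other": [10, 20, 30, 40],
})
group_cols = ["key1", "key2"]

# "Aggregator": count rows per key combination (dropna=False keeps null keys in the count).
counts = src.groupby(group_cols, dropna=False).size().reset_index(name="cnt")

# "Filter": keep only the key combinations that occur exactly once.
singles = counts[counts["cnt"] == 1][group_cols]

# "Lookup": inner join back to the source on the keys.
result = src.merge(singles, on=group_cols, how="inner")
print(result)   # only the (2, "B") and (3, "C") rows remain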
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

Thanks DSGuru.

I followed Shrikant's suggestion from the link below

viewtopic.php?t=106508

I am able to get the correct record count of the unique records after the Filter stage, but the Aggregator stage drops some of the non-unique records. Any suggestions on why the Aggregator stage drops some non-unique records?

Thanks.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

The complete row is lost???
That is a problem. Look at your partitioning method. If the row is there with a count > 1 and is getting filtered out, that's OK.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Nulls, perhaps?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sandeepgs
Participant
Posts: 87
Joined: Wed Jul 02, 2008 12:22 am

Post by sandeepgs »

Hi,

As per the requirement that was mentioned, the apt design would be as below:

File ----- Aggregator ----- Transformer ----- File

1. Before the Aggregator, the data should be sorted and partitioned on the key columns that are used as the grouping keys (see the sketch after this list).

Note: Hash partitioning would be better here, as we have a mixture of different columns with different metadata.

2. The columns used in the group by should not be nullable, i.e. nullable columns should not be used in the group by condition.

If the above approach is followed, you will get the expected result.
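To illustrate the partitioning point in item 1: this is plain Python, not DataStage, and the rows and column positions are invented. Hash partitioning on the grouping keys guarantees that all rows with the same key land on the same node, so each node's counts are complete.

Code:

# Toy illustration of hash partitioning on the grouping keys.
rows = [
    ("A", 1, "x"), ("A", 1, "y"),   # duplicate key -- both rows must reach the same node
    ("B", 2, "z"),
]
group_cols = (0, 1)        # positions of the grouping columns (made up)
num_partitions = 2

partitions = {p: [] for p in range(num_partitions)}
for row in rows:
    key = tuple(row[i] for i in group_cols)
    partitions[hash(key) % num_partitions].append(row)

# Each "node" counts its own rows per key. Because no key is split across
# partitions, the per-partition counts equal the true counts.
for p, part in partitions.items():
    counts = {}
    for row in part:
        key = tuple(row[i] for i in group_cols)
        counts[key] = counts.get(key, 0) + 1
    print(p, counts)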

Thanks
Sandeepgs
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

Yes DSGuru. The complete row is lost.

And like Craig and Sandeepgs suggest, nulls might be the reason. The grouping is done on 11 keys in the aggregator stage, and one of the keys is nullable. So does this mean that a record with a null key column is being dropped?

Thanks.
sandeepgs
Participant
Posts: 87
Joined: Wed Jul 02, 2008 12:22 am

Post by sandeepgs »

Hi,

Technically speaking, you cannot perform any kind of logical or functional calculation on null values, other than checking whether a value is NULL or not.

So, because your column contains a null value, the group by does not know how to treat that row and the record is dropped.

If you really want to test this, replace the null with a value and rerun; the record should then be considered for the load.
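Just to illustrate the symptom outside of DataStage, here is a pandas sketch ("<NULL>" is a made-up placeholder): pandas' groupby also drops rows whose group key is null by default, and substituting a placeholder keeps them in the count.

Code:

import pandas as pd

df = pd.DataFrame({"key": ["A", "B", None], "val": [1, 2, 3]})

# The row with the null key silently disappears from the grouping.
print(df.groupby("key").size())

# Substitute a placeholder for the null and all three rows are counted.
print(df.fillna({"key": "<NULL>"}).groupby("key").size())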

Please mark the topic as resolved once your tests are complete.

Thanks,
Sandeepgs
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

Thanks, Sandeep. But one of the key columns can be nullable according to the business logic, so I cannot replace the null with a value.

The aggregator drops some records, and I then use a Filter stage followed by a Join stage to get the non-key columns back. After the Join stage, I get the record count that the Aggregator stage should have produced if it had not dropped the records containing null key values.
In other words, the aggregator outputs 5 records and drops 5 records, but after the Join stage the record count is 10. The join is a right outer join: the right link contains the five records from the Filter stage and the left link has thousands of records. Am I right in saying that the null key values in the Join stage lead to duplicate records being output?
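For what it is worth, here is one behaviour that could explain the doubling, sketched in pandas with invented data (not my actual job): any key on the right link that matches more than one row on the left produces one output row per match, so a right outer join can return more rows than the right link has. Whether null keys match each other in the Join stage is something I still need to verify.

Code:

import pandas as pd

left = pd.DataFrame({"key": ["A", "A", "B"], "payload": [1, 2, 3]})   # source link
right = pd.DataFrame({"key": ["A", "B"]})                             # keys kept by the filter

# Right outer join: a right-link key that matches two left-link rows
# contributes two output rows, so the output can exceed the right-link count.
out = left.merge(right, on="key", how="right")
print(len(out))   # 3, not 2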

Thanks.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

You don't have to replace the null permanently. Just replace it before you feed the data to the Sort and Aggregator stages, and then change it back afterwards.
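The same round trip in a pandas sketch; the "__NULL__" token is made up and must not occur in the real data.

Code:

import pandas as pd

df = pd.DataFrame({"key1": ["A", None, "B", "B"],
                   "key2": [1, 2, 3, 3],
                   "val":  [10, 20, 30, 40]})
group_cols = ["key1", "key2"]
token = "__NULL__"

work = df.fillna({"key1": token})                         # swap the null out going in
counts = work.groupby(group_cols).size().reset_index(name="cnt")
singles = counts[counts["cnt"] == 1][group_cols]
result = work.merge(singles, on=group_cols, how="inner")
result.loc[result["key1"] == token, "key1"] = None        # swap the null back afterwards
print(result)   # the "A" row and the null-key row survive; the duplicated "B" rows do not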
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.