Need to retain unique records
Hello,
I have a requirement where I need to retain unique records (records that are present only once in the source) and discard all records that appear two or more times at the source. If a record appears more than once, I need to discard all copies of it, not just remove the duplicates of that particular record. Each record has n columns; I need to group the input by m of those columns and, based on the grouped columns, retain only the records that appear exactly once in the output.
Any suggestions will be highly appreciated.
Thanks.
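For anyone wanting to prototype the logic outside DataStage, here is a minimal Python sketch of the requirement. The sample rows and the choice of the first two columns as the grouping key are illustrative assumptions only:

```python
from collections import Counter

# Sample rows: each record has n columns; the first m (here m = 2) form the grouping key.
rows = [
    ("A", 1, "keep-me"),
    ("B", 2, "drop-me"),
    ("B", 2, "drop-me-too"),
    ("C", 3, "keep-me-as-well"),
]

GROUP_COLS = 2  # m: number of leading columns that form the grouping key

# Count how many times each grouping key appears in the source.
counts = Counter(row[:GROUP_COLS] for row in rows)

# Retain only records whose key occurs exactly once; every record in a
# duplicated group is discarded, not just the extra copies.
unique_rows = [row for row in rows if counts[row[:GROUP_COLS]] == 1]

print(unique_rows)
```

This mirrors the aggregate-count-then-filter pattern discussed below: count per group first, then keep only the count-of-one groups.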
Thanks DSGuru.
I followed Shrikant's suggestion from the link below
viewtopic.php?t=106508
I am able to get the correct record count of the unique records after the Filter stage, but the Aggregator stage drops some of the non-unique records. Any suggestions on why the Aggregator stage drops some non-unique records?
Thanks.
Hi,
As per the requirement that was mentioned, an apt design would be as below.
File------Aggregator----Transformer-----File
1. Before the Aggregator, the data should be sorted and partitioned by the key columns that are used as grouping keys.
Note: Hash partitioning would be better, as we have a mixture of different columns with different metadata.
2. The columns used in the group by should not be NULLABLE, i.e. nullable columns should not be used in the group by condition.
If the above approach is followed you should get the expected result.
Thanks
Sandeepgs
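To make the sort-then-aggregate design concrete, here is a small Python sketch of the same flow (not DataStage code; the single-column key and sample values are assumptions). The sort step stands in for the sort/hash-partition before the Aggregator, and the count-of-one filter stands in for the Transformer:

```python
from itertools import groupby
from operator import itemgetter

rows = [("B", "x"), ("A", "y"), ("B", "z"), ("C", "w")]

# Step 1: sort on the grouping key (in a parallel job, also hash-partition on it),
# since groupby only groups consecutive rows.
rows.sort(key=itemgetter(0))

# Step 2: "Aggregator" — count rows per key; keep only groups of size 1
# (the Transformer/Filter step in the job design).
result = []
for key, group in groupby(rows, key=itemgetter(0)):
    members = list(group)
    if len(members) == 1:
        result.extend(members)

print(result)  # -> [('A', 'y'), ('C', 'w')]
```

Note that the sort is essential: `groupby` only merges adjacent rows, just as the Aggregator in sort mode expects key-sorted input.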
Hi
Technically speaking, you cannot perform any type of logical or functional calculation on null values, except to check whether a value is NULL or not.
So when your column contains a null value, the group-by function does not know how to treat the data, and the record is dropped.
If you really want to test this, replace the null with a value and re-test; the record should then be considered for load.
Please mark the topic as resolved once the tests were complete.
Thanks,
Sandeepgs
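One common workaround for the null-key drop is to substitute a sentinel value for null before grouping, so those rows survive the aggregation. A minimal Python sketch of that idea (the sentinel string and sample data are assumptions; pick a sentinel that cannot occur in real data):

```python
from collections import Counter

# One grouping-key column that may be null (None), plus a payload column.
rows = [
    ("A", 1),
    (None, 2),
    ("B", 3),
    ("B", 4),
]

SENTINEL = "__NULL__"  # hypothetical placeholder that cannot clash with real keys

# Substitute the sentinel for null keys so those rows survive the grouping step.
keyed = [((SENTINEL if key is None else key), val) for key, val in rows]

counts = Counter(key for key, _ in keyed)
survivors = [(key, val) for key, val in keyed if counts[key] == 1]
print(survivors)
```

After the count-of-one filter, the sentinel can be mapped back to null in a downstream Transformer if the target needs the original null.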
Thanks Sandeep. But one of the key columns can be nullable according to the business logic, so I cannot replace null with a value.
The Aggregator drops some records, so I then use a Filter stage followed by a Join stage to get the non-key columns back. After the Join stage, I get the record count that the Aggregator would have produced if it had not dropped the records containing null key values.
In other words, the Aggregator outputs 5 records and drops 5 records, but after the Join stage the record count is 10. The join is a right outer join: the right link carries the five records from the Filter stage and the left link carries thousands of records. Am I right in saying that the null key values in the Join stage lead to duplicate records being output?
Thanks.
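That inflation is exactly what happens when null keys are allowed to match each other in a join: every null-key row on one link pairs with every null-key row on the other, producing a small cross product. A toy Python sketch of a right outer join with that matching rule (the sample rows are assumptions):

```python
# Right outer join where null (None) keys are allowed to match each other.
left = [(None, "l1"), (None, "l2"), ("K", "l3")]
right = [(None, "r1"), ("K", "r2")]

result = []
for rkey, rval in right:
    # In Python, None == None is True, so null keys match each other here.
    matches = [(lkey, lval) for lkey, lval in left if lkey == rkey]
    if matches:
        for lkey, lval in matches:
            result.append((rkey, lval, rval))
    else:
        # Right outer join: keep unmatched right rows with a null left side.
        result.append((rkey, None, rval))

print(result)
```

The single null-key right row produces two output rows (one per null-key left row), which is how a 5-record right link can yield more than 5 joined records.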