
Need to retain unique records

Posted: Wed Aug 31, 2011 10:36 am
by pdntsap
Hello,

I have a requirement where I need to retain unique records (records that are present only once in the source) and discard all records that appear two or more times in the source. If a record appears more than once, I need to discard all occurrences of it, not just remove the duplicates of that particular record. Each record has n columns, and I need to group the input by m of those columns and, based on the grouped columns, retain only the records that appear once in the output.
Any suggestions will be highly appreciated.

Thanks.

Posted: Wed Aug 31, 2011 12:36 pm
by DSguru2B
Apply a group by with a count and keep only the keys whose count = 1. Use that as a lookup against your source, passing only the matched records.
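For illustration, the same group/count/lookup logic outside DataStage, as a minimal Python sketch (the file name, column names and the choice of grouping keys here are invented, not from the original post):

    import csv
    from collections import Counter

    def group_key(row):
        # Hypothetical grouping columns - substitute your m key columns.
        return (row["cust_id"], row["order_date"])

    with open("source.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Group by the key columns and count occurrences of each key.
    counts = Counter(group_key(r) for r in rows)

    # Keep only the keys that occur exactly once.
    unique_keys = {k for k, c in counts.items() if c == 1}

    # "Lookup" back against the source, passing only matched records.
    unique_rows = [r for r in rows if group_key(r) in unique_keys]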

Posted: Wed Aug 31, 2011 1:01 pm
by pdntsap
Thanks DSGuru.

I followed Shrikant's suggestion from the link below

viewtopic.php?t=106508

I am able to get the correct record count of the unique records after the filter stage, but the aggregator stage drops some of the non-unique records. Any suggestions on why the aggregator stage drops some non-unique records?

Thanks.

Posted: Wed Aug 31, 2011 1:09 pm
by DSguru2B
The complete row is lost???
That is a problem. Look at your partitioning method. If the row is there with count > 1 and is getting filtered out, that's OK.

Posted: Wed Aug 31, 2011 1:30 pm
by chulett
Nulls, perhaps?

Posted: Wed Aug 31, 2011 1:44 pm
by sandeepgs
Hi,

As per the requirement mentioned, an apt design would be as below.

File------Aggregator----Transformer-----File

1. Before the aggregator, the data should be sorted and partitioned on the key columns that are used as grouping keys.

Note: Hash partitioning would be better, as we have a mixture of different columns with different metadata.

2. The columns used in the group by should not be nullable, i.e. nullable columns should not be used in the group by condition.

If the above approach is followed you should reach the expected result (a rough sketch of the partition-and-count idea in step 1 follows below).
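As a rough, non-DataStage sketch of why step 1 matters: if the data is partitioned by the grouping key, every row with the same key lands in the same partition, so per-partition counts stay correct (the key columns and partition count below are illustrative only):

    from collections import Counter

    rows = [
        {"key1": "A", "key2": 1, "val": 10},
        {"key1": "A", "key2": 1, "val": 20},
        {"key1": "B", "key2": 2, "val": 30},
    ]
    NUM_PARTITIONS = 4

    # Hash-partition on the grouping key so rows sharing a key stay together.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for r in rows:
        partitions[hash((r["key1"], r["key2"])) % NUM_PARTITIONS].append(r)

    # Per-partition counts can be merged because no key is split across
    # partitions, which is exactly what key-based partitioning buys you.
    total = Counter()
    for part in partitions:
        total.update((r["key1"], r["key2"]) for r in part)

    print([k for k, c in total.items() if c == 1])   # [('B', 2)]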

Thanks
Sandeepgs

Posted: Wed Aug 31, 2011 2:16 pm
by pdntsap
Yes DSGuru. The complete row is lost.

And like Craig and Sandeepgs suggest, nulls might be the reason. The grouping is done on 11 keys in the aggregator stage, and one of the keys is nullable. So does this mean that the record that has a null key column is being dropped?

Thanks.

Posted: Wed Aug 31, 2011 2:21 pm
by sandeepgs
Hi

Technically speaking, you cannot perform any type of logical or functional calculation on null values, other than checking whether the value is NULL or not.

So, since your column contains a null value, the group by function does not know how to treat the data, and the record is dropped.

If you really want to test this, replace the null with a value and test; the record should then be considered for the load.
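Whether a null grouping key makes the aggregator drop the row depends on the stage and its settings, but the test described above can be sketched in plain Python like this (the sentinel value and column names are illustrative):

    from collections import Counter

    rows = [
        {"key": None, "val": 1},   # row with a null grouping key
        {"key": "A",  "val": 2},
        {"key": "A",  "val": 3},
    ]

    # Replace the null with a sentinel before grouping so the row is not lost;
    # the sentinel must be a value that cannot occur in the real data.
    SENTINEL = "__NULL__"
    counts = Counter(r["key"] if r["key"] is not None else SENTINEL for r in rows)

    print(counts)   # Counter({'A': 2, '__NULL__': 1}) - the null-key row is kept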

Please mark the topic as resolved once the tests are complete.

Thanks,
Sandeepgs

Posted: Thu Sep 01, 2011 7:06 am
by pdntsap
Thanks Sandeep. But one of the key columns is allowed to be null according to the business logic, so I cannot replace null with a value.

The aggregator drops some records, and I then use a filter stage followed by a join stage to get the non-key columns back. After the join stage, I get the record count that the aggregator stage should have produced if it had not dropped the records containing null key values.
In other words, the aggregator outputs 5 records and drops 5 records, but after the join stage the record count is 10. The join uses a right outer join, with the right link containing the five records from the filter stage and the left link containing thousands of records. Am I right in saying that the null key values in the join stage lead to duplicate records being output?
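How the Join stage matches null keys is implementation specific, so this may or may not be the exact cause here, but one generic way a right outer join inflates the count is fan-out: each right row is repeated once per matching left row. A toy sketch (keys and columns invented):

    left = [
        {"key": "A", "extra": 1},
        {"key": "A", "extra": 2},   # the same key occurs twice on the left link
        {"key": "B", "extra": 3},
    ]
    right = [{"key": "A"}, {"key": "B"}]   # e.g. the filtered "unique" keys

    # Right outer join: keep every right row, repeated once per left match.
    joined = []
    for r in right:
        matches = [l for l in left if l["key"] == r["key"]]
        if matches:
            joined.extend({**m, **r} for m in matches)
        else:
            joined.append(dict(r))   # unmatched right row, no left columns

    print(len(joined))   # 3, not 2 - the repeated left key fans the count out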

Thanks.

Posted: Thu Sep 01, 2011 7:22 am
by DSguru2B
You don't have to replace the null permanently. Just replace it before you feed the data to the sort and aggregator stages, and then change it back afterwards.
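A minimal sketch of that null round-trip (the sentinel and column name are illustrative; anything that cannot occur in the real data will do):

    SENTINEL = "__NULL__"

    def mask_null(row, col):
        # Replace a null key with the sentinel before the sort/aggregate step.
        row = dict(row)
        row[col] = SENTINEL if row[col] is None else row[col]
        return row

    def unmask_null(row, col):
        # Change the sentinel back to null once the aggregation is done.
        row = dict(row)
        row[col] = None if row[col] == SENTINEL else row[col]
        return row

    rows = [{"key": None, "val": 1}, {"key": "A", "val": 2}]

    masked = [mask_null(r, "key") for r in rows]
    # ... sort / aggregate / filter on "key" here, with no nulls in play ...
    restored = [unmask_null(r, "key") for r in masked]

    assert restored == rows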