Need to retain unique records
Hello,
I have a requirement where I need to retain unique records (records that are present only once in the source) and discard all records that appear two or more times at the source. If a record appears more than once, I need to discard all copies of it, not just remove the duplicates of that particular record. Each record has n columns; I need to group the input by m of those columns and, based on the grouped columns, retain only the records that appear exactly once in the output.
Any suggestions will be highly appreciated.
Thanks.
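For anyone wanting to prototype the logic outside DataStage, here is a minimal Python sketch of the requirement. The sample rows and the choice of the first two columns as the grouping key are illustrative assumptions only:

```python
from collections import Counter

# Sample rows: each record has n columns; the first m (here m = 2) form the grouping key.
rows = [
    ("A", 1, "keep-me"),
    ("B", 2, "drop-me"),
    ("B", 2, "drop-me-too"),
    ("C", 3, "keep-me-as-well"),
]

GROUP_COLS = 2  # m: number of leading columns that form the grouping key

# Count how many times each grouping key appears in the source.
counts = Counter(row[:GROUP_COLS] for row in rows)

# Retain only records whose key occurs exactly once; every record in a
# duplicated group is discarded, not just the extra copies.
unique_rows = [row for row in rows if counts[row[:GROUP_COLS]] == 1]

print(unique_rows)
```

This mirrors the aggregate-count-then-filter pattern discussed below: count per group first, then keep only the count-of-one groups.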
Thanks DSGuru.
I followed Shrikant's suggestion from the link below
viewtopic.php?t=106508
I am able to get the correct record count of the unique records after the Filter stage, but the Aggregator stage drops some of the non-unique records. Any suggestions on why the Aggregator stage drops some non-unique records?
Thanks.
Hi,
As per the requirement that was mentioned, an apt design would be as below.
File------Aggregator----Transformer-----File
1. Before the Aggregator, the data should be sorted and partitioned by the key columns that are used as grouping keys.
Note: Hash partitioning would be better, as we have a mixture of different columns with different metadata.
2. The columns used in the group by should not be NULLABLE, i.e. nullable columns should not be used in the group by condition.
If the above approach is followed you should get the expected result.
Thanks
Sandeepgs
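To make the sort-then-aggregate design concrete, here is a small Python sketch of the same flow (not DataStage code; the single-column key and sample values are assumptions). The sort step stands in for the sort/hash-partition before the Aggregator, and the count-of-one filter stands in for the Transformer:

```python
from itertools import groupby
from operator import itemgetter

rows = [("B", "x"), ("A", "y"), ("B", "z"), ("C", "w")]

# Step 1: sort on the grouping key (in a parallel job, also hash-partition on it),
# since groupby only groups consecutive rows.
rows.sort(key=itemgetter(0))

# Step 2: "Aggregator" — count rows per key; keep only groups of size 1
# (the Transformer/Filter step in the job design).
result = []
for key, group in groupby(rows, key=itemgetter(0)):
    members = list(group)
    if len(members) == 1:
        result.extend(members)

print(result)  # -> [('A', 'y'), ('C', 'w')]
```

Note that the sort is essential: `groupby` only merges adjacent rows, just as the Aggregator in sort mode expects key-sorted input.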
Hi
Technically speaking, you cannot perform any type of logical or functional calculation on null values, except to check whether a value is NULL or not.
So when your column contains a null value, the group-by function does not know how to treat the data, and the record is dropped.
If you really want to test this, replace the null with a value and re-test; the record should then be considered for load.
Please mark the topic as resolved once the tests were complete.
Thanks,
Sandeepgs
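One common workaround for the null-key drop is to substitute a sentinel value for null before grouping, so those rows survive the aggregation. A minimal Python sketch of that idea (the sentinel string and sample data are assumptions; pick a sentinel that cannot occur in real data):

```python
from collections import Counter

# One grouping-key column that may be null (None), plus a payload column.
rows = [
    ("A", 1),
    (None, 2),
    ("B", 3),
    ("B", 4),
]

SENTINEL = "__NULL__"  # hypothetical placeholder that cannot clash with real keys

# Substitute the sentinel for null keys so those rows survive the grouping step.
keyed = [((SENTINEL if key is None else key), val) for key, val in rows]

counts = Counter(key for key, _ in keyed)
survivors = [(key, val) for key, val in keyed if counts[key] == 1]
print(survivors)
```

After the count-of-one filter, the sentinel can be mapped back to null in a downstream Transformer if the target needs the original null.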
Thanks Sandeep. But one of the key columns can be nullable according to the business logic, so I cannot replace null with a value.
The Aggregator drops some records, so I then use a Filter stage followed by a Join stage to get the non-key columns back. After the Join stage, I get the record count that the Aggregator would have produced if it had not dropped the records containing null key values.
In other words, the Aggregator outputs 5 records and drops 5 records, but after the Join stage the record count is 10. The join is a right outer join: the right link carries the five records from the Filter stage and the left link carries thousands of records. Am I right in saying that the null key values in the Join stage lead to duplicate records being output?
Thanks.
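That inflation is exactly what happens when null keys are allowed to match each other in a join: every null-key row on one link pairs with every null-key row on the other, producing a small cross product. A toy Python sketch of a right outer join with that matching rule (the sample rows are assumptions):

```python
# Right outer join where null (None) keys are allowed to match each other.
left = [(None, "l1"), (None, "l2"), ("K", "l3")]
right = [(None, "r1"), ("K", "r2")]

result = []
for rkey, rval in right:
    # In Python, None == None is True, so null keys match each other here.
    matches = [(lkey, lval) for lkey, lval in left if lkey == rkey]
    if matches:
        for lkey, lval in matches:
            result.append((rkey, lval, rval))
    else:
        # Right outer join: keep unmatched right rows with a null left side.
        result.append((rkey, None, rval))

print(result)
```

The single null-key right row produces two output rows (one per null-key left row), which is how a 5-record right link can yield more than 5 joined records.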