Remove Duplicates
Any reason why remove duplicates doesn't have an option to output the duplicated (removed) records on a separate output/reject link? It would definitely improve the usefulness of the component.
You should post this to the Enhancement Wishlist forum. That would be a great option to have.
We ran into this situation as well, and it was kind of a pain to capture the duplicates. We ended up adding a separate surrogate key to each record, then made a copy of the input data (pre-dedup) and deduped the copy on the main key. So now you have a complete set and a deduped set. Join the two together on the main key AND the surrogate key and you can identify which records were kept and which were removed.
Not the simplest of procedures, but it does work. I agree, a reject link would be more appropriate.
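For readers who want to see the logic outside of a job design, here is a minimal Python sketch of the surrogate-key approach described above. It is not DataStage code. The function name `split_kept_and_removed`, the `sk` column, and the `cust_id` sample key are all hypothetical, and it assumes the dedup keeps the first record per key.

```python
from itertools import count


def split_kept_and_removed(records, key):
    """Tag records with a surrogate key, dedup on `key`, then join back."""
    # Step 1: add a unique surrogate key ("sk", a hypothetical column name)
    # to every input record.
    surrogate = count(1)
    tagged = [dict(rec, sk=next(surrogate)) for rec in records]

    # Step 2: dedup a copy on the main key, keeping the first record per key
    # (assumption: "retain first" behaviour).
    deduped = {}
    for rec in tagged:
        deduped.setdefault(rec[key], rec)

    # Step 3: join the complete set to the deduped set on
    # (main key, surrogate key). Matches were kept; the rest were removed.
    kept_pairs = {(rec[key], rec["sk"]) for rec in deduped.values()}
    kept = [rec for rec in tagged if (rec[key], rec["sk"]) in kept_pairs]
    removed = [rec for rec in tagged if (rec[key], rec["sk"]) not in kept_pairs]
    return kept, removed


rows = [
    {"cust_id": 1, "name": "a"},
    {"cust_id": 2, "name": "b"},
    {"cust_id": 1, "name": "c"},
]
kept, removed = split_kept_and_removed(rows, "cust_id")
```

With this sample input, `removed` holds only the second `cust_id` 1 record, and `kept` holds the other two.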
That's my 2 cents...
Brad.
Hi Brad,
I am following exactly the same approach.
Any comments on how it compares with the Transformer stage variable approach for sorted input (determining duplicates by comparing each record with the previous one)? I started working on EE very recently, and when I ran into this problem I searched this forum. But I don't know why none of the gurus suggested this method.
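As a rough illustration of the previous-record comparison mentioned here, the following Python sketch mimics what the stage variables would do. It is not Transformer syntax. The name `flag_duplicates_sorted` and the sample columns are hypothetical, and it assumes the input is already sorted on the key so that duplicates are adjacent.

```python
def flag_duplicates_sorted(records, key):
    """Yield (record, is_duplicate) for input already sorted on `key`."""
    prev_key = object()  # sentinel that never equals a real key value
    for rec in records:
        # Compare with the previous key first, then remember the current one.
        # The order matters: updating prev_key first would flag every row.
        is_dup = rec[key] == prev_key
        prev_key = rec[key]
        yield rec, is_dup


rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
]
flagged = list(flag_duplicates_sorted(rows, "id"))
kept = [rec for rec, dup in flagged if not dup]
removed = [rec for rec, dup in flagged if dup]
```

The flag can then drive two output links, one for the first record of each key and one for the rest. In a parallel job this presumably also requires the data to be partitioned on the same key, so that all records for a key land in the same partition.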
I have not used that approach (stage variables). Our project avoids using Transformers because they are not as fast as buildops, at least for high volume processing. However, in this case it may be applicable if it can do it faster.
Maybe someone else out there has tried both of these options and has compared performance? If so, let us know.
Hi bcarlson,
Which type of buildop do you use: Custom, Build, or Wrapped?
We use Build: you define an input and output schema, write some code in the middle, and let PX put it all together. It works great, and that way we can put the vast majority of our business rules and ETL in one stage rather than spreading them over many stages or even multiple jobs.
Slightly different approach
I have used a slightly different approach.
Two paths to a Difference stage.
Path 1: Dataset ---> RemoveDup ---> Difference
Path 2: Same starting Dataset ---> Difference
Result: Rows that were removed.
My only problem with this result is that I'd like the whole set. For any dups I want to redirect the original and its duplicates. This only gives me the duplicates.
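The two-path idea can be sketched in plain Python as below. This is only an approximation: it assumes a whole-row comparison and a dedup that keeps the first row per key, which may not match how the Difference stage is actually configured. The name `removed_by_dedup` and the sample rows are hypothetical. Because rows are compared as whole tuples, exact copies of a surviving row would not show up as removed in this sketch.

```python
def removed_by_dedup(rows, key_index=0):
    """Return rows from the full input that did not survive the dedup."""
    # Path 1: dedup on the key, keeping the first row seen for each key.
    first_per_key = {}
    for row in rows:
        first_per_key.setdefault(row[key_index], row)
    survivors = set(first_per_key.values())

    # Path 2 minus Path 1: rows of the full input that are not survivors.
    return [row for row in rows if row not in survivors]


rows = [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (3, "e")]
removed = removed_by_dedup(rows)
```

The sample shows the limitation described above: `removed` contains `(1, "c")` and `(2, "d")`, but not the originals `(1, "a")` and `(2, "b")` that they duplicate.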
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."