Remove Duplicates

Posted: Tue Jul 19, 2005 11:10 am
by gsherry1
Is there any reason why the Remove Duplicates stage doesn't have an option to send the duplicated (removed) records to a separate output/reject link? It would definitely make the stage more useful.

Posted: Tue Jul 19, 2005 1:40 pm
by bcarlson
You should post this to the Enhancement Wishlist forum. That would be a great option to have.

We ran into this situation as well, and it was kind of a pain to capture the duplicates. We ended up adding a separate surrogate key to each record, then made a copy of the input data (pre-dedup) and deduped the copy on the main key. So now you have a complete set and a deduped set. Join the two together on the main key AND the surrogate key and you can identify which records were kept and which were removed: a record in the complete set with a match in the deduped set was kept, and one without a match was removed.
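For readers who want to see the logic outside of DataStage, here is a minimal Python sketch of the steps above. It is only an illustration, not a PX job: the function name `split_kept_and_removed`, the use of the row position as the surrogate key, and the "keep the first row per key" rule are all assumptions made for the example.

```python
def split_kept_and_removed(records, key):
    """Split records into (kept, removed) using a surrogate key.

    Assumptions for this sketch: the surrogate key is the row position,
    and the dedup keeps the first row seen for each main key.
    """
    # Step 1: tag every input row with a surrogate key.
    tagged = list(enumerate(records))

    # Step 2: dedupe a copy on the main key, keeping the first row per key.
    first_sk_by_key = {}
    for sk, rec in tagged:
        first_sk_by_key.setdefault(key(rec), sk)
    deduped = set(first_sk_by_key.items())  # pairs of (main key, surrogate key)

    # Step 3: join the complete set to the deduped set on BOTH keys.
    kept, removed = [], []
    for sk, rec in tagged:
        if (key(rec), sk) in deduped:
            kept.append(rec)      # matched: this row survived the dedup
        else:
            removed.append(rec)   # no match: this row was dropped
    return kept, removed


rows = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5)]
kept, removed = split_kept_and_removed(rows, key=lambda r: r[0])
print(kept)
print(removed)
```

The surrogate key is what makes the join unambiguous: joining on the main key alone would match every duplicate to the surviving row.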

Not the simplest of procedures, but it does work. I agree, a reject link would be more appropriate.

That's my 2 cents...

Brad.

Posted: Tue Jul 19, 2005 3:50 pm
by ukyrvd
Hi Brad,
I am following exactly the same approach.

Any comments on how it compares with the Transformer (TX) stage variable approach for sorted input (determining duplicates by comparing each record with the previous record)? I started working on EE very recently, and when I ran into this problem I searched this forum. But, I don't know why, none of the gurus suggested this method.
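To make the previous-record comparison concrete, here is a rough Python sketch of the idea. It is not Transformer syntax; the name `flag_duplicates_sorted` is made up for the example, and it assumes the input is already sorted (or at least grouped) on the key, with `prev_key` playing the role of the stage variable.

```python
_NO_PREVIOUS = object()  # sentinel: no record has been seen yet


def flag_duplicates_sorted(records, key):
    """Yield (record, is_duplicate) pairs for input sorted on the key.

    A record is flagged as a duplicate when its key equals the key of the
    record immediately before it, so the first row of each key is kept.
    """
    prev_key = _NO_PREVIOUS  # stands in for the stage variable
    for rec in records:
        cur_key = key(rec)
        is_dup = prev_key is not _NO_PREVIOUS and cur_key == prev_key
        yield rec, is_dup
        prev_key = cur_key  # remember this key for the next row


rows = [("A", 1), ("A", 3), ("B", 2), ("B", 5), ("C", 4)]
for rec, is_dup in flag_duplicates_sorted(rows, key=lambda r: r[0]):
    print(rec, "duplicate" if is_dup else "kept")
```

Because each row is only compared with its predecessor, this needs a single pass and no join, but it gives wrong answers if rows with the same key are not adjacent.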

Posted: Tue Jul 19, 2005 4:56 pm
by bcarlson
I have not used that approach (stage variables). Our project avoids using Transformers because they are not as fast as buildops, at least for high volume processing. However, in this case it may be applicable if it can do it faster.

Maybe someone else out there has tried both of these options and has compared performance? If so, let us know.

Posted: Tue Jul 19, 2005 9:46 pm
by jack_dcy
bcarlson wrote:I have not used that approach (stage variables). Our project avoids using Transformers because they are not as fast as buildops, at least for high volume processing. However, in this case it may be applicable if it can do it faster.

Maybe someone else out there has tried both of these options and has compared performance? If so, let us know.

Hi bcarlson,

Which type of buildop do you use: Custom, Build or Wrapped?

Posted: Wed Jul 20, 2005 8:25 am
by bcarlson
jack_dcy wrote:Hi bcarlson,

Which type of buildop do you use: Custom, Build or Wrapped?

We use Build - you define an input and an output schema, write some code in the middle, and let PX put it all together. It works great, and that way we can put the vast majority of our business rules and ETL in one stage rather than spreading them over many stages or even multiple jobs.

Slightly different approach

Posted: Wed Nov 16, 2005 10:25 am
by lshort
I have used a slightly different approach.

Two Paths to a Difference stage.

Path 1: Dataset ---> RemoveDup ---> Difference

Path 2: Same starting Dataset ---> Difference

Result: Rows that were removed.

My only problem with this result is that I'd like the whole set. For any dups I want to redirect the original and its duplicates. This only gives me the duplicates.
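As an illustration of the "whole set" that is wanted here (the original plus its duplicates), a small Python sketch follows. It only shows the logic and is not a DataStage design; the function name `duplicate_groups` is hypothetical, and it assumes the data fits in memory.

```python
from collections import defaultdict


def duplicate_groups(records, key):
    """Return every record whose key occurs more than once, in input order.

    Unlike the Difference result, this includes the row a dedup would keep
    as well as the rows it would remove.
    """
    # First pass: collect the rows that share each key.
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    # Second pass: keep only rows that belong to a group of two or more.
    return [rec for rec in records if len(groups[key(rec)]) > 1]


rows = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5)]
print(duplicate_groups(rows, key=lambda r: r[0]))
```

The same effect could presumably be had in a job by counting rows per key and joining the counts back to the data, but that design is a guess and has not been tested here.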

Posted: Thu Nov 17, 2005 3:31 am
by kumar_s
Hi,
We too had the same circumstance, for testing purposes.
Fortunately I had the output in a sequential file.
I made use of the "diff" command in Unix, "grep"ed for < (or >) and redirected the result to a file.
This was quite simple and fast.
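For anyone without a Unix shell handy, roughly the same check can be sketched in Python with the standard `difflib` module. This is a stand-in for the diff-and-grep idea above, not the exact commands used; the function name `lines_only_in_first` and the sample rows are made up for the example.

```python
import difflib


def lines_only_in_first(before_lines, after_lines):
    """Return the lines that appear in before_lines but were dropped from
    after_lines, similar to keeping only the "<" lines of a diff."""
    # ndiff prefixes lines unique to the first sequence with "- ".
    diff = difflib.ndiff(before_lines, after_lines)
    return [line[2:] for line in diff if line.startswith("- ")]


before = ["1,a", "1,b", "2,c", "3,d", "3,e"]  # rows before Remove Duplicates
after = ["1,a", "2,c", "3,d"]                 # rows after Remove Duplicates
print(lines_only_in_first(before, after))
```

For real files, the two lists could be read with `open(path).read().splitlines()`; on large files the Unix diff is likely to be faster.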

-Kumar