Remove Duplicates

gsherry1 · Post by **gsherry1** » Tue Jul 19, 2005 11:10 am

Any reason why remove duplicates doesn't have an option to output the duplicated (removed) records on a separate output/reject link? It would definitely improve the usefulness of the component.

bcarlson · Post by **bcarlson** » Tue Jul 19, 2005 1:40 pm

You should post this to the Enhancement Wishlist forum. That would be a great option to have.

We ran into this situation as well, and it was kind of a pain to capture the duplicates. We ended up adding a separate surrogate key to each record, then made a copy of the input data (pre-dedup) and deduped one copy on the main key. That gives you a complete set and a deduped set. Join the two on the main key AND the surrogate key, and you can identify which records were kept and which were removed.
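For illustration only, here is a minimal Python sketch of that flow outside DataStage. The function name, the `sk` surrogate-key field, and the sample `cust_id` rows are all made up for the example, and it assumes the dedup keeps the first record per key; the real job would use parallel stages rather than in-memory lists.

```python
from itertools import count

def split_kept_and_removed(records, key):
    """Tag rows with a surrogate key, dedupe on `key`, then compare both sets."""
    # Step 1: add a surrogate key so every input row is uniquely identifiable.
    seq = count(1)
    full = [dict(rec, sk=next(seq)) for rec in records]

    # Step 2: dedupe a copy on the main key (assumption: keep the first row).
    deduped = {}
    for rec in full:
        deduped.setdefault(rec[key], rec)

    # Step 3: match the full set against the deduped set on
    # (main key, surrogate key) to see which rows survived.
    kept_ids = {(rec[key], rec["sk"]) for rec in deduped.values()}
    kept = [r for r in full if (r[key], r["sk"]) in kept_ids]
    removed = [r for r in full if (r[key], r["sk"]) not in kept_ids]
    return kept, removed

# Hypothetical sample data
rows = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 1, "name": "Ann B."},
    {"cust_id": 2, "name": "Bob"},
]
kept, removed = split_kept_and_removed(rows, "cust_id")
```

With the sample rows, `kept` holds the first row for each `cust_id` and `removed` holds the second row for `cust_id` 1.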

Not the simplest of procedures, but it does work. I agree, a reject link would be more appropriate.

That's my 2 cents...

Brad.

ukyrvd · Post by **ukyrvd** » Tue Jul 19, 2005 3:50 pm

Hi Brad,
I am following exactly the same approach.

Any comments on how it compares with the Transformer (TX) stage variable approach for sorted input (detecting duplicates by comparing each record with the previous one)? I started working with EE very recently, and when I ran into this problem I searched this forum. But, I don't know why, none of the gurus suggested this method.
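As a rough sketch of the previous-record comparison, here is the idea in Python rather than Transformer stage variables. The names (`flag_duplicates`, `cust_id`) are hypothetical, and it assumes the input is already sorted on the key; in a parallel job the data would also need to be partitioned on that key so that equal keys arrive in the same partition.

```python
def flag_duplicates(sorted_rows, key):
    """Yield (row, is_duplicate) for rows already sorted on `key`."""
    prev_key = object()  # sentinel that never equals a real key value
    for row in sorted_rows:
        # A row is a duplicate when its key matches the previous row's key.
        is_dup = row[key] == prev_key
        prev_key = row[key]
        yield row, is_dup

# Hypothetical sample data, sorted on the key before the comparison
rows = sorted(
    [
        {"cust_id": 2, "name": "Bob"},
        {"cust_id": 1, "name": "Ann"},
        {"cust_id": 1, "name": "Ann B."},
    ],
    key=lambda r: r["cust_id"],
)
flagged = list(flag_duplicates(rows, "cust_id"))
kept = [row for row, dup in flagged if not dup]      # main output
rejects = [row for row, dup in flagged if dup]       # "reject link"
```

This needs only a single pass and no join, which is the main appeal of the approach.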

bcarlson · Post by **bcarlson** » Tue Jul 19, 2005 4:56 pm

I have not used that approach (stage variables). Our project avoids using Transformers because they are not as fast as buildops, at least for high volume processing. However, in this case it may be applicable if it can do it faster.

Maybe someone else out there has tried both of these options and has compared performance? If so, let us know.

jack_dcy · Post by **jack_dcy** » Tue Jul 19, 2005 9:46 pm

bcarlson wrote:I have not used that approach (stage variables). Our project avoids using Transformers because they are not as fast as buildops, at least for high volume processing. However, in this case it may be applicable if it can do it faster.

Maybe someone else out there has tried both of these options and has compared performance? If so, let us know.

Hi bcarlson,

Which type of buildop do you use: Custom, Build, or Wrapped?

bcarlson · Post by **bcarlson** » Wed Jul 20, 2005 8:25 am

jack_dcy wrote:Hi bcarlson,

Which type of buildop do you use: Custom, Build, or Wrapped?

We use Build - you build an input and output schema, with some code in the middle and let PX put it all together. Works great, and that way we can put the vast majority of our business rules and ETL in one stage rather than spread out over many stages or even multiple jobs.

lshort · Post by **lshort** » Wed Nov 16, 2005 10:25 am

I have used a slightly different approach.

Two Paths to a Difference stage.

Path 1: Dataset ---> RemoveDup ---> Difference

Path 2: Same starting Dataset ---> Difference

Result: Rows that were removed.

My only problem with this result is that I'd like the whole set. For any dups, I want to redirect the original and its duplicates. This only gives me the duplicates.
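One possible way to get the whole set, sketched in Python rather than as a DataStage design: group the rows by key and redirect every row from any group that has more than one member. The function and field names here are hypothetical, and the grouping is done in memory purely for illustration.

```python
from collections import defaultdict

def split_unique_and_duplicated(rows, key):
    """Separate rows whose key occurs once from rows whose key repeats."""
    # Collect rows per key value, preserving input order within each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Keys seen once go to the main output; for repeated keys, the
    # original and all of its duplicates are redirected together.
    unique = [g[0] for g in groups.values() if len(g) == 1]
    duplicated = [row for g in groups.values() if len(g) > 1 for row in g]
    return unique, duplicated

# Hypothetical sample data
rows = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 1, "name": "Ann B."},
    {"cust_id": 2, "name": "Bob"},
]
unique, duplicated = split_unique_and_duplicated(rows, "cust_id")
```

Here `duplicated` contains both `cust_id` 1 rows, not just the one that a dedup would have dropped.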

kumar_s · Post by **kumar_s** » Thu Nov 17, 2005 3:31 am

Hi,
We had the same situation, for testing purposes.
Fortunately I had the output in a sequential file.
I used the Unix "diff" command, "grep"ed for < (or >), and redirected the result to a file.
This was quite simple and fast.

-Kumar
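A rough Python analogue of that diff-and-grep step, using the standard `difflib` module instead of the Unix commands. It assumes the pre-dedup and post-dedup records are available as lists of lines in the same order; the function name and the sample lines are made up. Note that `difflib.ndiff` marks lines found only in the first input with `- `, where Unix `diff` would show `<`.

```python
import difflib

def removed_lines(full_lines, deduped_lines):
    """Return lines present in the full file but missing after dedup."""
    delta = difflib.ndiff(full_lines, deduped_lines)
    # Keep only lines unique to the first input, stripping the "- " marker.
    return [line[2:] for line in delta if line.startswith("- ")]

# Hypothetical file contents, one record per line
full = ["1,Ann", "1,Ann B.", "2,Bob"]
deduped = ["1,Ann", "2,Bob"]
dropped = removed_lines(full, deduped)
```

Like the command-line version, this only works cleanly when both files keep their records in the same order.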

DSXchange
