Remove Duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

gsherry1
Charter Member
Posts: 173
Joined: Fri Jun 17, 2005 8:31 am
Location: Canada

Remove Duplicates

Post by gsherry1 »

Is there any reason why the Remove Duplicates stage doesn't have an option to output the duplicate (removed) records on a separate output/reject link? It would definitely improve the usefulness of the component.
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

You should post this to the Enhancement Wishlist forum. That would be a great option to have.

We ran into this situation as well, and it was kind of a pain to capture the duplicates. We ended up adding a separate surrogate key to each record, then made a copy of the input data (pre-dedup) and deduped the other copy on the main key. So now you have a complete set and a deduped set. Join the two together on the main key AND the surrogate key, and you can identify which records were kept and which were removed: records in the complete set that find a match in the deduped set were kept, and those with no match were removed.
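The steps above can be sketched outside DataStage. This is a minimal Python illustration of the idea, not the actual job design: the function name `split_kept_and_removed`, the dict-based records, and the "keep the first record per key" rule are all assumptions made for the example.

```python
def split_kept_and_removed(records, key):
    """Split records into (kept, removed) using a surrogate key.

    Assumption: the dedup step keeps the first record seen for each main key.
    """
    # Step 1: add a surrogate key (here, simply the row position).
    tagged = list(enumerate(records))

    # Step 2: dedup a copy on the main key, remembering which surrogate key survived.
    survivor = {}
    for sk, rec in tagged:
        survivor.setdefault(rec[key], sk)

    # Step 3: "join" the complete set to the deduped set on main key AND surrogate key.
    kept_pairs = {(main_key, sk) for main_key, sk in survivor.items()}
    kept, removed = [], []
    for sk, rec in tagged:
        if (rec[key], sk) in kept_pairs:
            kept.append(rec)      # matched: this is the record the dedup kept
        else:
            removed.append(rec)   # no match: this record was dropped as a duplicate
    return kept, removed


rows = [
    {"id": 1, "val": "a"},
    {"id": 1, "val": "b"},
    {"id": 2, "val": "c"},
]
kept, removed = split_kept_and_removed(rows, "id")
```

The surrogate key is what makes the join unambiguous: two records with the same main key are otherwise indistinguishable to the join.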

Not the simplest of procedures, but it does work. I agree, a reject link would be more appropriate.

That's my 2 cents...

Brad.
ukyrvd
Premium Member
Posts: 73
Joined: Thu Feb 10, 2005 10:59 am

Post by ukyrvd »

Hi Brad,
I am following exactly the same approach.

Any comments on how it compares with the Transformer stage variable approach for sorted input (determining duplicates by comparing each record with the previous one)? I started working on EE very recently, and when I ran into this problem I searched this forum. But, I don't know why, none of the gurus suggested this method.
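For readers unfamiliar with the previous-record comparison, here is a minimal Python sketch of the logic. It is not Transformer derivation code; the function name `flag_duplicates_sorted` and the dict-based records are assumptions for illustration, and `prev_key` plays the role a stage variable would play.

```python
_NO_PREVIOUS = object()  # sentinel meaning "no record seen yet"


def flag_duplicates_sorted(records, key):
    """Return (kept, duplicates) by comparing each record's key with the previous one.

    The input is sorted on the key first; Python's sort is stable, so the
    first record of each key group in the original order is the one kept.
    """
    prev_key = _NO_PREVIOUS
    kept, duplicates = [], []
    for rec in sorted(records, key=lambda r: r[key]):
        if prev_key is not _NO_PREVIOUS and rec[key] == prev_key:
            duplicates.append(rec)  # same key as the previous record: a duplicate
        else:
            kept.append(rec)        # first record of a new key group
        prev_key = rec[key]         # remember the key for the next comparison
    return kept, duplicates


rows = [
    {"id": 2, "val": "c"},
    {"id": 1, "val": "a"},
    {"id": 1, "val": "b"},
]
kept, duplicates = flag_duplicates_sorted(rows, "id")
```

This makes a single pass over sorted data and needs no join, which is why it is worth comparing against the copy-and-join approach.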
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

I have not used that approach (stage variables). Our project avoids using Transformers because they are not as fast as buildops, at least for high volume processing. However, in this case it may be applicable if it can do it faster.

Maybe someone else out there has tried both of these options and has compared performance? If so, let us know.
jack_dcy
Participant
Posts: 18
Joined: Wed Jun 29, 2005 9:53 pm

Post by jack_dcy »

bcarlson wrote: I have not used that approach (stage variables). Our project avoids using Transformers because they are not as fast as buildops, at least for high volume processing. However, in this case it may be applicable if it can do it faster.

Maybe someone else out there has tried both of these options and has compared performance? If so, let us know.

Hi bcarlson,

Which type of buildop do you use: Custom, Build, or Wrapped?
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

jack_dcy wrote: Hi bcarlson,

Which type of buildop do you use: Custom, Build, or Wrapped?

We use Build - you build an input and output schema, with some code in the middle, and let PX put it all together. Works great, and that way we can put the vast majority of our business rules and ETL in one stage rather than spread out over many stages or even multiple jobs.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Slightly different approach

Post by lshort »

I have used a slightly different approach.

Two Paths to a Difference stage.

Path 1: Dataset ---> RemoveDup ---> Difference

Path 2: Same starting Dataset ---> Difference

Result: Rows that were removed.

My only problem with this result is that I'd like the whole set. For any dups, I want to redirect the original and its duplicates. This only gives me the duplicates.
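
To make the distinction concrete, here is a minimal Python sketch of both results. It models the comparison as a whole-row multiset difference, which is an assumption for illustration and not a description of how the Difference stage behaves internally; the function names and tuple-based rows are hypothetical.

```python
from collections import Counter


def removed_by_difference(full, deduped):
    """Rows present in the full set but not accounted for in the deduped set."""
    remaining = Counter(deduped)   # how many copies of each row survived the dedup
    removed = []
    for row in full:
        if remaining[row] > 0:
            remaining[row] -= 1    # this row is the surviving copy
        else:
            removed.append(row)    # no surviving copy left: it was removed
    return removed


def whole_duplicate_groups(full, key_index=0):
    """Every row whose key occurs more than once: the original AND its duplicates."""
    counts = Counter(row[key_index] for row in full)
    return [row for row in full if counts[row[key_index]] > 1]


full = [(1, "a"), (1, "b"), (2, "c")]
deduped = [(1, "a"), (2, "c")]

only_removed = removed_by_difference(full, deduped)
whole_groups = whole_duplicate_groups(full)
```

The second function is one way to get "the whole set": count rows per key first, then redirect every row belonging to a key with a count above one.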
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
We too had the same circumstance, for testing purposes.
Fortunately I had the output in a sequential file.
I made use of the Unix "diff" command, "grep"ed for < (or >), and redirected the result to a file.
This was quite simple and fast.
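
A rough Python equivalent of that idea, for anyone without shell access to the files. Note that this swaps in the standard library's `difflib.ndiff` for the Unix diff and grep commands, so the line markers differ ("- " instead of "<"); the function name `removed_lines` and the sample lines are assumptions for the example.

```python
import difflib


def removed_lines(before, after):
    """Lines that appear in `before` (pre-dedup) but were dropped from `after`."""
    # ndiff prefixes lines unique to the first sequence with "- ".
    return [line[2:] for line in difflib.ndiff(before, after)
            if line.startswith("- ")]


# Hypothetical file contents, one record per line, in the same order in both files.
before = ["1,a", "1,b", "2,c", "2,d", "3,e"]
after = ["1,a", "2,c", "3,e"]

dropped = removed_lines(before, after)
```

As with diff, this relies on both files holding the records in the same order; otherwise reordered lines show up as differences too.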

-Kumar