Remove duplicates stage on data sorted on different keys

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Remove duplicates stage on data sorted on different keys

Post by dohertys »

I'm working on some code where I need to sort the data on one set of keys, and then remove adjacent rows where some other fields have identical values...

i.e.
For this input data...
Key1 DateKey Value
1 2007-01-04 A
1 2007-01-03 B
1 2007-01-02 B
1 2007-01-01 A


I want the following output
Key1 DateKey Value
1 2007-01-04 A
1 2007-01-02 B
1 2007-01-01 A

i.e. I've removed the 3rd record from the input set, because it's got the same Value as the row next to it (when sorted by Key1 and DateKey).
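The logic being described can be sketched in Python (an illustration of the adjacent-duplicate rule, not DataStage code; the function name is invented):

```python
def drop_adjacent_dups(rows):
    """Drop a row when its (Key1, Value) pair matches the row
    immediately before it. The rows must already be sorted on
    Key1 and DateKey for 'adjacent' to mean anything."""
    out, prev = [], None
    for row in rows:
        pair = (row["Key1"], row["Value"])
        if pair != prev:
            out.append(row)
        prev = pair
    return out

# The input data from the example, sorted ascending on Key1, DateKey.
data = [
    {"Key1": 1, "DateKey": "2007-01-01", "Value": "A"},
    {"Key1": 1, "DateKey": "2007-01-02", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-03", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-04", "Value": "A"},
]
kept = drop_adjacent_dups(data)
# 2007-01-03 is dropped; both 'A' rows survive because they are not adjacent.
```

Note that a plain Remove Duplicates on (Key1, Value) would also collapse the two 'A' rows, which is exactly why the sort order matters here.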


I've coded this by sorting the data on Key1 and DateKey, and then passing it through a Remove Duplicates stage - with the duplicate keys being Key1 and Value.

So, the keys I use in my remove duplicates stage are not the same as the sort keys.

This seems to work ok, but I get the following message


Remove_Duplicates_239: When checking operator: User inserted sort "srt_ForRecEdt" does not fulfill the sort requirements of the downstream operator "Remove_Duplicates_239"

Is there a way to stop it producing this warning? Can you set something to make the Remove Duplicates stage think it's sorted as it wants?

I have tried to code it in other ways which do not produce this warning, but the other ways seem to be slower (and more complicated).

Thanks
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Re: Remove duplicates stage on data sorted on different keys

Post by dsusr »

Can you please post the orchestrate code for this?
JeroenDmt
Premium Member
Posts: 107
Joined: Wed Oct 26, 2005 7:36 am

Post by JeroenDmt »

Why don't you turn it around?
If you remove the duplicates first and after that you sort it the way you want the result to be sorted, it shouldn't be a problem.
dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Post by dohertys »

dsusr - here's that code...
(what I refer to as Key1 in the example is actually called ARRG_ID in the code, and what I refer to as DateKey in the example is actually called REC_ACDT in the code)


I've removed the code from here - it's dull
Last edited by dohertys on Mon Nov 26, 2007 3:13 am, edited 1 time in total.
dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Post by dohertys »

JeroenDmt wrote:Why don't you turn it around?
If you remove the duplicates first and after that you sort it the way you want the result to be sorted, it shouldn't be a problem.
JeroenDmt, the problem is, I don't really want to remove all duplicate rows (based on Key1 and Value); I only want to remove a duplicate if the rows are next to each other when I sort by Key1 and DateKey.

i.e. In the example, I only want to remove the row with DateKey 2007-01-03. I do not want to remove the row with DateKey 2007-01-04 - even though it has the same Key1 and Value as the row with DateKey 2007-01-01.

Key1 DateKey Value
1 2007-01-04 A (do not get rid of this row, because it does not duplicate the data on the row next to it)
1 2007-01-03 B (get rid of this row - because it duplicates the row next to it)
1 2007-01-02 B
1 2007-01-01 A

Hope this makes it clearer.

Thanks
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

Sorry - at first I didn't understand your question and thought you were doing something else, which is why I asked for the osh code.

But for your problem you will have to apply logic in a Transformer stage after the Sort stage, to check whether the previous record has the same key and value. If it does, exclude that record; otherwise write the record to the output.
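That Transformer approach amounts to holding the previous record's key and value in stage variables and filtering on a comparison. A rough Python equivalent (the stage-variable names are invented for illustration):

```python
def transformer_filter(rows):
    """Mimics a Transformer after the sort: sv_prev_key/sv_prev_value
    stand in for stage variables holding the prior record, and the
    output constraint drops any record that repeats both of them."""
    sv_prev_key, sv_prev_value = None, None
    for row in rows:
        keep = (row["Key1"], row["Value"]) != (sv_prev_key, sv_prev_value)
        sv_prev_key, sv_prev_value = row["Key1"], row["Value"]
        if keep:
            yield row

rows = [  # already sorted on Key1, DateKey
    {"Key1": 1, "DateKey": "2007-01-01", "Value": "A"},
    {"Key1": 1, "DateKey": "2007-01-02", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-03", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-04", "Value": "A"},
]
survivors = list(transformer_filter(rows))
```

In a real parallel job this previous-record comparison only holds within a partition, so the data would need to be hash-partitioned on Key1 for it to be correct (an assumption about the job's partitioning).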

------
dsusr
landaghaar
Participant
Posts: 38
Joined: Wed Sep 19, 2007 10:11 am
Location: Canada

Post by landaghaar »

The sort that the Remove Duplicates stage needs is based on the keys used for removing duplicates, not on primary keys or business keys.

In your case, sort based on Value, not Key1, because in your example the key for removing a duplicate is Value, not Key1. Then sort it based on Key1 afterwards if you need that order.

It's the same as with joins: if you join based on a key, you have to sort based on that join key.
dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Post by dohertys »

ok - thanks.

dsusr, I'd already done what you've suggested (and a couple of other ways of doing it) - and it gives the right output without the warning - but it's quite a lot slower (job start-up time is about twice as long, and job run time is about 3 times as long), and it isn't quite as straightforward to follow on the canvas.

I'm trying to decide whether to:

1. Keep my original job and accept the warning message (bit dodgy), or
2. Use a different design, and accept a more complicated and slower job.

What do you reckon?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This one looks like a good candidate for a local message handler, to demote the warning to an informational message. It's not strictly required that the upstream sort be exactly right for the RemDups stage (as you have demonstrated); but DataStage alerts you when it isn't.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.