Remove duplicates stage on data sorted on different keys

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Remove duplicates stage on data sorted on different keys

Post by dohertys »

I'm working on some code where I need to sort the data on one set of keys, and then remove adjacent rows where some other fields have identical values...

i.e.
For this input data...
Key1 DateKey Value
1 2007-01-04 A
1 2007-01-03 B
1 2007-01-02 B
1 2007-01-01 A


I want the following output
Key1 DateKey Value
1 2007-01-04 A
1 2007-01-02 B
1 2007-01-01 A

i.e. I've removed the 3rd record from the input set, because it's got the same Value as the row next to it (when sorted by Key1 and DateKey).
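The logic being described can be sketched in Python (an illustration of the adjacent-duplicate rule, not DataStage code; the function name is invented):

```python
def drop_adjacent_dups(rows):
    """Drop a row when its (Key1, Value) pair matches the row
    immediately before it. The rows must already be sorted on
    Key1 and DateKey for 'adjacent' to mean anything."""
    out, prev = [], None
    for row in rows:
        pair = (row["Key1"], row["Value"])
        if pair != prev:
            out.append(row)
        prev = pair
    return out

# The input data from the example, sorted ascending on Key1, DateKey.
data = [
    {"Key1": 1, "DateKey": "2007-01-01", "Value": "A"},
    {"Key1": 1, "DateKey": "2007-01-02", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-03", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-04", "Value": "A"},
]
kept = drop_adjacent_dups(data)
# 2007-01-03 is dropped; both 'A' rows survive because they are not adjacent.
```

Note that a plain Remove Duplicates on (Key1, Value) would also collapse the two 'A' rows, which is exactly why the sort order matters here.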


I've coded this by sorting the data on Key1 and DateKey, and then passing it through a Remove Duplicates stage - with the duplicate keys being Key1 and Value.

So, the keys I use in my remove duplicates stage are not the same as the sort keys.

This seems to work ok, but I get the following message


Remove_Duplicates_239: When checking operator: User inserted sort "srt_ForRecEdt" does not fulfill the sort requirements of the downstream operator "Remove_Duplicates_239"

Is there a way to stop it producing this warning? Can you set something to make the Remove Duplicates stage think it's sorted as it wants?

I have tried to code it in other ways which do not produce this warning, but the other ways seem to be slower (and more complicated).

Thanks
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Re: Remove duplicates stage on data sorted on different keys

Post by dsusr »

Can you please post the orchestrate code for this?
JeroenDmt
Premium Member
Posts: 107
Joined: Wed Oct 26, 2005 7:36 am

Post by JeroenDmt »

Why don't you turn it around?
If you remove the duplicates first and after that you sort it the way you want the result to be sorted, it shouldn't be a problem.
dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Post by dohertys »

dsusr - here's that code...
(what I refer to as Key1 in the example is actually called ARRG_ID in the code, and what I refer to as DateKey in the example is actually called REC_ACDT in the code)


I've removed the code from here - it's dull
Last edited by dohertys on Mon Nov 26, 2007 3:13 am, edited 1 time in total.
dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Post by dohertys »

JeroenDmt wrote:Why don't you turn it around?
If you remove the duplicates first and after that you sort it the way you want the result to be sorted, it shouldn't be a problem.
JeroenDmt, the problem is, I don't really want to remove all duplicate rows (based on Key1 and Value); I only want to remove a duplicate if the rows are next to each other when I sort by Key1 and DateKey.

i.e. In the example, I only want to remove the row with DateKey 2007-01-03. I do not want to remove the row with DateKey 2007-01-04 - even though it has the same Key1 and Value as the row with DateKey 2007-01-01.

Key1 DateKey Value
1 2007-01-04 A (do not get rid of this row, because it does not duplicate the data on the row next to it)
1 2007-01-03 B (get rid of this row - because it duplicates the row next to it)
1 2007-01-02 B
1 2007-01-01 A

Hope this makes it clearer.

Thanks
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

Sorry - at first I didn't understand your question and thought you were doing something else, which is why I asked for the osh code.

But for your problem you will have to apply logic in a Transformer stage after the Sort stage, to check whether the previous record has the same key and value. If it does, exclude that record; otherwise write the record to the output.
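That Transformer approach amounts to holding the previous record's key and value in stage variables and filtering on a comparison. A rough Python equivalent (the stage-variable names are invented for illustration):

```python
def transformer_filter(rows):
    """Mimics a Transformer after the sort: sv_prev_key/sv_prev_value
    stand in for stage variables holding the prior record, and the
    output constraint drops any record that repeats both of them."""
    sv_prev_key, sv_prev_value = None, None
    for row in rows:
        keep = (row["Key1"], row["Value"]) != (sv_prev_key, sv_prev_value)
        sv_prev_key, sv_prev_value = row["Key1"], row["Value"]
        if keep:
            yield row

rows = [  # already sorted on Key1, DateKey
    {"Key1": 1, "DateKey": "2007-01-01", "Value": "A"},
    {"Key1": 1, "DateKey": "2007-01-02", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-03", "Value": "B"},
    {"Key1": 1, "DateKey": "2007-01-04", "Value": "A"},
]
survivors = list(transformer_filter(rows))
```

In a real parallel job this previous-record comparison only holds within a partition, so the data would need to be hash-partitioned on Key1 for it to be correct (an assumption about the job's partitioning).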

------
dsusr
landaghaar
Participant
Posts: 38
Joined: Wed Sep 19, 2007 10:11 am
Location: Canada

Post by landaghaar »

The sort that the Remove Duplicates stage needs is based on the keys used for removing duplicates, not on primary keys or business keys.

In your case, sort based on Value, not Key1, because in your example the key for removing a duplicate is Value, not Key1. Then sort it based on Key1 afterwards if you need that order.

It's the same as with joins: if you join based on a key, you have to sort based on that join key.
dohertys
Participant
Posts: 39
Joined: Thu Oct 11, 2007 3:26 am
Location: Sheffield

Post by dohertys »

ok - thanks.

dsusr, I'd already done what you've suggested (and a couple of other ways of doing it) - and it gives the right output without the warning - but it's quite a lot slower (job start-up time is about twice as long, and job run time is about 3 times as long), and it isn't quite as straightforward to follow on the canvas.

I'm trying to decide whether to:

1. Keep my original job and accept the warning message (bit dodgy), or
2. Use a different design, and accept a more complicated and slower job.

What do you reckon?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This one looks like a good candidate for a local message handler, to demote the warning to an informational message. It's not strictly required that the upstream sort be exactly right for the RemDups stage (as you have demonstrated); but DataStage alerts you when it isn't.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.