Removing duplicates but not giving desired output

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Removing duplicates but not giving desired output

Post by Ratan Babu N »

Hi,

I am having a requirement to select a record based on following criteria:

Record layout
Column1,Column 2, Date1, Date2,Flag

Flag can consists of either 'Y' or 'N'.

Based on Column 1, Column 2 if we get more than one record in the source then always take the record whose Flag is 'Y'. If the two records has same value in Flag (either Y or N) then need to take the record whose Date2 is nearest to Date1. Date1 will be same for this set of records. Always take the future nearest date if we have Date2 one day prior to Date1 for one record and Date2 2 days after Date1 for another record. So we took the difference of Date1 and Date2 in Transformer stage, if it is less than 0 means Date2 is future date of Date1 (because we used DaysSinceFromDate(Date1,Date2) function) then we generated a new column ColumnY in transformer stage where the value will be set to 1 and if the value is greater than or equal to zero then value will be set to 2. and in Column Z the value of diffrence in dates is taken (If it is less than zero then multiply with -1 else same value).

Another ColumnX whose value is 1 when Flag is 'Y' Else 2

Then in the next Sort stage "Hash" Partition on Column1, Column2 but "Sort" on Column1,Column2, ColumnX, ColumnY and ColumnZ (All in ascending order).

Then in the Next Remove Duplicates Stage Partiton is "Same" and Duplicate to retain is "First".

This job worked fine in Development and Test environments but in Production some times it is giving desired output and some times not.

Can any one help me out please.
anshul_roy
Participant
Posts: 7
Joined: Tue Apr 21, 2009 2:27 am
Location: chennai

Re: Removing duplicates but not giving desired output

Post by anshul_roy »

Its pretty simple if the flag is different.
When the flag is same, DaysSinceFromDate(Date1,Date2) function will give -1,0 or 1 as output. Obviously, if the output for two rows are -1 & 1 then the one having -1 is the preferred one.

So, we can continue with this simple comparision. However, if the output for both the two rows are -1 or 1, then simply compare the column date2 of both the rows using the same function DaysSinceFromDate(Date2 row1,Date2 row2) to get which comes first.


Hence overall step is as follows:-

1) check if the flag is same or different.
2) If Different then the one having 'Y' is the one we have to take.
3) If Same, Compare the date1,date2 of all columns to come to -1,0 or 1 values and then compare them.
4) If the two rows has different set of values from -1,0,1 then its the one in -1 that we have to take and accordingly.
5) If both have same value, then compare the two rows date2 for the same function to get -1,0,1 and then similar comparision to get the one closer.
Anshul
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Removing duplicates but not giving desired output

Post by chulett »

Ratan Babu N wrote:This job worked fine in Development and Test environments but in Production some times it is giving desired output and some times not.
First thought? What is different between the dev/test and production environments? Something, obviously. Could be as simple as production being the first place where the job runs over multiple nodes for all we know.

If you still need help, I'd suggest giving us a better idea what 'sometimes not' means. What is wrong with the output 'at times'? How is it different from what you desire? Help us help you - simply saying "it's wrong" doesn't really further that end.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Re: Removing duplicates but not giving desired output

Post by Ratan Babu N »

In all the environments we are running the jobs using 4 nodes configuration files.

For more clarity on the issue, please find below some examples of source data,

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 N

In the above e.g, Record 1 should be fetched as the flag is 'Y'

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 13-May-2009 Y
2 100 200 15-May-2009 10-May-2009 Y

In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1
eventhough it is a past date

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y

In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 19-May-2009 N
2 100 200 15-May-2009 27-May-2009 N

In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 11-May-2009 Y
2 100 200 15-May-2009 26-May-2009 Y

In the above e.g, Record 2 should be fetched as in Record 2 Date2 is future
date of Date1

In the above examples, Record is not a column, its just for the reference to mention
as Record 1 or Record 2.
In all the above scenarios sometimes I am getting the expected results and
sometimes the incorrect records are fetched.

Let me know if more information is needed.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

What does the 'sometimes' have in common ?

What incorrect results do you receive? i.e. does it pick wrong record or both?
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Post by Ratan Babu N »

It is picking the wrong record but not both the records.

For example:
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y

In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1. but some times it is fetching Record 1 which is not correct.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Did you check whether this is the impact of 'First' in remove duplicates?
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Post by Ratan Babu N »

Sainath.Srinivasan wrote:Did you check whether this is the impact of 'First' in remove duplicates?
Since the Sort stage is "Hash" Partitioned on Column1, Column2 but "Sort" on Column1,Column2, ColumnX, ColumnY and ColumnZ (All in ascending order). And in the Next Remove Duplicates Stage Partiton is "Same" and Duplicate to retain is "First". Always the first record should be desired one right(As per the requirement)?

Is the sort order will change when the record reaches the Remove duplicates stage?
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

What are your keys for remove duplicates ?
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Post by Ratan Babu N »

Sainath.Srinivasan wrote:What are your keys for remove duplicates ?
Column1 and Column2 i.e, In example:
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y

The values 100 and 200 combination makes the key value.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

In that case, would't the second record be discarded by the stage ?
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Post by Ratan Babu N »

Sainath.Srinivasan wrote:In that case, would't the second record be discarded by the stage ?
Actually first record should be discarded always. But in some runs first record is discarded and in some other runs second record is discarded.
So same job is yielding different results in different runs.
singhald
Participant
Posts: 180
Joined: Tue Aug 23, 2005 2:50 am
Location: Bangalore
Contact:

Post by singhald »

Try to use stable sort option and see the result
Regards,
Deepak Singhal
Everything is okay in the end. If it's not okay, then it's not the end.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

or create a "Date2 - Date1" column and use it in the sort.
Post Reply