Page 1 of 1

Removing duplicates but not giving desired output

Posted: Fri May 22, 2009 5:50 am
by Ratan Babu N
Hi,

I am having a requirement to select a record based on following criteria:

Record layout
Column1,Column 2, Date1, Date2,Flag

Flag can consists of either 'Y' or 'N'.

Based on Column 1, Column 2 if we get more than one record in the source then always take the record whose Flag is 'Y'. If the two records has same value in Flag (either Y or N) then need to take the record whose Date2 is nearest to Date1. Date1 will be same for this set of records. Always take the future nearest date if we have Date2 one day prior to Date1 for one record and Date2 2 days after Date1 for another record. So we took the difference of Date1 and Date2 in Transformer stage, if it is less than 0 means Date2 is future date of Date1 (because we used DaysSinceFromDate(Date1,Date2) function) then we generated a new column ColumnY in transformer stage where the value will be set to 1 and if the value is greater than or equal to zero then value will be set to 2. and in Column Z the value of diffrence in dates is taken (If it is less than zero then multiply with -1 else same value).

Another ColumnX whose value is 1 when Flag is 'Y' Else 2

Then in the next Sort stage "Hash" Partition on Column1, Column2 but "Sort" on Column1,Column2, ColumnX, ColumnY and ColumnZ (All in ascending order).

Then in the Next Remove Duplicates Stage Partiton is "Same" and Duplicate to retain is "First".

This job worked fine in Development and Test environments but in Production some times it is giving desired output and some times not.

Can any one help me out please.

Re: Removing duplicates but not giving desired output

Posted: Mon May 25, 2009 4:16 am
by anshul_roy
Its pretty simple if the flag is different.
When the flag is same, DaysSinceFromDate(Date1,Date2) function will give -1,0 or 1 as output. Obviously, if the output for two rows are -1 & 1 then the one having -1 is the preferred one.

So, we can continue with this simple comparision. However, if the output for both the two rows are -1 or 1, then simply compare the column date2 of both the rows using the same function DaysSinceFromDate(Date2 row1,Date2 row2) to get which comes first.


Hence overall step is as follows:-

1) check if the flag is same or different.
2) If Different then the one having 'Y' is the one we have to take.
3) If Same, Compare the date1,date2 of all columns to come to -1,0 or 1 values and then compare them.
4) If the two rows has different set of values from -1,0,1 then its the one in -1 that we have to take and accordingly.
5) If both have same value, then compare the two rows date2 for the same function to get -1,0,1 and then similar comparision to get the one closer.

Re: Removing duplicates but not giving desired output

Posted: Mon May 25, 2009 7:58 am
by chulett
Ratan Babu N wrote:This job worked fine in Development and Test environments but in Production some times it is giving desired output and some times not.
First thought? What is different between the dev/test and production environments? Something, obviously. Could be as simple as production being the first place where the job runs over multiple nodes for all we know.

If you still need help, I'd suggest giving us a better idea what 'sometimes not' means. What is wrong with the output 'at times'? How is it different from what you desire? Help us help you - simply saying "it's wrong" doesn't really further that end.

Re: Removing duplicates but not giving desired output

Posted: Wed May 27, 2009 3:36 am
by Ratan Babu N
In all the environments we are running the jobs using 4 nodes configuration files.

For more clarity on the issue, please find below some examples of source data,

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 N

In the above e.g, Record 1 should be fetched as the flag is 'Y'

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 13-May-2009 Y
2 100 200 15-May-2009 10-May-2009 Y

In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1
eventhough it is a past date

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y

In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 19-May-2009 N
2 100 200 15-May-2009 27-May-2009 N

In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1

Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 11-May-2009 Y
2 100 200 15-May-2009 26-May-2009 Y

In the above e.g, Record 2 should be fetched as in Record 2 Date2 is future
date of Date1

In the above examples, Record is not a column, its just for the reference to mention
as Record 1 or Record 2.
In all the above scenarios sometimes I am getting the expected results and
sometimes the incorrect records are fetched.

Let me know if more information is needed.

Posted: Wed May 27, 2009 4:22 am
by Sainath.Srinivasan
What does the 'sometimes' have in common ?

What incorrect results do you receive? i.e. does it pick wrong record or both?

Posted: Wed May 27, 2009 5:25 am
by Ratan Babu N
It is picking the wrong record but not both the records.

For example:
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y

In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1. but some times it is fetching Record 1 which is not correct.

Posted: Thu May 28, 2009 2:53 am
by Sainath.Srinivasan
Did you check whether this is the impact of 'First' in remove duplicates?

Posted: Thu May 28, 2009 6:01 am
by Ratan Babu N
Sainath.Srinivasan wrote:Did you check whether this is the impact of 'First' in remove duplicates?
Since the Sort stage is "Hash" Partitioned on Column1, Column2 but "Sort" on Column1,Column2, ColumnX, ColumnY and ColumnZ (All in ascending order). And in the Next Remove Duplicates Stage Partiton is "Same" and Duplicate to retain is "First". Always the first record should be desired one right(As per the requirement)?

Is the sort order will change when the record reaches the Remove duplicates stage?

Posted: Thu May 28, 2009 7:45 am
by Sainath.Srinivasan
What are your keys for remove duplicates ?

Posted: Thu May 28, 2009 8:04 am
by Ratan Babu N
Sainath.Srinivasan wrote:What are your keys for remove duplicates ?
Column1 and Column2 i.e, In example:
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y

The values 100 and 200 combination makes the key value.

Posted: Thu May 28, 2009 10:41 am
by Sainath.Srinivasan
In that case, would't the second record be discarded by the stage ?

Posted: Fri May 29, 2009 4:31 am
by Ratan Babu N
Sainath.Srinivasan wrote:In that case, would't the second record be discarded by the stage ?
Actually first record should be discarded always. But in some runs first record is discarded and in some other runs second record is discarded.
So same job is yielding different results in different runs.

Posted: Fri May 29, 2009 4:48 am
by singhald
Try to use stable sort option and see the result

Posted: Fri May 29, 2009 5:46 am
by Sainath.Srinivasan
or create a "Date2 - Date1" column and use it in the sort.