Removing duplicates but not giving desired output
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 34
- Joined: Tue Dec 27, 2005 12:13 am
Removing duplicates but not giving desired output
Hi,
I am having a requirement to select a record based on following criteria:
Record layout
Column1,Column 2, Date1, Date2,Flag
Flag can consists of either 'Y' or 'N'.
Based on Column 1, Column 2 if we get more than one record in the source then always take the record whose Flag is 'Y'. If the two records has same value in Flag (either Y or N) then need to take the record whose Date2 is nearest to Date1. Date1 will be same for this set of records. Always take the future nearest date if we have Date2 one day prior to Date1 for one record and Date2 2 days after Date1 for another record. So we took the difference of Date1 and Date2 in Transformer stage, if it is less than 0 means Date2 is future date of Date1 (because we used DaysSinceFromDate(Date1,Date2) function) then we generated a new column ColumnY in transformer stage where the value will be set to 1 and if the value is greater than or equal to zero then value will be set to 2. and in Column Z the value of diffrence in dates is taken (If it is less than zero then multiply with -1 else same value).
Another ColumnX whose value is 1 when Flag is 'Y' Else 2
Then in the next Sort stage "Hash" Partition on Column1, Column2 but "Sort" on Column1,Column2, ColumnX, ColumnY and ColumnZ (All in ascending order).
Then in the Next Remove Duplicates Stage Partiton is "Same" and Duplicate to retain is "First".
This job worked fine in Development and Test environments but in Production some times it is giving desired output and some times not.
Can any one help me out please.
I am having a requirement to select a record based on following criteria:
Record layout
Column1,Column 2, Date1, Date2,Flag
Flag can consists of either 'Y' or 'N'.
Based on Column 1, Column 2 if we get more than one record in the source then always take the record whose Flag is 'Y'. If the two records has same value in Flag (either Y or N) then need to take the record whose Date2 is nearest to Date1. Date1 will be same for this set of records. Always take the future nearest date if we have Date2 one day prior to Date1 for one record and Date2 2 days after Date1 for another record. So we took the difference of Date1 and Date2 in Transformer stage, if it is less than 0 means Date2 is future date of Date1 (because we used DaysSinceFromDate(Date1,Date2) function) then we generated a new column ColumnY in transformer stage where the value will be set to 1 and if the value is greater than or equal to zero then value will be set to 2. and in Column Z the value of diffrence in dates is taken (If it is less than zero then multiply with -1 else same value).
Another ColumnX whose value is 1 when Flag is 'Y' Else 2
Then in the next Sort stage "Hash" Partition on Column1, Column2 but "Sort" on Column1,Column2, ColumnX, ColumnY and ColumnZ (All in ascending order).
Then in the Next Remove Duplicates Stage Partiton is "Same" and Duplicate to retain is "First".
This job worked fine in Development and Test environments but in Production some times it is giving desired output and some times not.
Can any one help me out please.
-
- Participant
- Posts: 7
- Joined: Tue Apr 21, 2009 2:27 am
- Location: chennai
Re: Removing duplicates but not giving desired output
Its pretty simple if the flag is different.
When the flag is same, DaysSinceFromDate(Date1,Date2) function will give -1,0 or 1 as output. Obviously, if the output for two rows are -1 & 1 then the one having -1 is the preferred one.
So, we can continue with this simple comparision. However, if the output for both the two rows are -1 or 1, then simply compare the column date2 of both the rows using the same function DaysSinceFromDate(Date2 row1,Date2 row2) to get which comes first.
Hence overall step is as follows:-
1) check if the flag is same or different.
2) If Different then the one having 'Y' is the one we have to take.
3) If Same, Compare the date1,date2 of all columns to come to -1,0 or 1 values and then compare them.
4) If the two rows has different set of values from -1,0,1 then its the one in -1 that we have to take and accordingly.
5) If both have same value, then compare the two rows date2 for the same function to get -1,0,1 and then similar comparision to get the one closer.
When the flag is same, DaysSinceFromDate(Date1,Date2) function will give -1,0 or 1 as output. Obviously, if the output for two rows are -1 & 1 then the one having -1 is the preferred one.
So, we can continue with this simple comparision. However, if the output for both the two rows are -1 or 1, then simply compare the column date2 of both the rows using the same function DaysSinceFromDate(Date2 row1,Date2 row2) to get which comes first.
Hence overall step is as follows:-
1) check if the flag is same or different.
2) If Different then the one having 'Y' is the one we have to take.
3) If Same, Compare the date1,date2 of all columns to come to -1,0 or 1 values and then compare them.
4) If the two rows has different set of values from -1,0,1 then its the one in -1 that we have to take and accordingly.
5) If both have same value, then compare the two rows date2 for the same function to get -1,0,1 and then similar comparision to get the one closer.
Anshul
Re: Removing duplicates but not giving desired output
First thought? What is different between the dev/test and production environments? Something, obviously. Could be as simple as production being the first place where the job runs over multiple nodes for all we know.Ratan Babu N wrote:This job worked fine in Development and Test environments but in Production some times it is giving desired output and some times not.
If you still need help, I'd suggest giving us a better idea what 'sometimes not' means. What is wrong with the output 'at times'? How is it different from what you desire? Help us help you - simply saying "it's wrong" doesn't really further that end.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 34
- Joined: Tue Dec 27, 2005 12:13 am
Re: Removing duplicates but not giving desired output
In all the environments we are running the jobs using 4 nodes configuration files.
For more clarity on the issue, please find below some examples of source data,
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 N
In the above e.g, Record 1 should be fetched as the flag is 'Y'
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 13-May-2009 Y
2 100 200 15-May-2009 10-May-2009 Y
In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1
eventhough it is a past date
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y
In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 19-May-2009 N
2 100 200 15-May-2009 27-May-2009 N
In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 11-May-2009 Y
2 100 200 15-May-2009 26-May-2009 Y
In the above e.g, Record 2 should be fetched as in Record 2 Date2 is future
date of Date1
In the above examples, Record is not a column, its just for the reference to mention
as Record 1 or Record 2.
In all the above scenarios sometimes I am getting the expected results and
sometimes the incorrect records are fetched.
Let me know if more information is needed.
For more clarity on the issue, please find below some examples of source data,
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 N
In the above e.g, Record 1 should be fetched as the flag is 'Y'
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 13-May-2009 Y
2 100 200 15-May-2009 10-May-2009 Y
In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1
eventhough it is a past date
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y
In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 19-May-2009 N
2 100 200 15-May-2009 27-May-2009 N
In the above e.g, Record 1 should be fetched as Date2 is more nearer to Date1
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 11-May-2009 Y
2 100 200 15-May-2009 26-May-2009 Y
In the above e.g, Record 2 should be fetched as in Record 2 Date2 is future
date of Date1
In the above examples, Record is not a column, its just for the reference to mention
as Record 1 or Record 2.
In all the above scenarios sometimes I am getting the expected results and
sometimes the incorrect records are fetched.
Let me know if more information is needed.
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
-
- Participant
- Posts: 34
- Joined: Tue Dec 27, 2005 12:13 am
It is picking the wrong record but not both the records.
For example:
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y
In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1. but some times it is fetching Record 1 which is not correct.
For example:
Record Column1 Column2 Date1 Date2 Flag
1 100 200 15-May-2009 24-May-2009 Y
2 100 200 15-May-2009 20-May-2009 Y
In the above e.g, Record 2 should be fetched as Date2 is more nearer to Date1. but some times it is fetching Record 1 which is not correct.
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
-
- Participant
- Posts: 34
- Joined: Tue Dec 27, 2005 12:13 am
Since the Sort stage is "Hash" Partitioned on Column1, Column2 but "Sort" on Column1,Column2, ColumnX, ColumnY and ColumnZ (All in ascending order). And in the Next Remove Duplicates Stage Partiton is "Same" and Duplicate to retain is "First". Always the first record should be desired one right(As per the requirement)?Sainath.Srinivasan wrote:Did you check whether this is the impact of 'First' in remove duplicates?
Is the sort order will change when the record reaches the Remove duplicates stage?
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
-
- Participant
- Posts: 34
- Joined: Tue Dec 27, 2005 12:13 am
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
-
- Participant
- Posts: 34
- Joined: Tue Dec 27, 2005 12:13 am
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom