Extracting First row of sorted data
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 284
- Joined: Fri Oct 13, 2006 4:31 am
Extracting First row of sorted data
I have the requirement as follows:
col1 is sorted ascending and col2 descending. I need to fetch only the first row of the sort output. The scenario: col1 contains the key and col2 a timestamp; the key is repeated and the timestamp is unique. I need only those keys with max(timestamp), so I have sorted col1 ascending and col2 descending. But how can I extract the first row for each key?
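The requirement can be sketched outside DataStage as a simple max-per-key scan. This is a minimal illustration only, using the column names and sample values from this thread; it is not the DataStage implementation discussed below.

```python
# Sample rows as (col1=key, col2=timestamp), per the post.
rows = [
    ("A", "05:00"), ("A", "04:59"), ("A", "04:58"),
    ("B", "03:00"), ("B", "02:30"), ("B", "02:00"),
]

latest = {}
for key, ts in rows:
    # Retain, for each key, the row with the greatest timestamp.
    if key not in latest or ts > latest[key]:
        latest[key] = ts

result = sorted(latest.items())
print(result)  # [('A', '05:00'), ('B', '03:00')]
```

Note this version does not need pre-sorted input; the DataStage approach discussed below relies on the sort order instead.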
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
Re: Extracting First row of sorted data
You can use a Remove Duplicates stage and retain the first record.
-
- Participant
- Posts: 284
- Joined: Fri Oct 13, 2006 4:31 am
The combination of col1 and col2 is never duplicated. Col1 individually may have duplicates, but combined with col2 there are none, so we cannot use a Remove Duplicates stage (or remove duplicates via the Sort stage) on that combination. Using a Head stage also cannot solve the issue, as it would only return the first row of the entire dataset. To make things clearer, my sample data looks like this:
Col1 Col2
A 05:00
A 04:59
A 04:58
B 03:00
B 02:30
B 02:00
and so on...
-
- Premium Member
- Posts: 138
- Joined: Wed Jul 16, 2008 9:51 pm
- Location: Kolkata
Duplicates are defined at the individual key-column level, not by concatenating two or more fields.
By using a Remove Duplicates stage and retaining the first record, as James and others have suggested, your output will be
Col1 Col2
A 05:00
B 03:00
This is the desired output as per your post. If there's something else you'd like to have as output, then explain the requirement.
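Because the input already arrives with col1 ascending and col2 descending, retaining the first record per key is enough. A minimal sketch of that "Duplicate To Retain = First" behaviour (assuming the sample data from this thread, not actual DataStage code):

```python
from itertools import groupby
from operator import itemgetter

# Input already sorted: col1 ascending, col2 descending, as in the post.
rows = [
    ("A", "05:00"), ("A", "04:59"), ("A", "04:58"),
    ("B", "03:00"), ("B", "02:30"), ("B", "02:00"),
]

# Equivalent of Remove Duplicates with key=col1, Duplicate To Retain=First:
# group consecutive rows by col1 and keep only the first of each group.
first_per_key = [next(group) for _, group in groupby(rows, key=itemgetter(0))]
print(first_per_key)  # [('A', '05:00'), ('B', '03:00')]
```

Like the Remove Duplicates stage, this only works if like-keyed rows are adjacent, which is why the sort order matters.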
TONY
ETL Manager
Infotrellis India
"Do what you can, with what you have, from where you are and to the best of your abilities."
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
According to your requirement, for the following assumed sample data
col1(key) col2(Time stamp)
100 2011:03:10 08:20:00
100 2010:08:11 08:23:00
100 2010:03:10 10:30:00
101 2011:04:13 03:00:00
101 2009:02:10 01:00:00
101 2007:03:12 08:20:00
102 2010:07:16 05:30:00
102 2008:03:10 08:20:00
your output should be
col1(key) col2(Time stamp)
100 2011:03:10 08:20:00
101 2011:04:13 03:00:00
102 2010:07:16 05:30:00
If this is your requirement, what samyamkrishna suggested will work, provided your incoming data has col1 in ascending order and col2 in descending order.
Use a Remove Duplicates stage, set col1 as the key, and set the Duplicate To Retain option to First.
Pons
-
- Participant
- Posts: 284
- Joined: Fri Oct 13, 2006 4:31 am
Remove Duplicates worked but here is the catch
My source has 6 records with same key and different timestamp.
Job structure was like ODBC----->RemDuplicate-------->DataSet
I was hash partitioning on col1 and col2, with inline sort, in the Remove Duplicates stage, and selected the option Duplicate To Retain = First.
It was not working: three rows were being returned instead of one.
Then I used col1 as the only hash key, sorted col2 in descending order, and removed col2 as a partitioning key. In this scenario it worked and I got the desired output. But why this behaviour?
Can anyone explain? I will mark the topic resolved once I get the answer.
-
- Premium Member
- Posts: 138
- Joined: Wed Jul 16, 2008 9:51 pm
- Location: Kolkata
Abhilash,
If you do a total sort, then hashing on both columns would still give the desired output.
If the job runs on three nodes without a total sort, the same key value will be repeated three times.
TONY
ETL Manager
Infotrellis India
"Do what you can, with what you have, from where you are and to the best of your abilities."
abhilashnair wrote: "Then I used col1 as the only hash key and col2 was sorted in descending order, and also col2 was removed as a partitioning key. In this scenario it worked and I got the desired output, but why this behaviour?"
Well, because it was the right thing to do?
You partition on the grouping key(s) to ensure all occurrences of the same key go to the same partition. And what column(s) were you grouping on for the change detection? Just Col1.
-craig
"You can never have too many knives" -- Logan Nine Fingers
abhilashnair,
The purpose of a key-based partitioning method such as Hash or Modulus is primarily to place like-keyed records in the same processing partition in order to meet the requirements of the business rules. Your business rules for this job were what you stated in your first post: keep the record with the maximum timestamp value (col2) within a key value (col1) ("extract the first row for each key").
You're correct that there are no duplicates when you take the combination of col1 and col2. But that combination DOESN'T meet the business rules requirements. The statement "first row for each key" is the important concept here: It indicates that you have duplicates within a key value and that is the driving business rule--remove the duplicate key values while retaining only the one with the max timestamp value. To properly do this in a parallel processing environment, you must partition and sort your data so that records with the same key value are pulled together and in the proper order.
Each column you partition on with Hash affects which partition a record is sent to. As you needed to keep a single record within a key value (col1), the appropriate partitioning method is only using col1. By adding col2 to the partitioning, the distribution is affected and records with the same key value can end up in different partitions. You encountered this when you were getting three rows returned instead of one. The sorting and dedup logic occurs only within a partition and not across partitions.
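This partitioning effect can be demonstrated with a toy simulation. The hash function below is an arbitrary stand-in, not the DataStage partitioner, and the three "nodes" and sample timestamps are assumptions for illustration; the point is only that adding col2 to the partition key scatters same-key rows.

```python
NODES = 3  # assumed node count, matching the three-row symptom above

def toy_hash(s):
    # Deterministic stand-in for a hash partitioner (not DataStage's).
    return sum(map(ord, s))

rows = [("K", f"08:2{i}:00") for i in range(6)]  # one key, six timestamps

def partition(rows, key_cols):
    parts = [[] for _ in range(NODES)]
    for row in rows:
        key = "".join(row[c] for c in key_cols)
        parts[toy_hash(key) % NODES].append(row)
    return parts

# Hashing on col1 only: all six rows land in the same partition, so a
# per-partition "retain first" keeps exactly one row.
on_key = [p for p in partition(rows, [0]) if p]
print(len(on_key))  # 1 non-empty partition

# Hashing on (col1, col2): the same key is spread across partitions, so
# the per-partition dedup keeps one row per non-empty partition -- three
# rows, matching the behaviour reported above.
on_both = [p for p in partition(rows, [0, 1]) if p]
print(len(on_both))  # 3 non-empty partitions
```

The dedup itself is unchanged in both runs; only the partition key differs, which is why removing col2 from the partitioning fixed the job.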
I suggest you take some time to read further on partitioning and how it is used with other stages to accomplish a given type of task. The Parallel Job Developer Guide and a couple of IBM Redbooks on Parallel Framework Standard Practices and Job Design/Data Flow are very good references for this. You can Google for the redbooks or find the links in other posts in the forums. The Job Developer Guide is part of the documentation supplied with Information Server.
Hope this helps your understanding of the process!
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.