Extracting First row of sorted data
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 284
- Joined: Fri Oct 13, 2006 4:31 am
Extracting First row of sorted data
I have the requirement as follows:
col1 is sorted ascending and col2 descending. I need to fetch only the first row of the sort output. The scenario: col1 contains the key and col2 a timestamp; the key is repeated and the timestamp is unique. I need only those keys with max(timestamp), so I have sorted col1 ascending and col2 descending. But how can I extract the first row for each key?
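The requirement can be sketched outside DataStage as a simple max-per-key scan. This is a minimal illustration only, using the column names and sample values from this thread; it is not the DataStage implementation discussed below.

```python
# Sample rows as (col1=key, col2=timestamp), per the post.
rows = [
    ("A", "05:00"), ("A", "04:59"), ("A", "04:58"),
    ("B", "03:00"), ("B", "02:30"), ("B", "02:00"),
]

latest = {}
for key, ts in rows:
    # Retain, for each key, the row with the greatest timestamp.
    if key not in latest or ts > latest[key]:
        latest[key] = ts

result = sorted(latest.items())
print(result)  # [('A', '05:00'), ('B', '03:00')]
```

Note this version does not need pre-sorted input; the DataStage approach discussed below relies on the sort order instead.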
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
Re: Extracting First row of sorted data
You can use a Remove Duplicates stage and retain the first record.
-
- Participant
- Posts: 284
- Joined: Fri Oct 13, 2006 4:31 am
The combination of col1 and col2 is never duplicated. Col1 individually may have duplicates, but combined with col2 there are none, so we cannot use a Remove Duplicates stage (or remove duplicates via the Sort stage) on that combination. Using a Head stage also cannot solve the issue, as it would only return the first row of the entire dataset. To make things clearer, my sample data looks like this:
Col1 Col2
A 05:00
A 04:59
A 04:58
B 03:00
B 02:30
B 02:00
and so on...
-
- Premium Member
- Posts: 138
- Joined: Wed Jul 16, 2008 9:51 pm
- Location: Kolkata
Duplicates are defined at the individual key-column level, not by concatenating two or more fields.
By using a Remove Duplicates stage and retaining the first record, as James and others have suggested, your output will be
Col1 Col2
A 05:00
B 03:00
This is the desired output as per your post. If there's something else you'd like to have as output, then explain the requirement.
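Because the input already arrives with col1 ascending and col2 descending, retaining the first record per key is enough. A minimal sketch of that "Duplicate To Retain = First" behaviour (assuming the sample data from this thread, not actual DataStage code):

```python
from itertools import groupby
from operator import itemgetter

# Input already sorted: col1 ascending, col2 descending, as in the post.
rows = [
    ("A", "05:00"), ("A", "04:59"), ("A", "04:58"),
    ("B", "03:00"), ("B", "02:30"), ("B", "02:00"),
]

# Equivalent of Remove Duplicates with key=col1, Duplicate To Retain=First:
# group consecutive rows by col1 and keep only the first of each group.
first_per_key = [next(group) for _, group in groupby(rows, key=itemgetter(0))]
print(first_per_key)  # [('A', '05:00'), ('B', '03:00')]
```

Like the Remove Duplicates stage, this only works if like-keyed rows are adjacent, which is why the sort order matters.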
TONY
ETL Manager
Infotrellis India
"Do what you can, with what you have, from where you are and to the best of your abilities."
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
According to your requirement, for the following assumed sample data
col1(key) col2(Time stamp)
100 2011:03:10 08:20:00
100 2010:08:11 08:23:00
100 2010:03:10 10:30:00
101 2011:04:13 03:00:00
101 2009:02:10 01:00:00
101 2007:03:12 08:20:00
102 2010:07:16 05:30:00
102 2008:03:10 08:20:00
your output should be
col1(key) col2(Time stamp)
100 2011:03:10 08:20:00
101 2011:04:13 03:00:00
102 2010:07:16 05:30:00
If this is your requirement, what samyamkrishna suggested will work, provided your incoming data has col1 in ascending order and col2 in descending order.
Use a Remove Duplicates stage, set col1 as the key, and set the Duplicate To Retain option to First.
Pons
-
- Participant
- Posts: 284
- Joined: Fri Oct 13, 2006 4:31 am
Remove Duplicates worked but here is the catch
My source has 6 records with same key and different timestamp.
Job structure was like ODBC----->RemDuplicate-------->DataSet
I was hash partitioning on col1 and col2, with inline sort, in the Remove Duplicates stage, and selected the option Duplicate To Retain = First.
It was not working: three rows were being returned instead of one.
Then I used col1 as the only hash key, sorted col2 in descending order, and removed col2 as a partitioning key. In this scenario it worked and I got the desired output. But why this behaviour?
Can anyone explain? I will mark the topic resolved once I get the answer.
-
- Premium Member
- Posts: 138
- Joined: Wed Jul 16, 2008 9:51 pm
- Location: Kolkata
Abhilash,
If you do a total sort, then hashing on both columns would still give the desired output.
If the job runs on three nodes without a total sort, the same key value will be repeated three times.
TONY
ETL Manager
Infotrellis India
"Do what you can, with what you have, from where you are and to the best of your abilities."
abhilashnair wrote: "Then I used col1 as the only hash key and col2 was sorted in descending order, and also col2 was removed as a partitioning key. In this scenario it worked and I got the desired output, but why this behaviour?"
Well, because it was the right thing to do?
You partition on the grouping key(s) to ensure all occurrences of the same key go to the same partition. And what column(s) were you grouping on for the change detection? Just Col1.
-craig
"You can never have too many knives" -- Logan Nine Fingers
abhilashnair,
The purpose of a key-based partitioning method such as Hash or Modulus is primarily to place like-keyed records in the same processing partition in order to meet the requirements of the business rules. Your business rules for this job were what you stated in your first post: keep the record with the maximum timestamp value (col2) within a key value (col1) ("extract the first row for each key").
You're correct that there are no duplicates when you take the combination of col1 and col2. But that combination DOESN'T meet the business rules requirements. The statement "first row for each key" is the important concept here: It indicates that you have duplicates within a key value and that is the driving business rule--remove the duplicate key values while retaining only the one with the max timestamp value. To properly do this in a parallel processing environment, you must partition and sort your data so that records with the same key value are pulled together and in the proper order.
Each column you partition on with Hash affects which partition a record is sent to. As you needed to keep a single record within a key value (col1), the appropriate partitioning method is only using col1. By adding col2 to the partitioning, the distribution is affected and records with the same key value can end up in different partitions. You encountered this when you were getting three rows returned instead of one. The sorting and dedup logic occurs only within a partition and not across partitions.
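This partitioning effect can be demonstrated with a toy simulation. The hash function below is an arbitrary stand-in, not the DataStage partitioner, and the three "nodes" and sample timestamps are assumptions for illustration; the point is only that adding col2 to the partition key scatters same-key rows.

```python
NODES = 3  # assumed node count, matching the three-row symptom above

def toy_hash(s):
    # Deterministic stand-in for a hash partitioner (not DataStage's).
    return sum(map(ord, s))

rows = [("K", f"08:2{i}:00") for i in range(6)]  # one key, six timestamps

def partition(rows, key_cols):
    parts = [[] for _ in range(NODES)]
    for row in rows:
        key = "".join(row[c] for c in key_cols)
        parts[toy_hash(key) % NODES].append(row)
    return parts

# Hashing on col1 only: all six rows land in the same partition, so a
# per-partition "retain first" keeps exactly one row.
on_key = [p for p in partition(rows, [0]) if p]
print(len(on_key))  # 1 non-empty partition

# Hashing on (col1, col2): the same key is spread across partitions, so
# the per-partition dedup keeps one row per non-empty partition -- three
# rows, matching the behaviour reported above.
on_both = [p for p in partition(rows, [0, 1]) if p]
print(len(on_both))  # 3 non-empty partitions
```

The dedup itself is unchanged in both runs; only the partition key differs, which is why removing col2 from the partitioning fixed the job.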
I suggest you take some time to read further on partitioning and how it is used with other stages to accomplish a given type of task. The Parallel Job Developer Guide and a couple of IBM Redbooks on Parallel Framework Standard Practices and Job Design/Data Flow are very good references for this. You can Google for the redbooks or find the links in other posts in the forums. The Job Developer Guide is part of the documentation supplied with Information Server.
Hope this helps your understanding of the process!
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.