Extracting First row of sorted data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Extracting First row of sorted data

Post by abhilashnair »

I have the requirement as follows:

col1 is sorted ascending and col2 descending. I need to fetch only the first row of the sort output. The scenario is that col1 contains the key and col2 is a timestamp: the key is repeated and the timestamp is unique. I need only those keys with max(timestamp), so I have sorted col1 ascending and col2 descending. But how can I extract the first row for each key?
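
To illustrate the logic I want (a minimal Python sketch, not DataStage; the sample values are made up):

```python
rows = [
    ("A", "04:59"), ("A", "05:00"), ("B", "02:30"),
    ("B", "03:00"), ("A", "04:58"), ("B", "02:00"),
]

# Sort col2 descending, then col1 ascending (Python sorts are stable,
# so the per-key descending timestamp order survives the second sort).
rows.sort(key=lambda r: r[1], reverse=True)
rows.sort(key=lambda r: r[0])

# Keep only the first row of each key group, i.e. the max timestamp.
seen, first_per_key = set(), []
for col1, col2 in rows:
    if col1 not in seen:
        seen.add(col1)
        first_per_key.append((col1, col2))

print(first_per_key)   # [('A', '05:00'), ('B', '03:00')]
```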
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Re: Extracting First row of sorted data

Post by samyamkrishna »

You can use a Remove Duplicates stage and retain the first record.
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

Using Remove Duplicates won't work, as there are no duplicates here. The combination of col1 and col2 is always unique.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Yes, there are duplicates here; you have stated so yourself:
The scenario is that col1 contains the key and col2 is a timestamp: the key is repeated
You would remove duplicates based on col1 only in the Remove Duplicates stage.
- james wiles


All generalizations are false, including this one - Mark Twain.
ds@tcs
Participant
Posts: 24
Joined: Thu Mar 17, 2011 6:26 am
Location: chennai
Contact:

Post by ds@tcs »

Sort the key column ascending and the timestamp descending, then use a Head stage and set it to send only one row to the output.
So obviously you will get only the first row at the target.
Regards
Sankar
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

The combination of col1 and col2 is never a duplicate. Col1 individually may have duplicates, but combined with col2 there are none, so we cannot use the Remove Duplicates stage, or even remove duplicates via the Sort stage. Using Head also cannot solve the issue, as it will only return the first row of the entire dataset. To make things clearer, my sample data looks like this:

Col1 Col2
A 05:00
A 04:59
A 04:58
B 03:00
B 02:30
B 02:00

and so on...
MT
Premium Member
Posts: 198
Joined: Fri Mar 09, 2007 3:51 am

Post by MT »

Hi abhilashnair,

You could use a Sort stage to detect the group change (the Create Key Change Column option) and then filter on it.
Alternatively, you can do it "manually" in a Transformer: use a stage variable to detect the change of the key column and set that as the condition for the output.
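
Roughly what that stage-variable logic does, as a Python sketch of the idea (not Transformer syntax; assumes the input is already sorted as described):

```python
def first_row_per_key(sorted_rows):
    """Emit a row only when the key changes, mimicking a stage
    variable that remembers the previous row's key."""
    prev_key = None                      # the "stage variable"
    for col1, col2 in sorted_rows:
        if col1 != prev_key:             # key change = output condition
            yield col1, col2
        prev_key = col1

rows = [("A", "05:00"), ("A", "04:59"), ("B", "03:00"), ("B", "02:30")]
print(list(first_row_per_key(rows)))     # [('A', '05:00'), ('B', '03:00')]
```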
regards

Michael
antonyraj.deva
Premium Member
Posts: 138
Joined: Wed Jul 16, 2008 9:51 pm
Location: Kolkata

Post by antonyraj.deva »

When defining duplicates, it is done at the level of the key column(s) you choose in the stage, not by concatenating two or more fields.

By using a Remove Duplicates stage and retaining the first record, as James and others have suggested, your output will be:

Col1 Col2
A 05:00
B 03:00

This is the desired output as per your post. If there's something else you'd like to have as output, then explain the requirement.
TONY
ETL Manager
Infotrellis India

"Do what you can, with what you have, from where you are and to the best of your abilities."
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Hi Abhilash,

In the Remove Duplicates stage, use col1 only as the key and retain the first record.

That's all.
TPons
Participant
Posts: 18
Joined: Mon Jan 03, 2011 3:32 am
Location: India

Post by TPons »

According to your requirement, for the following assumed sample data

col1(key) col2(Time stamp)
100 2011:03:10 08:20:00
100 2010:08:11 08:23:00
100 2010:03:10 10:30:00
101 2011:04:13 03:00:00
101 2009:02:10 01:00:00
101 2007:03:12 08:20:00
102 2010:07:16 05:30:00
102 2008:03:10 08:20:00

your output should be

col1(key) col2(Time stamp)
100 2011:03:10 08:20:00
101 2011:04:13 03:00:00
102 2010:07:16 05:30:00

If this is your requirement, what samyamkrishna suggested will work, provided your incoming data has col1 in ascending order and col2 in descending order.

Use a Remove Duplicates stage, set col1 as the key, and set the Duplicate To Retain option to First.
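
In other words, under those settings the stage behaves roughly like this (a Python sketch, assuming the input arrives already sorted on col1 ascending and col2 descending):

```python
from itertools import groupby

rows = [
    ("100", "2011:03:10 08:20:00"), ("100", "2010:08:11 08:23:00"),
    ("101", "2011:04:13 03:00:00"), ("101", "2009:02:10 01:00:00"),
    ("102", "2010:07:16 05:30:00"), ("102", "2008:03:10 08:20:00"),
]

# Key = col1, Duplicate To Retain = First: within each run of equal
# keys, keep only the first row seen.
deduped = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]
for row in deduped:
    print(row)
# ('100', '2011:03:10 08:20:00')
# ('101', '2011:04:13 03:00:00')
# ('102', '2010:07:16 05:30:00')
```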


Pons
Nagac
Premium Member
Posts: 127
Joined: Tue Mar 29, 2011 11:39 am
Location: India

Post by Nagac »

Use a Remove Duplicates stage, set col1 as the key, and set the Duplicate To Retain option to First.


It will 100% work.
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

Remove Duplicates worked, but here is the catch.

My source has six records with the same key and different timestamps.

The job structure was ODBC -----> Remove Duplicates -----> Data Set.

I was hash partitioning on col1 and col2, with an inline sort in the Remove Duplicates stage, and selected the option Duplicate To Retain = First.

It was not working: three rows were being returned instead of one.


Then I used col1 as the only hash key, col2 was sorted in descending order, and col2 was removed as a partitioning key. In this scenario it worked and I got the desired output. But why this behaviour?

Can anyone explain? I will mark the topic resolved once I get the answer.
antonyraj.deva
Premium Member
Posts: 138
Joined: Wed Jul 16, 2008 9:51 pm
Location: Kolkata

Post by antonyraj.deva »

Abhilash,

If you do a total sort, then hashing on both columns would still give the desired output.

If the job runs on three nodes without a total sort, the same key value can be repeated up to three times.
TONY
ETL Manager
Infotrellis India

"Do what you can, with what you have, from where you are and to the best of your abilities."
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

abhilashnair wrote: Then I used col1 as the only hash key, col2 was sorted in descending order, and col2 was removed as a partitioning key. In this scenario it worked and I got the desired output. But why this behaviour?
Well, because it was the right thing to do? :wink:

You partition on the grouping key(s) to ensure all occurrences of the same key go to the same partition. And what column were you grouping on for the change detection? Just col1.
-craig

"You can never have too many knives" -- Logan Nine Fingers
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

abhilashnair,

The purpose of key-based partitioning such as Hash or Modulus is primarily to place like-valued (like-keyed) records in the same processing partition in order to meet the requirements of the business rules. Your business rules for this job were what you stated in your first post: keep the record with the maximum timestamp value (col2) within a key value (col1), i.e. "extract the first row for each key".

You're correct that there are no duplicates when you take the combination of col1 and col2, but that combination DOESN'T meet the business rule. The statement "first row for each key" is the important concept here: it indicates that you do have duplicates within a key value, and the driving business rule is to remove the duplicate key values while retaining only the one with the max timestamp. To do this properly in a parallel processing environment, you must partition and sort your data so that records with the same key value are pulled together and in the proper order.

Each column you partition on with Hash affects which partition a record is sent to. As you needed to keep a single record within a key value (col1), the appropriate partitioning method is only using col1. By adding col2 to the partitioning, the distribution is affected and records with the same key value can end up in different partitions. You encountered this when you were getting three rows returned instead of one. The sorting and dedup logic occurs only within a partition and not across partitions.
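
As a rough illustration of that effect (a Python sketch; the hash function and three-node layout are stand-ins for the framework's actual partitioner):

```python
NODES = 3

def partition(key_fields):
    # Stand-in for a hash partitioner: the target partition depends on
    # every field included in the hash key.
    return hash(key_fields) % NODES

rows = [("A", "05:00"), ("A", "04:59"), ("A", "04:58")]

# Hash on col1 only: every "A" row maps to the same partition, so the
# sort/dedup logic sees the whole key group together.
print({partition((col1,)) for col1, col2 in rows})       # e.g. {2}

# Hash on (col1, col2): rows with the same key can scatter across
# partitions, and each partition keeps its own "first" row, which is
# how three rows can come back instead of one.
print({partition((col1, col2)) for col1, col2 in rows})  # e.g. {0, 1, 2}
```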

I suggest you take some time to read further on partitioning and how it is used with other stages to accomplish a given type of task. The Parallel Job Developer Guide and a couple of IBM Redbooks on Parallel Framework Standard Practices and Job Design/Data Flow are very good references for this. You can Google for the redbooks or find the links in other posts in the forums. The Job Developer Guide is part of the documentation supplied with Information Server.

Hope this helps your understanding of the process!

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.