Page 1 of 1

HOw to group by in datastage?

Posted: Tue Jun 30, 2009 12:11 am
by abhinavsuri
I have to read employee data from sequenetial file. Then I need to find the 3 persons who earn the highest salary in a department of each company.(i.e. we need to group by ORGANISATION NAME,DEPARTMENT) How can i do this?

Columns in source
ORGANISATION NAME
EMP_ID
DEPARTMENT
SAL

Please note that I do not want to write this data to a temporary table and then perform a group by

Posted: Tue Jun 30, 2009 1:06 am
by ray.wurlod
DataStage is not a reporting tool.

Use a reporting tool for this task.

Sure you could do it, with an Aggregator, a Sort and a Head stage (to suggest one method), but you really need to get your head around using the right tool for the right job. If you don't, "they" will be requiring you to use DataStage to generate all their reports.

Posted: Tue Jun 30, 2009 2:14 am
by abhinavsuri
Thanks a lot for the reply. I feel kind of foolish right nw for not thinking of the head stage. Thanks once again.

Posted: Tue Jun 30, 2009 4:07 am
by abhinavsuri
But how can i get top3 for each group of (LOCATION,DEPT) using head stage. The head stage only gives top3 of entire data set

Posted: Tue Jun 30, 2009 7:43 am
by jcthornton
I'm curious too what the need is for just getting the top three by LOCATION, DEPT. It does sound like reporting, which - as said above - is done better by other tools designed for that purpose.

If you must do it in DS, there are multiple ways to go about it. For instance:

Sort by LOCATION (asc), DEPT (asc), Salary (Desc) with partitioning set on LOCATION, DEPT.

Use a transformer with a stage variable that counts up, resetting on each new LOCATION, DEPT combination. Put a constraint on your output that svCOUNT <= 3.

Posted: Wed Jul 01, 2009 2:08 am
by abhinavsuri
Thanks jc , I did it. Actually I had faced a similar challege before where I had to generate rowid' starting from 1 for each group. At that time I had loaded the sequential file to a TEMP Oracle table. Then I generated the ROW NUMBER's by using partition by clause. Was just just curious if we can achieve the same in datastage. I dont find normal ETL requirements that challenging anymore so just trying out new challenges in free time.

Posted: Wed Jul 01, 2009 4:38 pm
by ray.wurlod
It can easily be done in DataStage; I can think of at least three ways. But, since you like a challenge, I'll leave you to research them.

Here's another challenge - make sure your strategy works properly when the parallel job is running on more than one node.