HOw to group by in datastage?

abhinavsuri · Post by **abhinavsuri** » Tue Jun 30, 2009 12:11 am

I have to read employee data from sequenetial file. Then I need to find the 3 persons who earn the highest salary in a department of each company.(i.e. we need to group by ORGANISATION NAME,DEPARTMENT) How can i do this?

Columns in source
ORGANISATION NAME
EMP_ID
DEPARTMENT
SAL

Please note that I do not want to write this data to a temporary table and then perform a group by

ray.wurlod · Post by **ray.wurlod** » Tue Jun 30, 2009 1:06 am

DataStage is not a reporting tool.

Use a reporting tool for this task.

Sure you could do it, with an Aggregator, a Sort and a Head stage (to suggest one method), but you really need to get your head around using the right tool for the right job. If you don't, "they" will be requiring you to use DataStage to generate all their reports.

abhinavsuri · Post by **abhinavsuri** » Tue Jun 30, 2009 2:14 am

Thanks a lot for the reply. I feel kind of foolish right nw for not thinking of the head stage. Thanks once again.

abhinavsuri · Post by **abhinavsuri** » Tue Jun 30, 2009 4:07 am

But how can i get top3 for each group of (LOCATION,DEPT) using head stage. The head stage only gives top3 of entire data set

jcthornton · Post by **jcthornton** » Tue Jun 30, 2009 7:43 am

I'm curious too what the need is for just getting the top three by LOCATION, DEPT. It does sound like reporting, which - as said above - is done better by other tools designed for that purpose.

If you must do it in DS, there are multiple ways to go about it. For instance:

Sort by LOCATION (asc), DEPT (asc), Salary (Desc) with partitioning set on LOCATION, DEPT.

Use a transformer with a stage variable that counts up, resetting on each new LOCATION, DEPT combination. Put a constraint on your output that svCOUNT <= 3.

abhinavsuri · Post by **abhinavsuri** » Wed Jul 01, 2009 2:08 am

Thanks jc , I did it. Actually I had faced a similar challege before where I had to generate rowid' starting from 1 for each group. At that time I had loaded the sequential file to a TEMP Oracle table. Then I generated the ROW NUMBER's by using partition by clause. Was just just curious if we can achieve the same in datastage. I dont find normal ETL requirements that challenging anymore so just trying out new challenges in free time.

ray.wurlod · Post by **ray.wurlod** » Wed Jul 01, 2009 4:38 pm

It can easily be done in DataStage; I can think of at least three ways. But, since you like a challenge, I'll leave you to research them.

Here's another challenge - make sure your strategy works properly when the parallel job is running on more than one node.