HOw to group by in datastage?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
abhinavsuri
Premium Member
Premium Member
Posts: 62
Joined: Thu Dec 28, 2006 11:54 pm

HOw to group by in datastage?

Post by abhinavsuri »

I have to read employee data from sequenetial file. Then I need to find the 3 persons who earn the highest salary in a department of each company.(i.e. we need to group by ORGANISATION NAME,DEPARTMENT) How can i do this?

Columns in source
ORGANISATION NAME
EMP_ID
DEPARTMENT
SAL

Please note that I do not want to write this data to a temporary table and then perform a group by
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

DataStage is not a reporting tool.

Use a reporting tool for this task.

Sure you could do it, with an Aggregator, a Sort and a Head stage (to suggest one method), but you really need to get your head around using the right tool for the right job. If you don't, "they" will be requiring you to use DataStage to generate all their reports.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abhinavsuri
Premium Member
Premium Member
Posts: 62
Joined: Thu Dec 28, 2006 11:54 pm

Post by abhinavsuri »

Thanks a lot for the reply. I feel kind of foolish right nw for not thinking of the head stage. Thanks once again.
abhinavsuri
Premium Member
Premium Member
Posts: 62
Joined: Thu Dec 28, 2006 11:54 pm

Post by abhinavsuri »

But how can i get top3 for each group of (LOCATION,DEPT) using head stage. The head stage only gives top3 of entire data set
jcthornton
Premium Member
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

I'm curious too what the need is for just getting the top three by LOCATION, DEPT. It does sound like reporting, which - as said above - is done better by other tools designed for that purpose.

If you must do it in DS, there are multiple ways to go about it. For instance:

Sort by LOCATION (asc), DEPT (asc), Salary (Desc) with partitioning set on LOCATION, DEPT.

Use a transformer with a stage variable that counts up, resetting on each new LOCATION, DEPT combination. Put a constraint on your output that svCOUNT <= 3.
abhinavsuri
Premium Member
Premium Member
Posts: 62
Joined: Thu Dec 28, 2006 11:54 pm

Post by abhinavsuri »

Thanks jc , I did it. Actually I had faced a similar challege before where I had to generate rowid' starting from 1 for each group. At that time I had loaded the sequential file to a TEMP Oracle table. Then I generated the ROW NUMBER's by using partition by clause. Was just just curious if we can achieve the same in datastage. I dont find normal ETL requirements that challenging anymore so just trying out new challenges in free time.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It can easily be done in DataStage; I can think of at least three ways. But, since you like a challenge, I'll leave you to research them.

Here's another challenge - make sure your strategy works properly when the parallel job is running on more than one node.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply