Aggregator : Group method = Hash

sridharvis
Premium Member
Posts: 26
Joined: Thu Apr 17, 2008 1:38 pm
Location: Chennai

Post by sridharvis »

Guys,

I have a job which does the following.

SeqFile --> (copy - Aggregator), (copy -- linkedsort) ---> Joiner -- > DataSet.

The SeqFile has around 10,00,000 (1 million) records. In the Aggregator stage I am performing Aggregation Type = Count Rows, grouped on the keys (AcctId, CustId). I had set the Group Method to Hash.

The job takes 26 minutes to complete with a 2-node config file. When I changed the Group Method to Sort, the same job completed in fewer minutes.

My question: when is it recommended to use Group Method = Hash? Any suggestions would help.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It really depends on how many distinct values of the keys exist in the data. The hash table that is built in memory has one row per distinct value. If you have many distinct values (small counts) then I would recommend Group Method = Sort. 1 lakh of records is not a large quantity for the Sort stage to manage.
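
To make the trade-off concrete, here is a rough Python sketch of the two grouping strategies. It is only an illustration of the general idea, not how the Aggregator stage is implemented internally; the function names and the sample rows are made up for the example. The hash version keeps one in-memory entry per distinct (AcctId, CustId) pair, while the sort version assumes the input is already sorted on the keys and only ever holds the current group's count.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_rows_hash(rows, key):
    """Hash-style grouping: one in-memory entry per distinct key value.

    Input order does not matter, but memory grows with the number
    of distinct keys.
    """
    counts = Counter()
    for row in rows:
        counts[key(row)] += 1
    return dict(counts)


def count_rows_sort(sorted_rows, key):
    """Sort-style grouping: input must already be sorted on the key.

    Only the current group's running count is held at any time.
    """
    for k, group in groupby(sorted_rows, key=key):
        yield k, sum(1 for _ in group)


# Hypothetical sample data using the key columns from the question.
rows = [
    {"AcctId": 1, "CustId": 10},
    {"AcctId": 2, "CustId": 20},
    {"AcctId": 1, "CustId": 10},
    {"AcctId": 3, "CustId": 30},
    {"AcctId": 2, "CustId": 20},
    {"AcctId": 1, "CustId": 10},
]
key = itemgetter("AcctId", "CustId")

hash_counts = count_rows_hash(rows, key)
sort_counts = dict(count_rows_sort(sorted(rows, key=key), key))
```

Both approaches give the same counts. The difference is where the cost goes: the hash version pays in memory per distinct key, the sort version pays for the sort up front and then streams.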
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Barath
Participant
Posts: 17
Joined: Mon Sep 29, 2008 4:00 am
Location: Mumbai

Aggregate Method

Post by Barath »

If your source data is small (in the thousands of rows), you can go for Hash. If the source data is huge (well beyond thousands of rows), you should go for the Sort method only; otherwise it throws warnings.

Cheers......... :lol: