Guys,
I have a job which does the following.
SeqFile --> (copy - Aggregator), (copy -- linkedsort) ---> Joiner -- > DataSet.
The SeqFile has around 10,00,000 records. In the Aggregator stage I am performing Aggregation type: CountRows on the keys (AcctId, CustId). I had set the Group Method to Hash.
The job takes 26 minutes to complete. The config file has 2 nodes. When I changed the Group Method to Sort, the same job completed in fewer minutes.
My question: when is it recommended to use Group Method = Hash? Any suggestions would help.
Aggregator : Group method = Hash
It really depends on how many distinct values of the keys exist in the data. The hash table that is built in memory has one row per distinct value. If you have many distinct values (small counts) then I would recommend Group Method = Sort. 1 lakh of records is not a large quantity for the Sort stage to manage.
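To illustrate the trade-off described above, here is a minimal Python sketch of the two grouping strategies. It is not how the Aggregator stage is implemented internally; the function names and sample rows are hypothetical and only show why memory use in the hash approach grows with the number of distinct key values, while the sort approach needs pre-sorted input but holds only one group at a time.

```python
from itertools import groupby
from operator import itemgetter


def count_rows_hash(rows, key):
    """Hash-style grouping: keeps one counter per distinct key in memory."""
    counts = {}
    for row in rows:
        k = key(row)
        counts[k] = counts.get(k, 0) + 1
    return counts


def count_rows_sorted(sorted_rows, key):
    """Sort-style grouping: input must already be sorted on the key.

    Only the current group is tracked, so memory does not grow with
    the number of distinct keys.
    """
    for k, group in groupby(sorted_rows, key=key):
        yield k, sum(1 for _ in group)


# Hypothetical sample data using the key columns from the question.
rows = [
    {"AcctId": 1, "CustId": 10},
    {"AcctId": 2, "CustId": 20},
    {"AcctId": 1, "CustId": 10},
    {"AcctId": 3, "CustId": 30},
]
key = itemgetter("AcctId", "CustId")

hash_counts = count_rows_hash(rows, key)
sort_counts = dict(count_rows_sorted(sorted(rows, key=key), key))
```

Both approaches give the same counts; the difference is that the dictionary in `count_rows_hash` ends up with one entry per distinct (AcctId, CustId) pair, which is the cost that becomes significant when most keys are unique.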
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Aggregate Method
If your source data is small (in the thousands of rows), you can go for Hash. If the source data is huge (well beyond thousands of rows), you should go for the Sort method only; otherwise the job throws warnings.
Cheers.........