Aggregator : Group method = Hash

sridharvis · Post by **sridharvis** » Sun Oct 04, 2009 12:42 pm

Guys,

I have a job which does the following.

SeqFile --> (Copy --> Aggregator), (Copy --> linked Sort) --> Joiner --> DataSet.

The SeqFile has around 10,00,000 records. In the Aggregator stage I am performing Aggregation type: CountRows based on the keys (AcctId, CustId). I had set the Group Method to Hash.

The job takes 26 minutes to complete. The config file has 2 nodes. When I changed the Group Method to Sort, the same job completed in fewer minutes.

My question: when is it recommended to use Group Method = Hash? Any suggestions would help.

ray.wurlod · Post by **ray.wurlod** » Sun Oct 04, 2009 1:39 pm

It really depends on how many distinct values of the keys exist in the data. The hash table that is built in memory has one row per distinct value. If you have many distinct values (small counts) then I would recommend Group Method = Sort. 1 lakh of records is not a large quantity for the Sort stage to manage.
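
To illustrate the difference in memory behaviour, here is a rough Python sketch (not DataStage code, and not how the Aggregator stage is actually implemented). The function names and the sample rows are made up for illustration. The hash approach keeps one counter per distinct key in memory; the sort approach assumes the input is already sorted on the keys and only holds the counter for the current group.

```python
from itertools import groupby
from operator import itemgetter


def count_rows_hash(rows, key):
    """Hash-style grouping: one in-memory entry per distinct key value."""
    counts = {}
    for row in rows:
        k = key(row)
        counts[k] = counts.get(k, 0) + 1
    return counts


def count_rows_sorted(sorted_rows, key):
    """Sort-style grouping: input must already be sorted on the key.

    Only the counter for the current group is held at any time.
    """
    for k, group in groupby(sorted_rows, key=key):
        yield k, sum(1 for _ in group)


# Hypothetical sample rows in the form (AcctId, CustId).
rows = [(1, 10), (2, 20), (1, 10), (3, 30), (2, 20), (1, 10)]
key = itemgetter(0, 1)

hash_counts = count_rows_hash(rows, key)
sort_counts = dict(count_rows_sorted(sorted(rows, key=key), key))
```

Both give the same counts. With many distinct (AcctId, CustId) pairs the dictionary in the hash version grows with the number of groups, while the sorted version pays for the sort up front but keeps the grouping step small.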

Barath · Post by **Barath** » Mon Oct 05, 2009 1:52 am

If your source data is small, meaning thousands of records, you can go for Hash. If the source data is huge, meaning more than thousands, you should go for the Sort method only; otherwise it throws warnings.

Cheers.........
