Aggregator + Performance

dsuser_cai · Post by **dsuser_cai** » Sat Oct 24, 2009 1:37 pm

Hi

I have the following data:

Loan_Id   Amount  Produts_Id  Store_Id
1             1000.9   55              12
1              .082     55              12
2             2000     55               32
3              0.8776  55              05
2             9.8        55              32
3            1000      55               05

the total number of records is 28.4 million.
No Of Nodes: 2
stages used:

Code: Select all

 oracle---->agg (group by loan_id, product_id, store_id, sum(amount) partition was set to auto ----> sequential file

the total run time for this job was 11.06 minutes.

now i modified the job with the following:

Code: Select all

 oracle---->sort_stage (order by loan_id)--->agg (group by loan_id, product_id, store_id, sum(amount) partition was set to Hash (loan_id, products_id, store_id) ----> sequential file

and the run time was 11.01 minutes.
i do not see any significat improvement in the performance. can somebody help me where im wrong.

This is the only job that is running in the environment, no other jobs or applications are running in the environment.

dsuser_cai · Post by **dsuser_cai** » Sat Oct 24, 2009 1:40 pm

The oracle stage has a simple select query.

chulett · Post by **chulett** » Sat Oct 24, 2009 2:51 pm

Why not just do all that in Oracle? And why would you expect adding the overhead of a sort stage sorting on one field to improve overall performance? If you want to go that route, you'd need to sort by all three grouping columns (not just the first one) and somehow make sure the Aggregator knew you'd done that. It looks like PX would need the optional Method to be set to Sort for that. Make sure you read the help text for Method when you're in there, and understand the prerequisites.

dsuser_cai · Post by **dsuser_cai** » Sat Oct 24, 2009 3:46 pm

Yeah i tried in Oracle, and the entire run time was only 5.02 minutes. thanks criag.

ray.wurlod · Post by **ray.wurlod** » Sat Oct 24, 2009 4:43 pm

You probably got the wrong answers using Auto partitioning.

mgendy · Post by **mgendy** » Tue Oct 27, 2009 6:55 am

You probably got the wrong answers using Auto partitioning

first i agree with what ray said , you may get wrong results using auto partitioning while you are using Hash aggregator , that propably happen with massive number of data

second , why do you expected more performance when using sorting and partitioning explicity ? it doesn't happend in all times because auto partitioning method by default try to fine the best partitioning method to use to get the most performance , so there are many casese 'like your case' while change the method of partitioning and sorting data doesn't imporve performance

Tird , just reminder to change agregator method to sort 'from agregator stage --> options --> mthod = sort' while using EXPLICIT sort , and partitioning , that may imporve the performance in your case , because sorting data and using hash partition method , does not make any effect to hash agregator type

Regards