Partition methods vs performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
dsuser_cai
Premium Member
Premium Member
Posts: 151
Joined: Fri Feb 13, 2009 4:19 pm

Partition methods vs performance

Post by dsuser_cai »

Hi

I have a question regarding partition methods and performance. I have simple job that extracts data from an oracle table (three columns) and use an aggregator stage to sum and group and then load into an oracle table (i create the target table).

Code: Select all

oracle_stage----->agg_stage---->oracle_stage
I tried the three partitions: Hash, modulus, round robin.
Colum names: amount, new_seq, old_seq.
total rows: 16Million.
Did not use sort stage but used the sort option in the Aggregator stage (sorted by new_seq column.)

this is the run time for the job with different partition methods:

Hash Partition: 12 minutes
Round Robin: 10minutes, 44 seconds
Modulus: 11 minutes, 25 seconds.

Also for all the three times I got a warning message:
Aggregator_stg: Hash table has grown to xxxx entries.

I tried to check the forum and got that (viewtopic.php?t=126271&highlight=Hash+t ... s+grown+to)

aggregation is done based on hash table by default -> can anybody explain me about this warning.

Thanks in advance
Thanks
Karthick
dsuser_cai
Premium Member
Premium Member
Posts: 151
Joined: Fri Feb 13, 2009 4:19 pm

Post by dsuser_cai »

Im sorry for the incomplete post:

My question is
1) which partition would be best for my situation:
amount(decimal), new_Seq (double), old_seq(double)

im just summing the amount column and groupin by old and new_seq column.
Thanks
Karthick
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There are two aggregation methods - hash and sort. The former builds a hash table in memory containing the result set, the latter relies on sorted input to avoid the need to do so. There is a memory consumption limit for the hash table at which the alert message is triggered.

Modulus algorithm is intended for integers only, but might just work for a decimal key. Any other data type modulus is not guaranteed to provide key adjacency, so performance ceases to be an issue - unless wrong results are acceptable.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Karthick,

Ok - first of all, there are more important impacts than performance on selecting a partitioning method. In your case, since you are doing an aggregation, you need to insure that each "group" that is being aggregated is using the same node. If the data is not grouped correctly (which I can almost guarantee happened for "Round Robin") then your results will be wrong, no matter how fast they are.

Though it is hard to make a recommendation, because there isn't enough detail about data and groups, in general you can try using a hashing algorithm and hashing on the field that you are grouping for your aggregation calculation.

Secondly, the warning about "Hash table growing to nnnn entries" is because it is keep all your data in memory while it is processing the entire file, resulting in a very large (hashed) memory structure. IF your data is already sorted on the aggregation key, then go into your aggregation stage and change Options: Method:"Hash" to Options: Method: "Sort". Note: if your data is NOT sorted correctly, this will create bad totals. However, if your data is sorted, it should improve performance dramatically as well as get rid of the warning messages.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
dsuser_cai
Premium Member
Premium Member
Posts: 151
Joined: Fri Feb 13, 2009 4:19 pm

Post by dsuser_cai »

Hi

That was a very clear explanation, I will do more research and let you know about the performance. untill then I will keep this open.
Thanks
Karthick
Post Reply