How can I distribute data equally using a partitioning technique?

Arun Reddy
Participant
Posts: 5
Joined: Wed Nov 02, 2011 9:08 pm
Location: Hyderabad

How can I distribute data equally using a partitioning technique?

Post by Arun Reddy »

Hi,

Can anyone help me out?

I have data like this: deptno = 10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60.
I want to use a partitioning technique to distribute the data equally. Suppose I am using a 4-node configuration and my stage is an Aggregator.
Arun
jpraveen
Participant
Posts: 71
Joined: Sat Jun 06, 2009 7:10 am
Location: HYD

Hash Partition

Post by jpraveen »

Arun,

You can use a key-based partitioning method such as Hash and specify the key. You can also use a Sort stage before the Aggregator stage.
Jaypee
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Your request is ambiguous:

From the subject one might guess that your main aim is to distribute the rows in equal numbers to all nodes. With a non-unique key like your deptno, this is best achieved by a non-key-based partitioning method like round-robin, because this guarantees that all partitions get equal shares of the input data (to within one row).

If you want to perform aggregations using deptno as the group, however, you need to have all rows of the same group (deptno) in the same partition. So you absolutely need a key-based partitioning method to get a correct result. This may lead to unequal distribution of your input data (skewing). In your example (four nodes, six values appearing twice each) you will get, at best, two nodes with 2 rows each and two nodes with 4 rows each. Mind you, your example has equal numbers of members for each deptno. With unequal numbers, say one very large group and some smaller ones, skewing may become really significant.
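
The best-case arithmetic for this example can be checked with a short sketch. It is an illustration only: `best_case_loads` is a hypothetical helper that greedily places whole groups on the least-loaded partition. That is not how DataStage assigns keys; it only estimates how even the split can get when groups must stay together.

```python
from collections import Counter

def best_case_loads(keys, num_partitions):
    """Greedy estimate of the most even row counts when groups stay whole.

    Places the largest remaining group on the least-loaded partition.
    Hypothetical helper for illustration, not DataStage's partitioner.
    """
    loads = [0] * num_partitions
    for _, size in Counter(keys).most_common():
        target = loads.index(min(loads))
        loads[target] += size
    return sorted(loads)

deptno = [10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60]
print(best_case_loads(deptno, 4))  # [2, 2, 4, 4]
```

With six groups and four partitions, two partitions must take two groups each, so even the best case is skewed.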

In this case you might consider using combined keys for aggregation, if applicable.

As for partitioning: Auto will recognize the partitioning requirements of the Aggregator, so you should be all right with that. Switching manually to round-robin will leave you with wrong aggregation results and duplicate keys in your output. So choosing a partitioning method manually carries a certain risk.
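
Why round-robin breaks the aggregation can be seen in a small simulation. This is a plain-Python sketch, not DataStage code: `aggregate_per_partition` is a hypothetical stand-in for an Aggregator that counts rows per deptno inside each partition, and the key-based partitioner is simple modulo, chosen only for illustration.

```python
from collections import Counter

NUM_PARTITIONS = 4
deptno = [10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60]

def round_robin(rows, n):
    """Deal rows to partitions in turn, ignoring their values."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def key_based(rows, n):
    """Send each row to a partition derived from its key (modulo stand-in)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row % n].append(row)
    return parts

def aggregate_per_partition(parts):
    """Count rows per deptno within each partition, then collect the output."""
    out = []
    for part in parts:
        out.extend(sorted(Counter(part).items()))
    return out

rr_out = aggregate_per_partition(round_robin(deptno, NUM_PARTITIONS))
kb_out = aggregate_per_partition(key_based(deptno, NUM_PARTITIONS))
print(len(rr_out))  # 12 output rows: every deptno appears twice with count 1
print(len(kb_out))  # 6 output rows: every deptno appears once with count 2
```

The round-robin run is perfectly balanced (3 rows per partition) but produces duplicate keys with partial counts, which is exactly the wrong result described above.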

To check for skewing, set $APT_RECORD_COUNTS to true. You can then see the distribution of your records over the partitions in the job log in Director.

DataStage 8.7 will have the ability to override incorrect partitioning-methods unsuitable to a defined task (producing warnings in the job-logs when this happens). I am quite curious what results this will produce when 8.7 gets distributed to some more sites.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
Arun Reddy
Participant
Posts: 5
Joined: Wed Nov 02, 2011 9:08 pm
Location: Hyderabad

Re: Hash Partition

Post by Arun Reddy »

jpraveen wrote: You can use a key-based partitioning method such as Hash and specify the key. You can also use a Sort stage before the Aggregator stage.
Thanks for your reply, Praveen.

But if all rows with the same key go to one partition like that, where will the remaining data for deptno 50 and 60 go?
Arun
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Re: Hash Partition

Post by BI-RMA »

As I already stated: each key value results in a hash value that determines on which node the data is processed. If you consider a distribution by modulo, your values would all be processed by nodes 0 and 2 in a four-node configuration. With hash, it depends on the hashing algorithm used, which probably behaves much the same for a single integer key column.
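
The modulo case can be worked through in a few lines. This is a sketch of plain modulo arithmetic on the example values, not of DataStage's partitioner:

```python
deptno = [10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60]
NUM_NODES = 4

# Node number = key value modulo the number of nodes.
rows_per_node = {node: [] for node in range(NUM_NODES)}
for value in deptno:
    rows_per_node[value % NUM_NODES].append(value)

for node, rows in rows_per_node.items():
    print(node, rows)
# 0 [20, 20, 40, 40, 60, 60]
# 1 []
# 2 [10, 10, 30, 30, 50, 50]
# 3 []
```

Every deptno here is a multiple of 10, so the remainder modulo 4 is always 0 or 2, and nodes 1 and 3 stay empty.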
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
Arun Reddy
Participant
Posts: 5
Joined: Wed Nov 02, 2011 9:08 pm
Location: Hyderabad

Re: Hash Partition

Post by Arun Reddy »

Thank you, Ronald, for answering the question.

In this case the data is not equally distributed with the hash technique.
Arun
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Arun,

The goal of partitioning is this: Distribute the data as evenly as possible while at the same time meeting the requirements of the processing logic.

Because you must keep like-keyed records together in the same partition, you have no guarantee that you can evenly distribute your records--it is entirely dependent upon the distribution of key values and the number of partitions your job is using. The only exception is running your job with only one partition.

Hash partitioning uses an algorithm which determines what partition a record goes to based upon the number of partitions and the physical value of the data in your partition columns. The guarantee is that all records will go to a partition, and all records with identical key values will go to the same partition...NOT that they will be evenly distributed.
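
Both guarantees, and the missing third one, can be illustrated with a minimal sketch. `hash_partition` is a hypothetical function, and `zlib.crc32` is only a deterministic stand-in; the hash function DataStage actually uses is not assumed here.

```python
import zlib

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition from a hash of its key value.

    zlib.crc32 is an illustrative stand-in, not DataStage's algorithm.
    """
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % num_partitions].append(row)
    return parts

rows = [{"deptno": d} for d in (10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60)]
parts = hash_partition(rows, "deptno", 4)

# Guarantee 1: every record lands in exactly one partition.
assert sum(len(p) for p in parts) == len(rows)

# Guarantee 2: each key value appears in only one partition.
for d in {r["deptno"] for r in rows}:
    assert sum(any(r["deptno"] == d for r in p) for p in parts) == 1

# Not guaranteed: equal partition sizes.
print([len(p) for p in parts])
```

The printed sizes depend entirely on the hash function and the key values; nothing in the algorithm pushes them toward equality.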

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.