Hi
Can anyone help me out?
I have data like this: deptno = 10,10,20,20,30,30,40,40,50,50,60,60.
I want to use a partitioning technique to distribute the data equally; I am using a 4-node configuration and my stage is the Aggregator.
How can I distribute the data equally using a partitioning technique?
Moderators: chulett, rschirm, roy
- Participant
- Posts: 5
- Joined: Wed Nov 02, 2011 9:08 pm
- Location: Hyderabad
Hash Partition
Arun,
You can go for a key-based partitioning method like Hash, specify the key, and also use a Sort stage before the Aggregator stage.
Jaypee
Your request is ambiguous:
From the subject one might guess that your main aim is to distribute the rows in equal numbers to all nodes. With a non-unique key like your deptno this would be best achieved by a non-key-based partitioning method like round-robin, because that guarantees all nodes get equal shares of the input data.
If you want to perform aggregations using deptno as the grouping key, however, you need all values of the same group (deptno) in the same partition. So you absolutely need a key-based partitioning method to get a correct result. This may lead to unequal distribution of your input data (skewing). In your example (four nodes, six values appearing twice each) you will get, at best, two nodes with 2 rows each and two nodes with 4 rows each. Mind you, your example has equal numbers of members for each deptno; with unequal numbers - say one very large group and some smaller ones - skewing may become really significant.
In this case you might consider using combined keys for aggregation, if applicable.
As for partitioning: Auto will recognize the partitioning requirements for the Aggregator, so you should be all right with that. Switching manually to round-robin will leave you with wrong aggregation results and duplicate keys in your output, so choosing a partitioning method manually carries a certain risk.
To check for skewing, set $APT_RECORD_COUNTS to true. You can then see the distribution of your records over partitions in the Director log.
DataStage 8.7 will have the ability to override partitioning methods unsuitable for a defined task (producing warnings in the job log when this happens). I am quite curious what results this will produce once 8.7 gets distributed to more sites.
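The trade-off between an even spread and key-based grouping can be sketched numerically. Below is a minimal Python sketch (not DataStage code; simple modulo stands in for the actual hash function) comparing per-partition row counts for round-robin versus key-based partitioning of the sample data:

```python
# Compare row counts per partition for round-robin vs. key-based
# partitioning of the sample deptno data on a 4-node configuration.
rows = [10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60]
nodes = 4

# Round-robin: rows are dealt out in turn, so counts are always even.
rr = [0] * nodes
for i, _ in enumerate(rows):
    rr[i % nodes] += 1

# Key-based (simple modulo as a stand-in for hash): all rows with the
# same deptno land on one node, so counts can skew.
kb = [0] * nodes
for deptno in rows:
    kb[deptno % nodes] += 1

print("round-robin:", rr)  # [3, 3, 3, 3] -- perfectly even
print("key-based:  ", kb)  # [6, 0, 6, 0] -- skewed, two nodes idle
```

Round-robin spreads the 12 rows evenly but splits groups across nodes (wrong for aggregation); the key-based method keeps groups intact at the cost of skew.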
"It is not the lucky who are grateful.
It is the grateful who are happy." Francis Bacon
Re: Hash Partition
Thanks for your reply, Praveen.
jpraveen wrote: You can go for a key-based partitioning method like Hash, specify the key, and also use a Sort stage before the Aggregator stage.
But then all records with the same group key go to one partition - so where will the data for the remaining deptnos, 50 and 60, go?
Arun
Re: Hash Partition
As I already stated: each key value yields a hash value that determines on which node the data is processed. If you consider distribution by modulo, your values would all be processed by nodes 0 and 2 in a four-node configuration. With hash it depends on the hashing algorithm used, which with a single integer key column probably behaves much the same way.
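The modulo case described above can be checked directly (a Python sketch, not DataStage's actual partitioner): with four partitions, every deptno in the sample maps to node 0 or node 2.

```python
# Distribution by modulo over 4 partitions: every sample deptno maps
# to node 0 or node 2, leaving nodes 1 and 3 with no data.
for deptno in [10, 20, 30, 40, 50, 60]:
    print(deptno, "->", deptno % 4)
# 10 -> 2, 20 -> 0, 30 -> 2, 40 -> 0, 50 -> 2, 60 -> 0
```

This is why a modulo-style scheme skews badly here: the key values share a common factor with the partition count.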
"It is not the lucky who are grateful.
It is the grateful who are happy." Francis Bacon
Re: Hash Partition
Thank you, Ronald, for answering the question.
So in this case the data is not equally distributed with the hash technique.
Arun
Arun,
The goal of partitioning is this: Distribute the data as evenly as possible while at the same time meeting the requirements of the processing logic.
Because you must keep like-keyed records together in the same partition, you have no guarantee that you can evenly distribute your records--it is entirely dependent upon the distribution of key values and the number of partitions your job is using. The only exception is running your job with only one partition.
Hash partitioning uses an algorithm which determines what partition a record goes to based upon the number of partitions and the physical value of the data in your partition columns. The guarantee is that all records will go to a partition, and all records with identical key values will go to the same partition...NOT that they will be evenly distributed.
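The guarantee described above can be illustrated with a toy partitioner (a Python sketch; the MD5-based function below is an assumption for illustration, NOT DataStage's actual hash algorithm):

```python
import hashlib

def partition_of(key, n_partitions):
    """Toy key-based partitioner (not DataStage's algorithm):
    hash the key, then take the hash modulo the number of partitions."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_partitions

rows = [10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60]
counts = [0] * 4
for deptno in rows:
    counts[partition_of(deptno, 4)] += 1

# Guaranteed: identical keys always land in the same partition...
assert partition_of(10, 4) == partition_of(10, 4)
# ...but nothing guarantees an even spread; the counts depend entirely
# on the key values and the number of partitions, not on row order.
print(counts)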
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.