Hash Partitioning

Havoc
Participant
Posts: 110
Joined: Fri Nov 24, 2006 8:26 am

Hash Partitioning

Post by Havoc »

Hi,

I have a job design which looks like this:

Code:


DB2 stage ----->  Transformer  ------> (rest of the job)
(1 node)

Now the link from the DB2 stage to the Transformer uses the Hash partitioning method with a unique sort. The partitioning key mostly has distinct values, with some values recurring occasionally (e.g. 1, 2, 3, 5, 5, 6, 8, 9). All links in the rest of the job are set to Auto partitioning.
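For illustration, here is a minimal sketch (plain Python, not DataStage's actual hash function) of how hash partitioning by key behaves on a hypothetical 4-node configuration: each row is routed by a hash of its key value, so mostly-distinct keys spread across all partitions, while repeated values always land in the same partition.

Code:

from collections import defaultdict

# Toy hash partitioner - illustrative only, not DataStage's implementation.
keys = [1, 2, 3, 5, 5, 6, 8, 9]   # sample key values from the post
num_partitions = 4                # e.g. a hypothetical 4-node configuration

partitions = defaultdict(list)
for k in keys:
    partitions[hash(k) % num_partitions].append(k)

for p in sorted(partitions):
    print(f"partition {p}: {partitions[p]}")
# Distinct keys spread across the partitions; the two 5s land together.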

What I wanted to understand is: how will the data be partitioned after the Transformer stage? Will all the rows go into just one partition and propagate that way through the rest of the job?
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

If I understand your question, you want to know what "auto" partitioning does.

I usually explain "Auto" as "Don't change partitioning unless the job decides it must do so to function correctly".

In your example you've hash partitioned in the transformer. This sets up the initial partitioning for the job. Data is hashed into partitions based on the key field(s) specified. The rest of the job will keep that partitioning in place for all the other stages if at all possible because of the "Auto" settings.

However, there are cases in which DataStage may have to change partitioning to function correctly, including lookups, aggregations, and sorts. In those cases, with partitioning set to "Auto", it will change partitioning as needed to meet the job's requirements.
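As a rough illustration of why an aggregation forces key-based grouping (again a toy Python sketch, not DataStage itself): if rows with the same key are spread across partitions, each partition can only produce a partial result, so the engine has to repartition (or collect) on the grouping key before aggregating.

Code:

from collections import Counter

# Hypothetical rows (key values only) sitting in two different partitions.
partition_0 = ["A", "B", "A"]
partition_1 = ["A", "C", "B"]

# Aggregating each partition on its own gives only partial counts:
print(Counter(partition_0))     # A counted twice here, B once
print(Counter(partition_1))     # A, B and C counted once each here

# The rows must first be regrouped by key - which is what a hash
# repartition on the grouping key achieves - to get correct totals:
print(Counter(partition_0) + Counter(partition_1))   # A=3, B=2, C=1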

Note: as a "best practice" I always explicitly change partitioning when required instead of relying on "Auto" - this will help developers that follow you easily see where partitioning has been explicitly changed.

The only times partitioning will "go away" mid-job are when you:
1) dump records to sequential output (a flat file or non-partitioned table)
2) explicitly override a stage to "sequential" mode instead of parallel (stage -> advanced options -> execution mode) - see the sketch after this list
3) set a stage to use a single-node configuration file (stage -> advanced options -> node constraints)
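As a rough sketch of cases 2 and 3 (a Python toy, not DataStage): running a stage sequentially, or on a one-node configuration file, effectively collects every partition into a single stream, and everything downstream starts from that single partition unless it is explicitly repartitioned again.

Code:

from itertools import chain

# Hypothetical 4-way partitioned data arriving at a sequential stage.
partitions = [[1, 9], [2, 6], [3], [8]]

# A sequential (or single-node) stage sees one combined stream:
collected = list(chain.from_iterable(partitions))
print(collected)    # all rows now sit in a single partition

# With "Auto", downstream stages only spread the data out again if a
# later operation requires repartitioning.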

If you want to learn more about which partitioning methodology "Auto" is actually using in a job, read up on the $APT_DUMP_SCORE environment variable (mentioned in numerous previous posts) and the other debugging options available; they show what is actually going on in the background.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
Havoc
Participant
Posts: 110
Joined: Fri Nov 24, 2006 8:26 am

Post by Havoc »

asorrell wrote: If I understand your question, you want to know what "auto" partitioning does. [...]
Thanks a lot for that detailed reply, Andy :) . I found that post very helpful. What exactly would happen in the third case you mentioned? In the rest of the job there are lookup stages running on a single node. Would this result in the final data (which is being loaded to a dataset) residing in just one partition? I'm pretty confused because when I use the hash partitioning with the unique sort, the data lands in one partition, but when I remove it, the dataset is split across four partitions.