Effects of Partitioning on multiple columns

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Effects of Partitioning on multiple columns

Post by ArndW »

Assuming I have a file with columns "Name", "Month" and "Salary" containing a company's employee names and their paid out salary values for months 1-12. If I explicitly hash partition on "Name" (primary) and "Month" (secondary) with a multinode configuration I know that if I have any joins on "Name" to datasets also partitioned on "Name" then I will be OK, but if I were to do a join based on "Month" with a dataset explicitly partitioned on "Month" would I need to repartition?

I'm not at a system where I can experiment right now otherwise I could test out what happens, but thought I might get some responses or opinions on how multiple column partitioning works from denizens of DSXchange.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Just after I posted this thread the system I am working on came back up and I wrote a quick job to test this. I have now learned that DataStage takes a single hash of the combined keys and distributes accordingly. Thus, if I partition on "Name" and "Month" then a record with "Arnd"/"Sep"/"100" might go to partition 0 and a "Arnd"/Jan"/"30" might hash to another partition.

This tells me that in order for a join to function correctly in a multinode configuration both inputs must be identically partitioned; I had previously incorrectly assumed incorrectly that a primary hashed key would always join with another stream that also had that primary hashed key, regardless of other keys.

Just for information, I created a job which used those 3 columns given above with a fixed list of a couple of names, a cycle of all 12 months and a random salary, this was written to dataset with 8 nodes and explicitly partitioned and sorted on "Name" and "Month". I then used the commands

Code: Select all

orchadmin dump -field Name -field Month -part 0 TestDS2.ds | sort -u -
to get the unique list from Partition 0, then repeated for the other partitions. The same "Name" field showed up in different partitions.
csc.datastage
Premium Member
Premium Member
Posts: 23
Joined: Wed Sep 10, 2008 4:46 pm
Location: USA

Post by csc.datastage »

Thanks for posting this.
This is important piece of info...

Take care!
Focus on ABC: Audit , Balance and Control..
Post Reply