Assume I have a file with columns "Name", "Month", and "Salary", containing a company's employee names and the salary paid out to each for months 1-12. If I explicitly hash partition on "Name" (primary) and "Month" (secondary) in a multinode configuration, I know that joins on "Name" to datasets also partitioned on "Name" will be OK; but if I were to join on "Month" with a dataset explicitly partitioned on "Month", would I need to repartition?
I'm not at a system where I can experiment right now, otherwise I would test what happens, but I thought I might get some responses or opinions on how multi-column partitioning works from the denizens of DSXchange.
Effects of Partitioning on multiple columns
Moderators: chulett, rschirm, roy
Just after I posted this thread, the system I am working on came back up and I wrote a quick job to test this. I have now learned that DataStage takes a single hash of the combined keys and distributes accordingly. Thus, if I partition on "Name" and "Month", a record with "Arnd"/"Sep"/"100" might go to partition 0 while "Arnd"/"Jan"/"30" might hash to another partition.
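To make that concrete, here is a minimal sketch in plain Python of how hashing a combined key can separate records that share a primary key. This is an illustration only: the hash function, separator, and 8-way layout are my assumptions, not DataStage's actual internals.

```python
# Sketch: composite-key hash partitioning (NOT DataStage's real hash function).
# Because the hash covers BOTH columns, two rows with the same "Name" but
# different "Month" values need not land on the same partition.
import hashlib

def partition(keys, num_partitions=8):
    """Hash the concatenated key columns and map to a partition number."""
    digest = hashlib.md5("|".join(keys).encode()).hexdigest()
    return int(digest, 16) % num_partitions

for name, month, salary in [("Arnd", "Sep", "100"), ("Arnd", "Jan", "30")]:
    print(name, month, "-> partition", partition((name, month)))
```

With a Month-only partitioned dataset on the other input of a join, the same hash applied to "Month" alone would generally give yet another placement, which is why a repartition on the join key is needed.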
This tells me that in order for a join to function correctly in a multinode configuration, both inputs must be identically partitioned on the full set of hash keys; I had previously assumed, incorrectly, that a primary hashed key would always join with another stream that also had that primary hashed key, regardless of other keys.

Just for information, I created a job which used those 3 columns given above, with a fixed list of a couple of names, a cycle of all 12 months, and a random salary. This was written to a dataset on 8 nodes, explicitly partitioned and sorted on "Name" and "Month". The same "Name" value showed up in different partitions. To get the unique list from partition 0 (and then from each of the other partitions in turn), I used the command
Code: Select all
orchadmin dump -field Name -field Month -part 0 TestDS2.ds | sort -u -
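For anyone without a parallel engine handy, the experiment can be re-created in plain Python. The names, the hash function, and the 8-way layout below are my own stand-ins (DataStage's real hash differs), but the effect is the same: hashing on the combined (Name, Month) key scatters a given name across partitions.

```python
# Re-creation of the test job: a few fixed names x a cycle of 12 months,
# hashed on the combined (Name, Month) key across 8 partitions, then a check
# of which partitions each name ended up in. Hash function is a stand-in.
import hashlib
from collections import defaultdict

NAMES = ["Arnd", "Bea", "Carl"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
NUM_PARTS = 8

def partition(*keys):
    digest = hashlib.md5("|".join(keys).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTS

parts_per_name = defaultdict(set)
for name in NAMES:
    for month in MONTHS:
        parts_per_name[name].add(partition(name, month))

for name, parts in parts_per_name.items():
    print(name, "appears in partitions", sorted(parts))
```

Running this shows names spread over several partitions, matching what the `orchadmin dump ... | sort -u` check revealed on the real dataset.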