Assume I have a file with columns "Name", "Month", and "Salary", containing a company's employee names and the salary paid out to each for months 1-12. If I explicitly hash partition on "Name" (primary) and "Month" (secondary) in a multinode configuration, I know that joins on "Name" to datasets also partitioned on "Name" will be OK; but if I were to join on "Month" with a dataset explicitly partitioned on "Month", would I need to repartition?
I'm not at a system where I can experiment right now, otherwise I would test what happens, but I thought I might get some responses or opinions on how multi-column partitioning works from the denizens of DSXchange.
Effects of Partitioning on multiple columns
Moderators: chulett, rschirm, roy
Just after I posted this thread, the system I am working on came back up and I wrote a quick job to test this. I have now learned that DataStage takes a single hash of the combined keys and distributes accordingly. Thus, if I partition on "Name" and "Month", a record with "Arnd"/"Sep"/"100" might go to partition 0 while "Arnd"/"Jan"/"30" might hash to another partition.
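To make that concrete, here is a minimal sketch in plain Python of how hashing a combined key can separate records that share a primary key. This is an illustration only: the hash function, separator, and 8-way layout are my assumptions, not DataStage's actual internals.

```python
# Sketch: composite-key hash partitioning (NOT DataStage's real hash function).
# Because the hash covers BOTH columns, two rows with the same "Name" but
# different "Month" values need not land on the same partition.
import hashlib

def partition(keys, num_partitions=8):
    """Hash the concatenated key columns and map to a partition number."""
    digest = hashlib.md5("|".join(keys).encode()).hexdigest()
    return int(digest, 16) % num_partitions

for name, month, salary in [("Arnd", "Sep", "100"), ("Arnd", "Jan", "30")]:
    print(name, month, "-> partition", partition((name, month)))
```

With a Month-only partitioned dataset on the other input of a join, the same hash applied to "Month" alone would generally give yet another placement, which is why a repartition on the join key is needed.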
This tells me that in order for a join to function correctly in a multinode configuration, both inputs must be identically partitioned on the full set of hash keys; I had previously assumed, incorrectly, that a primary hashed key would always join with another stream that also had that primary hashed key, regardless of other keys.

Just for information, I created a job which used those 3 columns given above, with a fixed list of a couple of names, a cycle of all 12 months, and a random salary. This was written to a dataset on 8 nodes, explicitly partitioned and sorted on "Name" and "Month". The same "Name" value showed up in different partitions. To get the unique list from partition 0 (and then from each of the other partitions in turn), I used the command
Code: Select all
orchadmin dump -field Name -field Month -part 0 TestDS2.ds | sort -u -
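For anyone without a parallel engine handy, the experiment can be re-created in plain Python. The names, the hash function, and the 8-way layout below are my own stand-ins (DataStage's real hash differs), but the effect is the same: hashing on the combined (Name, Month) key scatters a given name across partitions.

```python
# Re-creation of the test job: a few fixed names x a cycle of 12 months,
# hashed on the combined (Name, Month) key across 8 partitions, then a check
# of which partitions each name ended up in. Hash function is a stand-in.
import hashlib
from collections import defaultdict

NAMES = ["Arnd", "Bea", "Carl"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
NUM_PARTS = 8

def partition(*keys):
    digest = hashlib.md5("|".join(keys).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTS

parts_per_name = defaultdict(set)
for name in NAMES:
    for month in MONTHS:
        parts_per_name[name].add(partition(name, month))

for name, parts in parts_per_name.items():
    print(name, "appears in partitions", sorted(parts))
```

Running this shows names spread over several partitions, matching what the `orchadmin dump ... | sort -u` check revealed on the real dataset.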