Hi all,
I have a job that processes about 80 to 100 million records.
In a join stage, I have selected hash partitioning on the key column, col1 (about 2000 distinct values), and I am sorting and partitioning on that key column.
I have 2 doubts:
1. I have another column, col2, which has about 60 possible values. Which column should I use for partitioning, col1 or col2?
2. I have not sorted the incoming data. I was told that if I sort the data with a sort stage before this partitioning, performance will be better. Is that correct?
Thanks in advance!
Hash Partitioning - doubt
Use a sort stage before the join stage and set hash partitioning in the sort stage. Then use "same" partitioning in the join stage that follows. For data volumes this large, it is better to sort in a separate sort stage than to sort within the join stage.
To answer your first question: you have to sort and partition your data on the key columns you are joining on.
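To illustrate why the partitioning column has to be the join key, here is a minimal Python sketch. It is not DataStage code. It assumes the join key is col1, uses 4 hypothetical nodes and small made-up integer keys, and uses Python's built-in `hash` as a stand-in for the engine's hash partitioner.

```python
from collections import defaultdict

def partition(rows, key_index, nodes):
    """Assign each row to a node by hashing the chosen column."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key_index]) % nodes].append(row)
    return parts

def join_per_node(left_parts, right_parts, nodes):
    """Inner join on col1 (index 0), done separately inside each node."""
    out = []
    for node in range(nodes):
        right_by_key = defaultdict(list)
        for r in right_parts.get(node, []):
            right_by_key[r[0]].append(r)
        for l in left_parts.get(node, []):
            for r in right_by_key.get(l[0], []):
                out.append((l[0], l[1], r[1]))
    return out

NODES = 4  # hypothetical node count
left = [(k, k // 4) for k in range(8)]      # made-up rows: (col1, col2)
right = [(k, f"ref{k}") for k in range(8)]  # made-up rows: (col1, value)

# Both inputs partitioned on the join key col1: every match is found.
good = join_per_node(partition(left, 0, NODES), partition(right, 0, NODES), NODES)

# Left input partitioned on col2 instead: matching rows land on different nodes.
bad = join_per_node(partition(left, 1, NODES), partition(right, 0, NODES), NODES)

print(len(good), len(bad))
```

With this sample data the second run finds fewer matches than the first, because rows with the same col1 value no longer meet on the same node.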
Kris
Where's the "Any" key?-Homer Simpson
Hash partition and sort all inputs identically on all the join keys. Partitioning ensures that all instances of any one value occur on the same node; sorting ensures that changes in value are quickly detected, so that memory can be freed.
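The memory point can be sketched in Python as a merge join over inputs that are already sorted on the key. This is an illustration of the general technique under assumed sample data, not the join stage's actual implementation: because both inputs are sorted, a change in key value marks the end of a group, so only the current key's rows need to be held.

```python
from itertools import groupby
from operator import itemgetter

def merge_join(left_sorted, right_sorted):
    """Inner join of two inputs already sorted on the key in column 0."""
    key = itemgetter(0)
    left_groups = groupby(left_sorted, key)
    right_groups = groupby(right_sorted, key)
    lk, lg = next(left_groups, (None, None))
    rk, rg = next(right_groups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(left_groups, (None, None))    # no match, skip left group
        elif lk > rk:
            rk, rg = next(right_groups, (None, None))   # no match, skip right group
        else:
            buffered = list(rg)  # only this key's right-hand rows are held
            for l in lg:
                for r in buffered:
                    yield (lk, l[1], r[1])
            # Key change on both sides: the buffer for the old key is dropped.
            lk, lg = next(left_groups, (None, None))
            rk, rg = next(right_groups, (None, None))

# Made-up sample rows: (key, payload)
left = sorted([(2, "a"), (1, "b"), (2, "c"), (4, "d")])
right = sorted([(2, "x"), (3, "y"), (4, "z"), (2, "w")])
result = list(merge_join(left, right))
print(result)
```

On unsorted input the same join would have to keep every right-hand row available until the end, since a matching key could still arrive later.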
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.