Hash Partitioning - doubt

vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Hash Partitioning - doubt

Post by vij »

Hi all,

I have a job which processes about 80 to 100 million records.
In a Join stage, I have selected hash partitioning on the key column, col1 (about 2000 distinct possible values), and I am sorting and partitioning on that key column.

I have two questions:

1. I have another column, col2, which has about 60 possible values. Which column should I use for partitioning, col1 or col2?

2. Also, I have not sorted the incoming data. I was told that if I sort the data using a Sort stage before this partitioning, performance will be better. Is that correct?

Thanks in advance!
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

Use a Sort stage before the Join stage and use hash partitioning in the Sort stage. Then use "Same" partitioning in the Join stage that follows. For data volumes this large, it is better to use a separate Sort stage than to sort within the Join stage.

In answer to your first question: you have to sort and partition your data on the key column(s) you are joining on.
Kris

Where's the "Any" key?-Homer Simpson
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Hash partition and sort identically on all inputs, on all the join keys. Partitioning ensures that all instances of any one value occur on the same node; sorting ensures that changes in value are quickly detected and memory can be freed.
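
To illustrate the idea outside DataStage, here is a rough Python sketch. It is not how the parallel engine is implemented; the function names (hash_partition, merge_join, partitioned_join) and the use of Python's built-in hash() are assumptions made purely for illustration. Both inputs are hash partitioned on the join key, each partition is sorted on that key, and each pair of partitions is then joined with a merge that only ever holds the current key group.

```python
def hash_partition(rows, key, n_parts):
    """Place each row in a partition chosen from a hash of its key value."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # Equal key values always hash to the same partition ("node").
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def merge_join(left, right, key):
    """Inner join of two lists that are already sorted on key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the group of right rows with this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the group with every right row in it.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            # The key value has changed, so this group is no longer needed.
            j = j_end
    return out


def partitioned_join(left, right, key, n_parts=4):
    """Hash partition and sort both inputs identically, then join per partition."""
    result = []
    for lp, rp in zip(hash_partition(left, key, n_parts),
                      hash_partition(right, key, n_parts)):
        lp.sort(key=lambda r: r[key])
        rp.sort(key=lambda r: r[key])
        result.extend(merge_join(lp, rp, key))
    return result
```

Because both inputs go through the same partitioning function on the same key, matching rows can only ever meet in the same partition, and because each partition is sorted, the merge can drop a key group as soon as the key value changes.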
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The data sets input to the Join stage must be key partitioned and sorted. You might have noticed this in the documentation. So if your join key is col1 and you partition on col2, it won't help you.
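
A small Python sketch of why this matters (illustrative only; the helper names, the sample rows and the use of Python's hash() are assumptions, not anything taken from DataStage). Both inputs are partitioned on one column and each pair of partitions is then joined on col1, so rows that sit in different partitions can never be matched.

```python
def hash_partition(rows, key, n_parts):
    """Place each row in a partition chosen from a hash of its key value."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def count_matches(left, right, part_key, join_key, n_parts=4):
    """Partition both inputs on part_key, then join each partition pair on join_key."""
    matches = 0
    for lp, rp in zip(hash_partition(left, part_key, n_parts),
                      hash_partition(right, part_key, n_parts)):
        # Each "node" sees only its own partition of each input.
        for l in lp:
            for r in rp:
                if l[join_key] == r[join_key]:
                    matches += 1
    return matches


# Hypothetical sample data: col1 is the join key, col2 is unrelated to it
# and differs between the two inputs for the same col1 value.
left = [{"col1": i, "col2": i % 3} for i in range(100)]
right = [{"col1": i, "col2": (i + 1) % 3} for i in range(100)]

on_col1 = count_matches(left, right, part_key="col1", join_key="col1")
on_col2 = count_matches(left, right, part_key="col2", join_key="col1")
print(on_col1, on_col2)
```

With this sample data, partitioning on col1 finds every matching pair, while partitioning on col2 sends matching col1 values to different partitions, so the join silently misses them.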
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'