Suppose the primary input for a lookup has 10 million records and the reference data also has 10 million records. Which type of partitioning should be used for the primary data, and which for the reference data?
Thanks in advance,
Swathi
hi dsguru
Welcome aboard!
Straight out of Roy's post about posting:
Please post descriptive subject lines to get people with the relevant experience to answer your questions appropriately.

Hi All,
Time and again people here post topics with poor descriptions in the topic/additional info.
"Weird behaviour"
"Please Help"
and any variation or topics of a similar nature are not acceptable topics.
Though I try to fix some of these posts, I can't keep up with the incoming traffic.
Now to your actual question.
1) If you have almost the same number of records in your input as in the reference dataset, you should use a Join stage.
2) The type of partitioning does not depend on the number of records at all. For any kind of lookup or aggregation, you should always use keyed partitioning (Hash, Modulus).
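To illustrate why keyed partitioning matters here, below is a minimal Python sketch (not DataStage code; the data and function names are invented for illustration). With hash partitioning on the join key, all rows sharing a key land in the same partition, so each pair of partitions can be joined independently and no matches are missed.

```python
# Sketch: hash partitioning on the join key for a parallel join.
# If the inputs were round-robin partitioned instead, matching rows
# could land in different partitions and the join would drop matches.

def partition(rows, key_index, n_partitions):
    """Assign each row to a partition based on a hash of its key column."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % n_partitions].append(row)
    return parts

primary = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
reference = [(1, "x"), (3, "y")]

# Both inputs are partitioned on the same key with the same hash,
# so matching keys are guaranteed to be co-located.
p_parts = partition(primary, 0, 2)
r_parts = partition(reference, 0, 2)

joined = []
for pp, rp in zip(p_parts, r_parts):
    ref_map = {r[0]: r for r in rp}  # build a per-partition lookup table
    for row in pp:
        if row[0] in ref_map:
            joined.append((row, ref_map[row[0]]))
```

The same reasoning is why both links into a Join (or Lookup with large reference data) should be hash-partitioned on the join key, not just one of them.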
Minhajuddin
Not "should", if you want correct results, but "must" use a key-based partitioning algorithm.
Note also that the Join stage requires that its inputs be sorted by the join key(s).
If both data sets are in the same database instance you may be better off performing the join there, where it may be able to be assisted by indexes. Also, DataStage would then only need to process the result of the join, which is likely to be fewer than the total rows you would need to bring into DataStage to perform the join there.
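The sort requirement mentioned above can be seen in a merge join, the algorithm a Join stage effectively performs. Below is a minimal Python sketch (not DataStage code, and simplified to one match per key); it only works because both inputs arrive sorted on the join key.

```python
# Sketch: a merge join over two key-sorted inputs. Each input is scanned
# once; advancing the pointer with the smaller key is only correct
# because both lists are sorted on the join key.

def merge_join(left, right, key=lambda r: r[0]):
    """Join two key-sorted lists of rows on their key (one match per key)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1          # left key too small: no match possible, advance
        elif kl > kr:
            j += 1          # right key too small: advance
        else:
            out.append((left[i], right[j]))
            i += 1
            j += 1
    return out

result = merge_join([(1, "a"), (2, "b"), (3, "c")],
                    [(2, "x"), (3, "y"), (5, "z")])
# → [((2, 'b'), (2, 'x')), ((3, 'c'), (3, 'y'))]
```

Hand the same inputs to this function unsorted and matches are silently lost, which is exactly why the Join stage insists on sorted inputs.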
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.