hi dsguru

swathik
Participant
Posts: 8
Joined: Sun Dec 30, 2007 12:31 am
Location: hyderabad

Post by swathik »

Suppose that, for a lookup, the primary data has 10 million records and the reference data also has 10 million records. Which type of partitioning should I use for the primary data, and which for the reference data?

Thanks in advance,
Swathi
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai

Post by Minhajuddin »

Welcome aboard!

Straight out of Roy's post about posting ;)
Hi All,
Time and again, people here post topics with a poor description in the topic/additional info.

"Weird behaviour"
"Please Help"
and any variations or topics of a similar nature are not acceptable topics.

Though I try fixing some of these posts, I can't keep up with the incoming traffic. :(
Please post descriptive subject lines so that people with the relevant experience can answer your questions appropriately.

Now to your actual question.
1) If your input has almost as many records as the reference dataset, you should use a Join stage rather than a Lookup stage.
2) The type of partitioning does not depend on the number of records at all. If you do any kind of lookup or aggregation, you should always use a keyed partitioning method (Hash or Modulus) on the key columns.
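
To illustrate point 2, here is a minimal Python sketch of why keyed partitioning matters. It is not DataStage code: the function names, the `id` key, and the sample rows are all made up for illustration, and it only models the general idea that each partition sees just its own slice of both inputs.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Keyed partitioning: rows with the same key value always share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used only because it is deterministic across runs.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[idx].append(row)
    return partitions


def round_robin_partition(rows, num_partitions):
    """Non-keyed partitioning: rows are dealt out in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def partitioned_lookup(primary_parts, reference_parts, key):
    """Each partition looks up only against its own slice of the reference data."""
    matched = []
    for p_rows, r_rows in zip(primary_parts, reference_parts):
        ref_index = {r[key]: r for r in r_rows}
        for row in p_rows:
            ref = ref_index.get(row[key])
            if ref is not None:
                matched.append({**row, **ref})
    return matched


# Hypothetical sample data: every primary row has a matching reference row.
primary = [{"id": i, "amount": i * 10} for i in range(1, 9)]
reference = [{"id": i, "name": f"cust{i}"} for i in reversed(range(1, 9))]

keyed = partitioned_lookup(
    hash_partition(primary, "id", 4), hash_partition(reference, "id", 4), "id"
)
unkeyed = partitioned_lookup(
    round_robin_partition(primary, 4), round_robin_partition(reference, 4), "id"
)

print(len(keyed), "matches with hash partitioning on both inputs")
print(len(unkeyed), "matches with round-robin partitioning on both inputs")
```

With the same hash on both inputs every row finds its match; with round robin, matching rows usually land in different partitions and are silently dropped.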
Minhajuddin

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Not "should" but "must": if you want correct results, you must use a key-based partitioning algorithm.

Note also that the Join stage requires that its inputs be sorted by the join key(s).
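
As a rough picture of why sorted input matters, here is a small Python sketch of a generic sort-merge inner join. It is an assumption that this resembles what a join on sorted inputs does internally; `merge_join` and the sample rows are hypothetical and only show the general technique.

```python
def merge_join(left, right, key):
    """Inner join of two lists of dicts that are assumed sorted ascending by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1  # left key is behind, advance left
        elif lk > rk:
            j += 1  # right key is behind, advance right
        else:
            # Find the run of right rows sharing this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Hypothetical sample data, deliberately out of order.
left = [{"id": 3, "amount": 30}, {"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
right = [{"id": 2, "name": "b"}, {"id": 3, "name": "c"}, {"id": 1, "name": "a"}]


def by_id(row):
    return row["id"]


sorted_result = merge_join(sorted(left, key=by_id), sorted(right, key=by_id), "id")
unsorted_result = merge_join(left, right, "id")

print(len(sorted_result), "rows joined when both inputs are sorted")
print(len(unsorted_result), "rows joined when the inputs are left unsorted")
```

Because the merge only ever moves forward through each input, a key that appears out of order is skipped and its match is lost.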

If both data sets are in the same database instance you may be better off performing the join there, where it may be able to be assisted by indexes. Also, DataStage would then only need to process the result of the join, which is likely to be fewer than the total rows you would need to bring into DataStage to perform the join there.
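
For the in-database option, a minimal sketch using Python's built-in `sqlite3` module shows the idea. The table names `primary_data` and `reference_data`, their columns, and the rows are invented for illustration; in practice the join would be written in the source SQL of whatever database actually holds both tables.

```python
import sqlite3

# In-memory database standing in for the shared database instance.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE primary_data (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE reference_data (id INTEGER PRIMARY KEY, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO primary_data VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40)],
)
conn.executemany(
    "INSERT INTO reference_data VALUES (?, ?)",
    [(2, "b"), (4, "d"), (5, "e")],
)

# The database performs the join (the primary keys are indexed),
# so only the joined rows need to leave the database.
joined = conn.execute(
    """
    SELECT p.id, p.amount, r.name
    FROM primary_data AS p
    JOIN reference_data AS r ON r.id = p.id
    ORDER BY p.id
    """
).fetchall()
conn.close()

print(joined)
```

Only two joined rows come back here, instead of the seven rows that would have to be extracted to perform the same join outside the database.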
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.