Suppose the primary input for a lookup has 10 million records and the reference data also has 10 million records. Which type of partitioning should be used for the primary data, and which for the reference data?
Thanks in advance,
Swathi
hi dsguru
Welcome aboard!
Straight out of Roy's post about posting:
Please post descriptive subject lines to get people with the relevant experience to answer your questions appropriately.

Hi All,
Time and again people here post topics with poor descriptions in the topic/additional info.
"Weird behaviour"
"Please Help"
and any variation or topics of a similar nature are not acceptable topics.
Though I try to fix some of these posts, I can't keep up with the incoming traffic.
Now to your actual question.
1) If you have almost the same number of records in your input as in the reference dataset, you should use a Join stage.
2) The type of partitioning does not depend on the number of records at all. For any kind of lookup or aggregation, you should always use keyed partitioning (Hash, Modulus).
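To illustrate why keyed partitioning matters here, below is a minimal Python sketch (not DataStage code; the data and function names are invented for illustration). With hash partitioning on the join key, all rows sharing a key land in the same partition, so each pair of partitions can be joined independently and no matches are missed.

```python
# Sketch: hash partitioning on the join key for a parallel join.
# If the inputs were round-robin partitioned instead, matching rows
# could land in different partitions and the join would drop matches.

def partition(rows, key_index, n_partitions):
    """Assign each row to a partition based on a hash of its key column."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % n_partitions].append(row)
    return parts

primary = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
reference = [(1, "x"), (3, "y")]

# Both inputs are partitioned on the same key with the same hash,
# so matching keys are guaranteed to be co-located.
p_parts = partition(primary, 0, 2)
r_parts = partition(reference, 0, 2)

joined = []
for pp, rp in zip(p_parts, r_parts):
    ref_map = {r[0]: r for r in rp}  # build a per-partition lookup table
    for row in pp:
        if row[0] in ref_map:
            joined.append((row, ref_map[row[0]]))
```

The same reasoning is why both links into a Join (or Lookup with large reference data) should be hash-partitioned on the join key, not just one of them.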
Minhajuddin
Not "should", if you want correct results, but "must" use a key-based partitioning algorithm.
Note also that the Join stage requires that its inputs be sorted by the join key(s).
If both data sets are in the same database instance you may be better off performing the join there, where it may be able to be assisted by indexes. Also, DataStage would then only need to process the result of the join, which is likely to be fewer than the total rows you would need to bring into DataStage to perform the join there.
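The sort requirement mentioned above can be seen in a merge join, the algorithm a Join stage effectively performs. Below is a minimal Python sketch (not DataStage code, and simplified to one match per key); it only works because both inputs arrive sorted on the join key.

```python
# Sketch: a merge join over two key-sorted inputs. Each input is scanned
# once; advancing the pointer with the smaller key is only correct
# because both lists are sorted on the join key.

def merge_join(left, right, key=lambda r: r[0]):
    """Join two key-sorted lists of rows on their key (one match per key)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1          # left key too small: no match possible, advance
        elif kl > kr:
            j += 1          # right key too small: advance
        else:
            out.append((left[i], right[j]))
            i += 1
            j += 1
    return out

result = merge_join([(1, "a"), (2, "b"), (3, "c")],
                    [(2, "x"), (3, "y"), (5, "z")])
# → [((2, 'b'), (2, 'x')), ((3, 'c'), (3, 'y'))]
```

Hand the same inputs to this function unsorted and matches are silently lost, which is exactly why the Join stage insists on sorted inputs.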
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.