Which partition type I need to use in the JOIN stage
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 161
- Joined: Mon Mar 17, 2008 10:33 am
Which partition type I need to use in the JOIN stage
Hi guys, someone could help me? I need to do a JOIN, so I usually do this in two ways.
1. I select the patition type as Hash and I select my keys. I do this in both links (in the main link and in the reference link)
2. I select the partition type as Round robin in my main link and I select the partition type as Entire in my reference link.
I always did this in my JOBs and they always works, but today I did this and the result was wrong. The job only worked after I set the option Perform sort in both links.
Someone could explain me why this is happening?
Thanks,
Igor Bastos Martins
1. I select the patition type as Hash and I select my keys. I do this in both links (in the main link and in the reference link)
2. I select the partition type as Round robin in my main link and I select the partition type as Entire in my reference link.
I always did this in my JOBs and they always works, but today I did this and the result was wrong. The job only worked after I set the option Perform sort in both links.
Someone could explain me why this is happening?
Thanks,
Igor Bastos Martins
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
-
- Participant
- Posts: 161
- Joined: Mon Mar 17, 2008 10:33 am
As Chandra said, DataStage will usually do necessary sorting and partitioning for you when using the "Auto"-partitioning feature and if you do not prevent adding sort-operators by parameter (APT_NO_SORT_INSERTION). As soon as You switch to manual mode, DataStage assumes that you take control of partitioning and sorting and won't override - unless it has knowledge of wrong sort-decisions (from an explicit sort in the same job that does not meet the requirements of a stage).
Join needs the data correctly partitioned and sorted from all input-links.
Join does not distinguish between stream and reference-data. All input-links are treated as stream. If your set of reference-data is small enough for Entire-Partioning, use Lookup instead of Join. Preloading reference-data to memory is more efficient in these cases and the lookup-stage gives you more options then join.
Join needs the data correctly partitioned and sorted from all input-links.
Join does not distinguish between stream and reference-data. All input-links are treated as stream. If your set of reference-data is small enough for Entire-Partioning, use Lookup instead of Join. Preloading reference-data to memory is more efficient in these cases and the lookup-stage gives you more options then join.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
Does auto partition will work without flaw for any volume of data?chandra.shekhar@tcs.com wrote:You don't need to explicitly partition and sort the data, choosing the auto partitioning method will ensure that partitioning and sorting is done.
Especially in JOIN / LOOKUP?
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
Personally, I prefer the method you are taking instead of using the auto partitioning that has been previously recommended. Yes, Auto normally works, but I like to be as explicit as possible. It ensures that the developer has considered sorting and partitioning thoroughly.
As to why your job suddenly is not producing accurate results; I could only guess, but from what you have said, it sounds like the data was not being sorted previously.
Here again, I like to be very explicit and will most often insert a sort stage prior to the join (or lookup) rather than using an "inline" sort. Partitioning alone is not sufficient. The data must be both sorted AND partitioned properly to produce accurate results.
As to why your job suddenly is not producing accurate results; I could only guess, but from what you have said, it sounds like the data was not being sorted previously.
Here again, I like to be very explicit and will most often insert a sort stage prior to the join (or lookup) rather than using an "inline" sort. Partitioning alone is not sufficient. The data must be both sorted AND partitioned properly to produce accurate results.
Bob
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
Yes, it works all the time.Does auto partition will work without flaw for any volume of data?
Especially in JOIN / LOOKUP?
In my project we process lot of data(my client is the biggest Bank of India ).
We always use auto partitioning and till now there's no discrepancy in the data.
There's NO harm in explicitly partitioning and sorting the data.
As babyon said you have to sort the data too, try it once.
Thanx and Regards,
ETL User
ETL User
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Once more with feeling...
(Auto) partitioning yields the following results.
(Auto) partitioning yields the following results.
- On a sequential-to-parallel link --> Round Robin
On a parallel-to-parallel link --> Same
(On a sequential-to-sequential link partitioning is not applicable.)
On the reference input link to a Lookup stage --> Entire
On an input link to a DB2 stage --> DB2
On an input link to a stage that requires key-partitioned input --> Hash
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.