Lookup reference link

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

just4u_sharath
Premium Member
Posts: 236
Joined: Sun Apr 01, 2007 7:41 am
Location: Michigan

Lookup reference link

Post by just4u_sharath »

Generally we use Entire partitioning for reference links and Auto for the stream input. But my question is: why do we need to partition the reference link at all? If all the reference data is loaded into memory, there seems to be no need to partition it. Please clarify this.
Do we need to partition on an SMP system?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Of course you don't need to partition. Just don't expect correct results if you don't. Even in an SMP environment, you need Entire partitioning to make use of shared memory. If you choose some other partitioning algorithm (hopefully a key-based algorithm such as Hash or Modulus) then any given key value will occur on only one partition.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
just4u_sharath
Premium Member
Posts: 236
Joined: Sun Apr 01, 2007 7:41 am
Location: Michigan

lookup stage

Post by just4u_sharath »

ray.wurlod wrote:Of course you don't need to partition. Just don't expect correct results if you don't. Even in an SMP environment, you need Entire partitioning to make use of shared memory. If you choose some other partitioning algorithm (hopefully a key-based algorithm such as Hash or Modulus) then any given key value will occur on only one partition.
Now here's my point. When talking about the reference link of a Lookup stage, there is no partitioning: all the data is loaded into memory. I mean there is no point in talking about nodes for reference links, because the whole data set is loaded into memory. If the stream input of the Lookup stage has two partitions, each partition can look up the whole data set in memory. What happens when we specify Entire for the reference link, and what happens when we specify some other partitioning method? It is all one memory. I can't understand this. Please reply, because I have been wondering about it for many days. My confusion is about physical memory versus nodes.
just4u_sharath
Premium Member
Posts: 236
Joined: Sun Apr 01, 2007 7:41 am
Location: Michigan

lookup stage

Post by just4u_sharath »

ray.wurlod wrote:Of course you don't need to partition. Just don't expect correct results if you don't. Even in an SMP environment, you need Entire partitioning to make use of shared memory. If you choose some other partitioning algorithm (hopefully a key-based algorithm such as Hash or Modulus) then any given key value will occur on only one partition.
What happens if I use Entire partitioning for the stream input of the Lookup stage? Does it create duplicate copies?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you use Entire partitioning on the stream input of a Lookup stage and there is more than one processing node, then - yes - you will get duplicates. If there are N processing nodes you will get N copies of every row.
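As a rough illustration of the row multiplication described above, here is a minimal Python sketch. It is not DataStage code; `partition_entire` and the sample rows are hypothetical names invented for this example, and the function only models the idea that Entire sends every row to every partition.

```python
def partition_entire(rows, num_partitions):
    """Model Entire partitioning: every partition receives every row."""
    return [list(rows) for _ in range(num_partitions)]


# Hypothetical stream input of three rows on a four-node configuration.
stream_rows = [{"key": 1}, {"key": 2}, {"key": 3}]
partitions = partition_entire(stream_rows, num_partitions=4)

# Each partition processes its own full copy, so the stage emits N copies of each row.
total_rows = sum(len(p) for p in partitions)
print(total_rows)
```

With three input rows and four partitions the stage would process twelve rows in this model, which is why Entire belongs on the reference link rather than the stream link.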
just4u_sharath
Premium Member
Posts: 236
Joined: Sun Apr 01, 2007 7:41 am
Location: Michigan

Post by just4u_sharath »

ray.wurlod wrote:If you use Entire partitioning on the stream input of a Lookup stage and there is more than one processing node, then - yes - you will get duplicates. If there are N processing nodes you will get N copies of every row.
When talking about the reference link of a Lookup stage, there is no partitioning: all the data is loaded into memory. I mean there is no point in talking about nodes for reference links, because the whole data set is loaded into memory. If the stream input of the Lookup stage has two partitions, each partition can look up the whole data set in memory. What happens when we specify Entire for the reference link, and what happens when we specify some other partitioning method? It is all one memory. I can't understand this. Please reply, because I have been wondering about it for many days. My confusion is about physical memory versus nodes.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There is ALWAYS partitioning, even if it's done logically using shared memory, which only happens in the case of the Entire algorithm.

(Auto) will apply Entire on the reference input link of a Lookup stage.

The reason that Entire is the default is that, no matter on which processing node a key from the stream input occurs, it is guaranteed to be able to find its "buddy" on the reference input, since a copy of its "buddy" occurs on every partition on the reference input.

If you identically partition the stream and reference inputs based on the lookup key(s), then the same applies - every key value from the stream input is guaranteed to find its "buddy" on the same partition.
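The two safe arrangements above (Entire on the reference input, or identical key-based partitioning on both inputs) can be sketched in plain Python. This is a toy model under stated assumptions, not how the parallel engine is implemented: the function names, the integer keys, and the use of `key % n` to stand in for the Modulus algorithm are all invented for the example.

```python
def modulus_partition(rows, n):
    """Key-based partitioning: a given key always lands on the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row["key"] % n].append(row)
    return parts


def round_robin_partition(rows, n):
    """Non-key-based partitioning: rows are dealt out in arrival order."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def entire_partition(rows, n):
    """Entire: every partition sees every reference row."""
    return [list(rows) for _ in range(n)]


def lookup(stream_parts, ref_parts):
    """Each stream partition can only search the reference rows on its own partition."""
    matched = 0
    for s_part, r_part in zip(stream_parts, ref_parts):
        ref_keys = {r["key"] for r in r_part}
        matched += sum(1 for s in s_part if s["key"] in ref_keys)
    return matched


stream = [{"key": k} for k in (1, 0, 2, 3, 5, 4, 6, 7)]
reference = [{"key": k} for k in range(8)]
n = 2

entire_ref = lookup(round_robin_partition(stream, n), entire_partition(reference, n))
identical = lookup(modulus_partition(stream, n), modulus_partition(reference, n))
mismatched = lookup(round_robin_partition(stream, n), modulus_partition(reference, n))
print(entire_ref, identical, mismatched)
```

In this model the first two arrangements match all eight stream rows, while the mismatched one loses matches because some keys sit on a different partition from their "buddy".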

On an SMP (share everything) environment the total demand for memory is the same as if Entire were used on the reference input, because Entire keeps a single copy of each row in shared memory.

On an MPP/grid (share nothing) environment, however, the "identically partitioned" approach will use rather less memory than the Entire algorithm, because the latter must have all rows available on all partitions (and physically available on all nodes).
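The memory comparison can be put into a small back-of-the-envelope calculation. The sketch below is only an approximation of the reasoning in this post: it counts reference rows held in memory across the whole system, and `reference_rows_held` with its parameters is a hypothetical helper, not a DataStage facility.

```python
def reference_rows_held(total_rows, machines, method, shared_memory):
    """Approximate count of reference rows resident in memory across the system.

    method: "entire" or "key" (key-based, identically partitioned with the stream).
    shared_memory: True for SMP (share everything), False for MPP/grid (share nothing).
    """
    if method == "entire":
        # SMP keeps one shared copy; MPP needs a full copy on every machine.
        return total_rows if shared_memory else total_rows * machines
    if method == "key":
        # Each row lives on exactly one partition, whatever the architecture.
        return total_rows
    raise ValueError(f"unknown method: {method}")


rows = 1_000_000
smp_entire = reference_rows_held(rows, machines=1, method="entire", shared_memory=True)
smp_keyed = reference_rows_held(rows, machines=1, method="key", shared_memory=True)
mpp_entire = reference_rows_held(rows, machines=4, method="entire", shared_memory=False)
mpp_keyed = reference_rows_held(rows, machines=4, method="key", shared_memory=False)
print(smp_entire, smp_keyed, mpp_entire, mpp_keyed)
```

Under these assumptions the two approaches cost the same on SMP, and Entire costs one full copy per machine on MPP/grid.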
just4u_sharath
Premium Member
Posts: 236
Joined: Sun Apr 01, 2007 7:41 am
Location: Michigan

Post by just4u_sharath »

ray.wurlod wrote:There is ALWAYS partitioning. Even if it's done logically using shared memory. Which only happens in the case of the Entire algorithm.

(Auto) will apply Entire if on the reference input link of a Lookup stage.

The reason that Entire is the default is that, no matter on which processing node a key from the stream input occurs, it is guaranteed to be able to find its "buddy" on the reference input, since a copy of its "buddy" occurs on every partition on the reference input.

If you identically partition the stream and reference inputs based on the lookup key(s), then the same applies - every key value from the stream input is guaranteed to find its "buddy" on the same partition.


"since a copy of its "buddy" occurs on every partition on the reference input". This is the statement I have questions about. When talking about the input stream, partitioning makes sense. But when talking about the reference link, what is the point of partitioning? Partitioning in physical RAM: how is that possible?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Each partition has a pointer to the string, which happens to be stored in the same location in shared memory in an SMP environment. Therefore the pointers on each partition happen to be identical.

In an MPP/grid environment each partition has a pointer to the string (in memory, in a virtual Data Set) in its own machine's memory.
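A loose analogy in Python may help with the pointer idea. This is only an analogy, assuming a dictionary stands in for the reference data and `copy.deepcopy` stands in for loading a separate copy on another machine; none of the names come from DataStage.

```python
import copy

# Hypothetical reference data.
reference_table = {101: "Sydney", 102: "Michigan"}

# SMP: every partition holds a reference to the same object in shared memory.
smp_partitions = [reference_table for _ in range(4)]

# MPP/grid: every machine holds its own physical copy with the same contents.
mpp_partitions = [copy.deepcopy(reference_table) for _ in range(4)]

smp_same_object = all(p is smp_partitions[0] for p in smp_partitions)
mpp_same_object = all(p is mpp_partitions[0] for p in mpp_partitions)
mpp_same_content = all(p == mpp_partitions[0] for p in mpp_partitions)
print(smp_same_object, mpp_same_object, mpp_same_content)
```

In both cases every partition can find every key; the difference is whether the partitions point at one copy or at several equal copies.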