Lookup reference link

just4u_sharath · Post by **just4u_sharath** » Thu Jan 31, 2008 1:28 am

Generally we use Entire partitioning for reference links and Auto for the input. But my question is: why do we need to partition the reference link at all? If all the data is loaded into memory, then there is no need to partition. Please clarify this for me.
Do we need to partition on an SMP system?

ray.wurlod · Post by **ray.wurlod** » Thu Jan 31, 2008 7:33 am

Of course you don't need to partition. Just don't expect correct results if you don't. Even in an SMP environment, you need Entire partitioning to make use of shared memory. If you choose some other partitioning algorithm (hopefully a key-based algorithm such as Hash or Modulus) then any given key value will occur on only one partition.
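
To illustrate the last point, here is a minimal Python sketch (not DataStage code) of a Modulus-style partitioner. The helper name `modulus_partition` and the sample keys are assumptions made up for illustration; the only claim is that a key-based algorithm sends each key value to exactly one partition.

```python
def modulus_partition(keys, num_partitions):
    """Assign each integer key to partition (key mod num_partitions)."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


keys = [1, 2, 3, 4, 5, 6, 7]
parts = modulus_partition(keys, 2)

# Count on how many partitions each key value appears.
occurrences = {k: sum(k in p for p in parts) for k in keys}
print(parts)        # even keys on partition 0, odd keys on partition 1
print(occurrences)  # every key appears on exactly one partition
```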

just4u_sharath · Post by **just4u_sharath** » Thu Jan 31, 2008 2:52 pm

ray.wurlod wrote: Of course you don't need to partition. Just don't expect correct results if you don't. Even in an SMP environment, you need Entire partitioning to make use of shared memory. If you choose some other partitioning algorithm (hopefully a key-based algorithm such as Hash or Modulus) then any given key value will occur on only one partition.

Now here's my point. When talking about the reference link of a Lookup stage, there is no partitioning. All the data is loaded into memory. I mean there is no point in talking about nodes for reference links, because the whole data set is loaded into memory. If the input of the Lookup stage has two partitions, each partition can go and look up the whole data in memory. What happens when we specify Entire for reference links, and what happens when we specify some other partitioning for the reference link? It's all only one memory. I can't understand. Please reply on this, because I have been wondering about it for many days. How do nodes relate to physical memory?

just4u_sharath · Post by **just4u_sharath** » Thu Jan 31, 2008 2:58 pm

ray.wurlod wrote: Of course you don't need to partition. Just don't expect correct results if you don't. Even in an SMP environment, you need Entire partitioning to make use of shared memory. If you choose some other partitioning algorithm (hopefully a key-based algorithm such as Hash or Modulus) then any given key value will occur on only one partition.

What happens if I use Entire partitioning for the input of the Lookup stage? Does it create duplicate copies?

ray.wurlod · Post by **ray.wurlod** » Thu Jan 31, 2008 4:42 pm

If you use Entire partitioning on the stream input of a Lookup stage and there is more than one processing node, then - yes - you will get duplicates. If there are N processing nodes you will get N copies of every row.
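
A rough Python sketch of that duplication, assuming each partition simply emits every stream row it receives. `entire_partition` is a hypothetical helper, not a DataStage API.

```python
def entire_partition(rows, num_partitions):
    """Entire partitioning: every partition receives all rows."""
    return [list(rows) for _ in range(num_partitions)]


stream = ["row-a", "row-b", "row-c"]
num_nodes = 3

stream_parts = entire_partition(stream, num_nodes)

# Each partition processes its own copy of the stream, so the
# collected output contains every row once per partition.
output = [row for part in stream_parts for row in part]
copies_per_row = {row: output.count(row) for row in stream}
print(len(output), copies_per_row)
```

With 3 nodes and 3 input rows, the output holds 9 rows, which is 3 copies of each.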

just4u_sharath · Post by **just4u_sharath** » Thu Jan 31, 2008 6:18 pm

ray.wurlod wrote: If you use Entire partitioning on the stream input of a Lookup stage and there is more than one processing node, then - yes - you will get duplicates. If there are N processing nodes you will get N copies of every row.

When talking about the reference link of a Lookup stage, there is no partitioning. All the data is loaded into memory. I mean there is no point in talking about nodes for reference links, because the whole data set is loaded into memory. If the input of the Lookup stage has two partitions, each partition can go and look up the whole data in memory. What happens when we specify Entire for reference links, and what happens when we specify some other partitioning for the reference link? It's all only one memory. I can't understand. Please reply on this, because I have been wondering about it for many days. How do nodes relate to physical memory?

ray.wurlod · Post by **ray.wurlod** » Thu Jan 31, 2008 10:03 pm

There is ALWAYS partitioning. Even if it's done logically using shared memory. Which only happens in the case of the Entire algorithm.

(Auto) will apply Entire if on the reference input link of a Lookup stage.

The reason that Entire is the default is that, no matter on which processing node a key from the stream input occurs, it is guaranteed to be able to find its "buddy" on the reference input, since a copy of its "buddy" occurs on every partition on the reference input.

If you identically partition the stream and reference inputs based on the lookup key(s), then the same applies - every key value from the stream input is guaranteed to find its "buddy" on the same partition.
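
The two safe arrangements, and one unsafe one, can be sketched in Python. This is only a toy model, under the assumption that each partition can see only its own slice of the reference data. All function names and the sample data are made up for illustration.

```python
def modulus_partition(rows, n, key=lambda r: r):
    """Key-based partitioning: row goes to partition key(row) mod n."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[key(r) % n].append(r)
    return parts


def round_robin_partition(rows, n):
    """Non-key-based partitioning: rows are dealt out in turn."""
    parts = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        parts[i % n].append(r)
    return parts


def entire_partition(rows, n):
    """Entire partitioning: every partition sees all rows."""
    return [list(rows) for _ in range(n)]


def count_matches(stream_parts, ref_parts):
    """Each stream partition looks up only in its own reference partition."""
    matched = 0
    for stream_rows, ref_rows in zip(stream_parts, ref_parts):
        ref_keys = {k for k, _ in ref_rows}
        matched += sum(1 for k in stream_rows if k in ref_keys)
    return matched


N = 2
stream = [1, 2, 3, 4, 5, 6, 7, 8]
reference = [(k, f"value-{k}") for k in range(1, 9)]

stream_rr = round_robin_partition(stream, N)
stream_mod = modulus_partition(stream, N)
ref_entire = entire_partition(reference, N)
ref_mod = modulus_partition(reference, N, key=lambda r: r[0])

matched_entire = count_matches(stream_rr, ref_entire)   # Entire reference
matched_identical = count_matches(stream_mod, ref_mod)  # same key-based scheme
matched_mismatched = count_matches(stream_rr, ref_mod)  # schemes differ
print(matched_entire, matched_identical, matched_mismatched)
```

In this toy data set the first two arrangements match all 8 stream rows, while the mismatched one matches none, because every stream key lands on the partition that does not hold its "buddy".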

On an SMP (share everything) environment the total demand for memory is the same as if Entire were used on the reference input, because Entire keeps a single copy of each row in shared memory.

On an MPP/grid (share nothing) environment, however, the "identically partitioned" approach will use rather less memory than the Entire algorithm, because the latter must have all rows available on all partitions (and physically available on all nodes).
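
As a back-of-the-envelope illustration of the memory argument, the sketch below counts how many reference rows are physically held in total. It assumes one partition per machine in the share-nothing case and ignores per-row overhead. The function name and the row counts are hypothetical.

```python
def reference_rows_held(total_rows, num_partitions, method, shared_memory):
    """Total reference rows physically resident across all partitions."""
    if method == "entire":
        # Shared memory: one physical copy that every partition points to.
        # Share nothing: one full physical copy per partition.
        return total_rows if shared_memory else total_rows * num_partitions
    if method == "key":
        # Key-based: each row lives on exactly one partition.
        return total_rows
    raise ValueError(f"unknown method: {method}")


rows, nodes = 1_000_000, 4
smp_entire = reference_rows_held(rows, nodes, "entire", shared_memory=True)
smp_key = reference_rows_held(rows, nodes, "key", shared_memory=True)
mpp_entire = reference_rows_held(rows, nodes, "entire", shared_memory=False)
mpp_key = reference_rows_held(rows, nodes, "key", shared_memory=False)
print(smp_entire, smp_key, mpp_entire, mpp_key)
```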

just4u_sharath · Post by **just4u_sharath** » Fri Feb 01, 2008 3:42 pm

ray.wurlod wrote: There is ALWAYS partitioning. Even if it's done logically using shared memory. Which only happens in the case of the Entire algorithm.

(Auto) will apply Entire if on the reference input link of a Lookup stage.

The reason that Entire is the default is that, no matter on which processing node a key from the stream input occurs, it is guaranteed to be able to find its "buddy" on the reference input, since a copy of its "buddy" occurs on every partition on the reference input.

If you identically partition the stream and reference inputs based on the lookup key(s), then the same applies - every key value from the stream input is guaranteed to find its "buddy" on the same partition.

"since a copy of its "buddy" occurs on every partition on the reference input". This is the statement I have questions about. When talking about the input stream, partitioning is fine. But when talking about the reference link, what is the point of partitioning? Partitioning of physical RAM: how is that possible?

ray.wurlod · Post by **ray.wurlod** » Fri Feb 01, 2008 4:59 pm

Each partition has a pointer to the string, which happens to be stored in the same location in shared memory in an SMP environment. Therefore the pointers on each partition happen to be identical.

In an MPP/grid environment each partition has a pointer to the string (in memory, in a virtual Data Set) in its own machine's memory.
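
A loose analogy in Python, not a description of DataStage internals: object references stand in for the pointers, and `copy.deepcopy` stands in for each machine holding its own physical copy. The table contents are made up.

```python
import copy

reference_table = {1: "alpha", 2: "beta", 3: "gamma"}
num_partitions = 3

# SMP-style: every partition holds a reference to the same object.
smp_partitions = [reference_table for _ in range(num_partitions)]

# MPP-style: every partition holds its own physical copy.
mpp_partitions = [copy.deepcopy(reference_table) for _ in range(num_partitions)]

smp_shares_one_copy = all(p is reference_table for p in smp_partitions)
mpp_distinct_copies = len({id(p) for p in mpp_partitions})

# Either way, each partition can find any key.
every_lookup_works = all(p[2] == "beta" for p in smp_partitions + mpp_partitions)
print(smp_shares_one_copy, mpp_distinct_copies, every_lookup_works)
```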