Lookup and Lookup File Set

ketanshah123 · Post by **ketanshah123** » Sun May 25, 2008 11:19 pm

If we use a Lookup stage, the reference data is loaded into memory for the lookup. Just a query here: if we use a Lookup File Set as the reference for a Lookup stage, would it still add to the memory overhead?

Thanks in advance.

devidotcom · Post by **devidotcom** » Sun May 25, 2008 11:20 pm

Yes it will.

ketanshah123 · Post by **ketanshah123** » Sun May 25, 2008 11:24 pm

devidotcom wrote: Yes it will.

Thanks for the reply, but can you explain how?

devidotcom · Post by **devidotcom** » Sun May 25, 2008 11:31 pm

From one of Ray's posts:

--------------------------------------------------------------------------------

Warning - Technical Content
The reference input to a Lookup stage for a normal (not sparse) lookup causes a composite operator to be generated to perform two tasks, for which the operator names are LUT_CreateOp and LUT_ProcessOp.

LUT_CreateOp loads the virtual data set associated with the reference link into memory and builds an index (a hash table) through which that data set can be accessed by key. LUT_ProcessOp then performs the lookups against it.

If, however, the reference link is fed by a Lookup File Set stage, the index has already been created when the Lookup File Set was populated, so it can be moved into memory rather than built at run time. This ought to be faster.

Parallelism of Lookup File Set is handled in the same way as all other stage types, by the partitioning (when written) and execution mode properties, and possibly by the preserve partitioning setting of the upstream stage. However, if it is too small, it will be created on only one node. Too small may be either less than 32KB or less than 128KB (or other, depending upon certain environment variables). Orchestrate does not move data in smaller units than 32KB.

LUT = lookup table

So the Lookup File Set is moved into memory.
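To make the distinction in the quoted post concrete, here is a minimal Python sketch of the two paths: building the hash table at run time versus loading one that was written out earlier. It is only an analogy, not how DataStage stores or loads a Lookup File Set; the function names, the `reference.idx` file name, and the use of `pickle` as the on-disk format are all assumptions made for illustration. Either way, the reference data ends up in memory.

```python
import pickle
import tempfile
from pathlib import Path


def build_index(rows, key):
    """Build a hash table (a dict) over the reference rows, keyed on `key`."""
    return {row[key]: row for row in rows}


def save_index(index, path):
    """Persist the already-built index (stand-in for populating a Lookup File Set)."""
    with open(path, "wb") as fh:
        pickle.dump(index, fh)


def load_index(path):
    """Bring a previously built index back into memory without rescanning the rows."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def lookup(index, stream_rows, key):
    """Pair each stream row with its reference row; unmatched rows get None."""
    return [(row, index.get(row[key])) for row in stream_rows]


# Hypothetical reference and stream data.
reference = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
stream = [{"id": 2}, {"id": 3}]

# Path 1: index built at run time from the reference rows.
runtime_index = build_index(reference, "id")

# Path 2: index built once, written out, and later loaded as-is.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reference.idx"
    save_index(build_index(reference, "id"), path)
    prebuilt_index = load_index(path)

result_runtime = lookup(runtime_index, stream, "id")
result_prebuilt = lookup(prebuilt_index, stream, "id")
```

Both paths give the same lookup results and both hold the whole index in memory; the only difference is whether the build step happens inside the job or beforehand.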

ketanshah123 · Post by **ketanshah123** » Mon May 26, 2008 12:08 am

devidotcom wrote: From one of Ray's posts..
Thank you very much.

ray.wurlod · Post by **ray.wurlod** » Mon May 26, 2008 12:28 am

It is loaded into memory, but I take issue with the word "overhead".

As I originally posted, every non-sparse lookup reference link involves a virtual Data Set, which is therefore loaded into memory. So using a Lookup File Set as the source does not impose any additional overhead compared to other stage types. Indeed, since its index (hash table) has already been created, it is likely to be more efficient than most other stage types when servicing a reference input link to a Lookup stage.

abc123 · Post by **abc123** » Fri May 30, 2008 10:40 pm

Ray, if a Lookup File Set is replaced by a Data Set, and hash partitioning was used to write to the Data Set, wouldn't lookup performance be the same in both cases, with the added advantage that you can view the data in a Data Set whereas you cannot in a Lookup File Set?

ray.wurlod · Post by **ray.wurlod** » Sat May 31, 2008 2:14 am

Marginal. With a Data Set the index (hash table) has to be built; with a Lookup File Set the index already exists and only needs to be moved into memory. For large reference sets the difference is negligible; for smaller reference sets it will be noticeable.

The ability to view data is purely cosmetic.