Lookup file sets

aakashahuja · Post by **aakashahuja** » Sun Dec 16, 2007 12:17 am

Hi,

I could not find answers to my questions elsewhere, so I am asking here.

I want to know more about lookup file sets. How do they actually work? Do they use hashing? Why is a lookup file set always created on the first node defined in the config file?

And if that is the case, what kind of parallel operation does it perform, given that it always gets created on one node? Any other relevant information would be welcome.

If these questions are answered somewhere, please point me to that doc or link.

Cheers
Aakash

ray.wurlod · Post by **ray.wurlod** » Sun Dec 16, 2007 2:39 pm

Warning - Technical Content
The reference input to a Lookup stage for a normal (not sparse) lookup causes a composite operator to be generated to perform two tasks, for which the operator names are LUT_CreateOp and LUT_ProcessOp.

LUT_CreateOp loads the virtual data set associated with the reference link into memory and builds an index (a hash table) through which that data set can be accessed by key. LUT_ProcessOp then performs the lookups against that index.
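
To make the idea of an in-memory index accessed by key more concrete, here is a minimal Python sketch of a keyed lookup over reference rows. It is a conceptual illustration only, not the engine's actual implementation: the function names `build_index` and `probe` and the sample rows are hypothetical, and the column names KeyCol and texta are borrowed from the descriptor posted later in this thread.

```python
from collections import defaultdict

def build_index(reference_rows, key):
    """Build a hash table mapping each key value to the reference rows that carry it."""
    index = defaultdict(list)
    for row in reference_rows:
        index[row[key]].append(row)
    return index

def probe(index, key_value):
    """Return the matching reference rows, or an empty list when the key is absent."""
    # .get() avoids creating an empty entry for keys that are not in the index.
    return index.get(key_value, [])

# Hypothetical reference data, loaded once before any lookups happen.
reference_rows = [
    {"KeyCol": 1, "texta": "alpha"},
    {"KeyCol": 2, "texta": "beta"},
    {"KeyCol": 2, "texta": "gamma"},
]
index = build_index(reference_rows, "KeyCol")

print(probe(index, 2))   # two matching rows
print(probe(index, 99))  # no match
```

The point of the sketch is the cost split: the index is built once, after which each probe is a single hash lookup. A prebuilt index, as with a Lookup File Set, would skip the build step at run time.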

If, however, the reference link is fed by a Lookup File Set stage, the index has already been created when the Lookup File Set was populated, so it can be moved into memory rather than built at run time. This ought to be faster.

Parallelism of Lookup File Set is handled in the same way as all other stage types, by the partitioning (when written) and execution mode properties, and possibly by the preserve partitioning setting of the upstream stage. However, if it is too small, it will be created on only one node. Too small may be either less than 32KB or less than 128KB (or other, depending upon certain environment variables). Orchestrate does not move data in smaller units than 32KB.

LUT = lookup table

aakashahuja · Post by **aakashahuja** » Wed Mar 05, 2008 11:39 pm

ray.wurlod wrote: "Too small may be either less than 32KB or less than 128KB (or other, depending upon certain environment variables)"

Can you please explain which environment variables those are?

P.S. The reason I have reopened this topic is that I just tried to write a lookup file set 52 MB in size, and it still got created only on the conductor node (the config file has 2 nodes).

Job design: Row Generator ----> Lookup File Set

Cheers
Aakash

ray.wurlod · Post by **ray.wurlod** » Wed Mar 05, 2008 11:52 pm

How do you know on which node(s) the Lookup File Set was created? The control file (xyz.fs) is possibly created on the conductor node, but how have you determined the location(s) of the data file(s)? The control file - sometimes called the descriptor file - is not the Lookup File Set itself.

ray.wurlod · Post by **ray.wurlod** » Wed Mar 05, 2008 11:58 pm

Warning - Technical Content (again)

The descriptor file for a File Set or a Lookup File Set has a name ending in ".fs". Nevertheless the descriptor file itself is a text file, and can be examined with a text editor to determine the location(s) of the data file(s) comprising the File Set.
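
As a rough sketch of reading a descriptor programmatically rather than in a text editor, the Python below pulls the node and path entries out of descriptor text. It assumes the layout seen in the descriptor posted later in this thread (a `--LFile` marker followed by `node:path` lines, then `--Schema`); other versions may be laid out differently, and `data_file_locations` is a hypothetical helper name.

```python
def data_file_locations(descriptor_text):
    """Return (node, path) pairs listed after a --LFile marker, up to the next '--' marker."""
    locations = []
    in_files = False
    for line in descriptor_text.splitlines():
        line = line.strip()
        if line.startswith("--"):
            # Any marker line switches sections; only --LFile opens a file list.
            in_files = (line == "--LFile")
            continue
        if in_files and line:
            node, _, path = line.partition(":")
            locations.append((node, path))
    return locations

# Sample text copied from the descriptor shown in this thread.
sample = """--Orchestrate File Set v2
--LFile
node1:/vol/DataStage/tmp/Datasets/lookuptable.20080306.aj0zfqc
--Schema
record {LUTVersion="1"}
( KeyCol: int32 {dropped};
  texta: string;
)
"""

for node, path in data_file_locations(sample):
    print(node, path)
```

Counting the distinct node names in the result shows how many nodes hold data files for the File Set.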

Premium members can read more about this here which is a prototype for something that will ultimately grace the DSXchange Learning Center (where the link in the document will work properly).

aakashahuja · Post by **aakashahuja** » Thu Mar 06, 2008 12:03 am

By examining the lookup file set descriptor file, I determined the nodes and the segment file location. Here is my descriptor file:

Code:

--Orchestrate File Set v2
--LFile
node1:/vol/DataStage/tmp/Datasets/lookuptable.20080306.aj0zfqc
--Schema
record {LUTVersion="1"}
( KeyCol: int32 {dropped};
  texta: string;
)

As you can see, it is created on one node only, even though:
1. my config file has 2 nodes defined.
2. the data is about 52 MB in size.

Cheers
Aakash

ray.wurlod · Post by **ray.wurlod** » Thu Mar 06, 2008 2:23 am

Please report the result of the following command:

Code:

ls -l /vol/DataStage/tmp/Datasets/lookuptable.20080306.aj0zfqc

aakashahuja · Post by **aakashahuja** » Thu Mar 06, 2008 2:29 am

Here it is:

Code:

-rwxrwx---   1 myuser mygroup   54553192 Mar 06 05:47 /vol/DataStage/tmp/Datasets/lookuptable.20080306.aj0zfqc
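
For scale, the size in that listing can be checked against the 32KB and 128KB figures mentioned earlier in the thread. This is plain arithmetic in Python; treating KB as 1024 bytes is an assumption.

```python
size_bytes = 54553192   # size column from the ls -l output above
KB = 1024               # assumption: KB means 1024 bytes

size_mib = size_bytes / (KB * KB)
print(f"{size_mib:.2f} MiB")

# Compare against the two "too small" thresholds quoted in the thread.
print(size_bytes > 32 * KB, size_bytes > 128 * KB)
```

The single data file is roughly 52 MiB, far above either threshold.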

rony_daniel · Post by **rony_daniel** » Thu May 01, 2008 3:00 pm

Hi,

What is the best partition type to use when a lookup file set is created with a key?

By default, the partition type set when we drag and drop this stage into a job is "Entire". Will Entire partitioning cause the data to be written multiple times, once per node, and hence occupy a huge amount of space on the Unix box?
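
To illustrate what the question is getting at: Entire partitioning delivers every row to every partition, whereas a keyed (hash) partitioner sends each row to exactly one partition. The Python sketch below only counts rows per partition for two nodes. The function names are hypothetical, a simple modulo stands in for a real hash function, and the sketch says nothing about how the engine actually stores a Lookup File Set on disk.

```python
def entire_partition(rows, num_partitions):
    """Every partition receives a full copy of the rows."""
    return [list(rows) for _ in range(num_partitions)]

def hash_partition(rows, num_partitions, key):
    """Each row goes to exactly one partition, chosen from its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Modulo on an integer key is a stand-in for a real hash function.
        partitions[row[key] % num_partitions].append(row)
    return partitions

# Hypothetical data: ten rows with integer keys 0..9, spread over two nodes.
rows = [{"KeyCol": k, "texta": f"row{k}"} for k in range(10)]
entire = entire_partition(rows, 2)
hashed = hash_partition(rows, 2, "KeyCol")

print([len(p) for p in entire])
print([len(p) for p in hashed])
```

Under these assumptions the total row count held across partitions grows with the number of nodes for Entire and stays constant for hash partitioning.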