Page 1 of 1

Lookup design and performance

Posted: Mon Jun 01, 2009 11:21 am
by just4geeks
Hi

I have a scenario in which we need to do a series of lookups on a source file. Some of them are conditional and some of them are straightforward.
Here is more details of scenario:
Source file name: SrcFile1 ( 40 Million records)
Lookup File Names: LkpA, LkpB, LkpC, LkpD, LkpE ( each contains between 100,000 to 500000 records)

Rules:
1. Perform LkupA on each record of SrcFile1. if Lkup A fails on certain records, write those records to an exception file ( ExceptionFile1)
2. Perform LkpB on those record of SrcFile1 which passed LkpA. If LkpB fails on certain records, perform Lookup LkpC on failed source records.If LkpC also fails, write out all failed source records in an exception file ( ExceptionFile2).
3. Perform LkpD in those records which passed condition 2 above. If LkpD fails, write it to an exception file ( ExceptionFile3)
4. LkpD file contains a non key fields ( LkpD.Col1). After performing step 3, we will get Col1 from LkpD which is used to perform looup in LkpE( Col1=LkpE.col1). Perform LkpE in those records which passed condition 3 above. If LkpE fails, write it to an exception file (ExceptionFile4).


I thought of 3 ways to do it:

1. Perform all lookups in one job. Separate transformer Stages will be used to do each lookup and write an exception file in each stage.
2. Perform all lookups in one job. only One transformer stage will be used to perform all lookups and write out exceptions.
3. Design separate jobs for each lookup.

My questions are:
1. Is it possible to do option 2. Can I use just 1 transformer to implement whole logic particularly rule#2and 4?
2. which design is good for optimum performance. Particularly I would like to know between option 1 and 2. I know design option 3 is not optimal as it requires landing of data at each job. In option 1, we can take advantage of inprocess row buffering between stages. But I am not sure if we can do that in option2.

Please share your experience and suggestions.
I will really appreciate them.

Thanks in advance!

Re: Lookup design and performance

Posted: Mon Jun 01, 2009 1:08 pm
by mamir_30
I would say try to use a Join stage instead of lookup. When your input file is 40 mil and lookup is half a mil, Join stage might be a better option.

Posted: Mon Jun 01, 2009 1:22 pm
by chulett
Server job.

Posted: Mon Jun 01, 2009 1:47 pm
by just4geeks
As Craig said, this is a server job so we can't use Join Stage.
The lookup files are well within a hashed file size limit so using hashed file for lookup is not a problem.

Thanks!

Posted: Mon Jun 01, 2009 4:31 pm
by ray.wurlod
Moderator: please move to server forum

Option 1 can be done, but Option 2 will give better throughput particularly if inter-process row buffering is enabled.

Posted: Tue Jun 02, 2009 2:28 pm
by mfavero
If all of the lookup keys are on the source file you can do it all in one transformer. If any of the lookups need to build a key with columns from a previous lookup you need to have more than one transformer.

You may also find it easier to have multiple transformers for rejecting rows and deciding when to do a lookup. The difference in performance of using 1 Transformer or 4 is minimal. I think this is what you want. Also load the Hash into memory cache if it fits.

...............LA........LB...LC........LD.......LE
................|..........|......|..........|..........|
SRC >>> T1>>>|T2-----|>>>T3>>>T4>>>>>>OUT
................:...............:............:..........:
..............REJ1..........REJ2......REJ3....REJ4

(sorry for the wierd notation - never done that before)

Posted: Tue Jun 02, 2009 2:45 pm
by chulett
mfavero wrote:The difference in performance of using 1 Transformer or 4 is minimal.
However, 4 helps support his "conditional" lookup model and can help performance if large numbers of lookups fail. With everything in one transformer, all lookups would be done unconditionally. With multiple transformers, you can do a singleton lookup and then constrain the failures out of the flow to the next transformer and lookup. Lather, rinse, repeat. The impact of doing that could be minimal or it could be significant.