Lookup design and performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

just4geeks
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Lookup design and performance

Post by just4geeks »

Hi

I have a scenario in which we need to do a series of lookups on a source file. Some of them are conditional and some of them are straightforward.
Here are more details of the scenario:
Source file name: SrcFile1 (40 million records)
Lookup file names: LkpA, LkpB, LkpC, LkpD, LkpE (each contains between 100,000 and 500,000 records)

Rules:
1. Perform LkpA on each record of SrcFile1. If LkpA fails for certain records, write those records to an exception file (ExceptionFile1).
2. Perform LkpB on those records of SrcFile1 which passed LkpA. If LkpB fails for certain records, perform LkpC on the failed source records. If LkpC also fails, write all failed source records to an exception file (ExceptionFile2).
3. Perform LkpD on those records which passed rule 2 above. If LkpD fails, write those records to an exception file (ExceptionFile3).
4. LkpD contains a non-key field (LkpD.Col1). After step 3, we get Col1 from LkpD, which is used to look up against LkpE (Col1 = LkpE.Col1). Perform LkpE on those records which passed rule 3 above. If LkpE fails, write those records to an exception file (ExceptionFile4). A rough sketch of this flow is shown below.
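
To make the routing concrete, here is a rough sketch of the intended flow in plain Python (illustrative only, not DataStage; the dicts stand in for the lookup files, and the key names "key_a" through "key_d" are made up, since only Col1 is named above):

# Illustrative Python sketch of the rule flow above (not DataStage).
# lkp_a .. lkp_e stand in for the lookup files; the key names
# "key_a" .. "key_d" are assumptions, since only Col1 is named above.
def route(src_records, lkp_a, lkp_b, lkp_c, lkp_d, lkp_e):
    out, exc1, exc2, exc3, exc4 = [], [], [], [], []
    for rec in src_records:
        if rec["key_a"] not in lkp_a:                                  # Rule 1: LkpA fails
            exc1.append(rec)
        elif rec["key_b"] not in lkp_b and rec["key_c"] not in lkp_c:  # Rule 2: LkpB, then LkpC, both fail
            exc2.append(rec)
        elif rec["key_d"] not in lkp_d:                                # Rule 3: LkpD fails
            exc3.append(rec)
        elif lkp_d[rec["key_d"]]["Col1"] not in lkp_e:                 # Rule 4: LkpE keyed on LkpD.Col1 fails
            exc4.append(rec)
        else:
            out.append(rec)
    return out, exc1, exc2, exc3, exc4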


I thought of 3 ways to do it:

1. Perform all lookups in one job, using a separate Transformer stage for each lookup, with each stage writing its own exception file.
2. Perform all lookups in one job, using only one Transformer stage to perform all the lookups and write out the exceptions.
3. Design a separate job for each lookup.

My questions are:
1. Is option 2 possible? Can I use just one Transformer to implement the whole logic, particularly rules 2 and 4?
2. Which design gives the best performance? I am particularly interested in the difference between options 1 and 2. I know option 3 is not optimal because it requires landing the data in each job. In option 1 we can take advantage of in-process row buffering between stages, but I am not sure whether we can do that in option 2.

Please share your experience and suggestions.
I will really appreciate them.

Thanks in advance!
Attitude is everything....
mamir_30
Participant
Posts: 5
Joined: Mon Oct 30, 2006 12:50 pm

Re: Lookup design and performance

Post by mamir_30 »

I would suggest trying a Join stage instead of a Lookup. When your input file is 40 million rows and the lookups are half a million, a Join stage might be a better option.
Newbie
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Server job.
-craig

"You can never have too many knives" -- Logan Nine Fingers
just4geeks
Premium Member
Posts: 644
Joined: Sat Aug 26, 2006 3:59 pm
Location: Mclean, VA

Post by just4geeks »

As Craig said, this is a server job, so we can't use a Join stage.
The lookup files are well within the hashed file size limit, so using hashed files for the lookups is not a problem.

Thanks!
Attitude is everything....
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Moderator: please move to server forum

Option 1 can be done, but Option 2 will give better throughput particularly if inter-process row buffering is enabled.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mfavero
Premium Member
Posts: 45
Joined: Thu Jan 16, 2003 1:20 pm
Location: Minneapolis

Post by mfavero »

If all of the lookup keys are on the source file, you can do it all in one transformer. If any of the lookups need to build a key from columns returned by a previous lookup, you need more than one transformer.

You may also find it easier to have multiple transformers for rejecting rows and deciding when to do a lookup. The difference in performance of using 1 Transformer or 4 is minimal. I think this is what you want. Also, load the hashed files into memory cache if they fit.

        LkpA     LkpB LkpC      LkpD       LkpE
          |          \ /          |          |
SRC >>>> T1 >>>>>>>> T2 >>>>>>>> T3 >>>>>>> T4 >>>> OUT
          |           |           |          |
        REJ1        REJ2        REJ3       REJ4

(sorry for the weird notation - never done that before)
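
If it helps, here is roughly what that chain amounts to in plain Python (illustrative only, not DataStage; the dicts play the part of memory-cached hashed files, and the key names and sample rows are made up):

# Rough Python stand-in for the T1..T4 chain above (not DataStage).
# Each "transformer" splits its input into rows that pass the lookup
# and rows that go to the matching reject file.
def transformer(rows, passes):
    kept = [r for r in rows if passes(r)]
    rejects = [r for r in rows if not passes(r)]
    return kept, rejects

# Tiny made-up lookups; in the real job these are memory-cached hashed files.
lkp_a = {"a1"}
lkp_b = {"b1"}
lkp_c = {"c1"}
lkp_d = {"d1": {"Col1": "e1"}}
lkp_e = {"e1"}
src = [{"ka": "a1", "kb": "b1", "kc": "x", "kd": "d1"},
       {"ka": "a2", "kb": "b1", "kc": "x", "kd": "d1"}]

t1, rej1 = transformer(src, lambda r: r["ka"] in lkp_a)                     # LkpA
t2, rej2 = transformer(t1, lambda r: r["kb"] in lkp_b or r["kc"] in lkp_c)  # LkpB, then LkpC
t3, rej3 = transformer(t2, lambda r: r["kd"] in lkp_d)                      # LkpD
t4, rej4 = transformer(t3, lambda r: lkp_d[r["kd"]]["Col1"] in lkp_e)       # LkpE via LkpD.Col1
out = t4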
Michael Favero

2852 Humboldt Ave So
Minneapolis, MN 55408
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

mfavero wrote: The difference in performance of using 1 Transformer or 4 is minimal.
However, 4 helps support his "conditional" lookup model and can help performance if large numbers of lookups fail. With everything in one transformer, all lookups would be done unconditionally. With multiple transformers, you can do a singleton lookup and then constrain the failures out of the flow to the next transformer and lookup. Lather, rinse, repeat. The impact of doing that could be minimal or it could be significant.
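
Back-of-the-envelope, that trade-off looks something like this (illustrative Python only; the failure rates are invented just to show the shape of it):

# Invented numbers: with one transformer every lookup is probed for every
# row; with chained transformers each lookup only sees the survivors.
rows = 40_000_000
fail_rates = [0.10, 0.05, 0.02, 0.01]       # hypothetical failure rate per lookup

probes_single = rows * len(fail_rates)      # one transformer: 4 x 40M probes

probes_chained, surviving = 0, rows         # four transformers with constraints
for rate in fail_rates:
    probes_chained += surviving
    surviving = int(surviving * (1 - rate))

print(probes_single, probes_chained)        # the gap grows as the failure rates grow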
-craig

"You can never have too many knives" -- Logan Nine Fingers