Hi
I have a scenario in which we need to do a series of lookups on a source file. Some of them are conditional and some of them are straightforward.
Here is more details of scenario:
Source file name: SrcFile1 ( 40 Million records)
Lookup File Names: LkpA, LkpB, LkpC, LkpD, LkpE ( each contains between 100,000 to 500000 records)
Rules:
1. Perform LkupA on each record of SrcFile1. if Lkup A fails on certain records, write those records to an exception file ( ExceptionFile1)
2. Perform LkpB on those record of SrcFile1 which passed LkpA. If LkpB fails on certain records, perform Lookup LkpC on failed source records.If LkpC also fails, write out all failed source records in an exception file ( ExceptionFile2).
3. Perform LkpD in those records which passed condition 2 above. If LkpD fails, write it to an exception file ( ExceptionFile3)
4. LkpD file contains a non key fields ( LkpD.Col1). After performing step 3, we will get Col1 from LkpD which is used to perform looup in LkpE( Col1=LkpE.col1). Perform LkpE in those records which passed condition 3 above. If LkpE fails, write it to an exception file (ExceptionFile4).
I thought of 3 ways to do it:
1. Perform all lookups in one job. Separate transformer Stages will be used to do each lookup and write an exception file in each stage.
2. Perform all lookups in one job. only One transformer stage will be used to perform all lookups and write out exceptions.
3. Design separate jobs for each lookup.
My questions are:
1. Is it possible to do option 2. Can I use just 1 transformer to implement whole logic particularly rule#2and 4?
2. which design is good for optimum performance. Particularly I would like to know between option 1 and 2. I know design option 3 is not optimal as it requires landing of data at each job. In option 1, we can take advantage of inprocess row buffering between stages. But I am not sure if we can do that in option2.
Please share your experience and suggestions.
I will really appreciate them.
Thanks in advance!
Lookup design and performance
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 644
- Joined: Sat Aug 26, 2006 3:59 pm
- Location: Mclean, VA
Lookup design and performance
Attitude is everything....
Re: Lookup design and performance
I would say try to use a Join stage instead of lookup. When your input file is 40 mil and lookup is half a mil, Join stage might be a better option.
Newbie
-
- Premium Member
- Posts: 644
- Joined: Sat Aug 26, 2006 3:59 pm
- Location: Mclean, VA
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
If all of the lookup keys are on the source file you can do it all in one transformer. If any of the lookups need to build a key with columns from a previous lookup you need to have more than one transformer.
You may also find it easier to have multiple transformers for rejecting rows and deciding when to do a lookup. The difference in performance of using 1 Transformer or 4 is minimal. I think this is what you want. Also load the Hash into memory cache if it fits.
...............LA........LB...LC........LD.......LE
................|..........|......|..........|..........|
SRC >>> T1>>>|T2-----|>>>T3>>>T4>>>>>>OUT
................:...............:............:..........:
..............REJ1..........REJ2......REJ3....REJ4
(sorry for the wierd notation - never done that before)
You may also find it easier to have multiple transformers for rejecting rows and deciding when to do a lookup. The difference in performance of using 1 Transformer or 4 is minimal. I think this is what you want. Also load the Hash into memory cache if it fits.
...............LA........LB...LC........LD.......LE
................|..........|......|..........|..........|
SRC >>> T1>>>|T2-----|>>>T3>>>T4>>>>>>OUT
................:...............:............:..........:
..............REJ1..........REJ2......REJ3....REJ4
(sorry for the wierd notation - never done that before)
Michael Favero
2852 Humboldt Ave So
Minneapolis, MN 55408
2852 Humboldt Ave So
Minneapolis, MN 55408
However, 4 helps support his "conditional" lookup model and can help performance if large numbers of lookups fail. With everything in one transformer, all lookups would be done unconditionally. With multiple transformers, you can do a singleton lookup and then constrain the failures out of the flow to the next transformer and lookup. Lather, rinse, repeat. The impact of doing that could be minimal or it could be significant.mfavero wrote:The difference in performance of using 1 Transformer or 4 is minimal.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers