Lookup design and performance
Posted: Mon Jun 01, 2009 11:21 am
Hi
I have a scenario in which we need to do a series of lookups on a source file. Some of them are conditional and some of them are straightforward.
Here is more details of scenario:
Source file name: SrcFile1 ( 40 Million records)
Lookup File Names: LkpA, LkpB, LkpC, LkpD, LkpE ( each contains between 100,000 to 500000 records)
Rules:
1. Perform LkupA on each record of SrcFile1. if Lkup A fails on certain records, write those records to an exception file ( ExceptionFile1)
2. Perform LkpB on those record of SrcFile1 which passed LkpA. If LkpB fails on certain records, perform Lookup LkpC on failed source records.If LkpC also fails, write out all failed source records in an exception file ( ExceptionFile2).
3. Perform LkpD in those records which passed condition 2 above. If LkpD fails, write it to an exception file ( ExceptionFile3)
4. LkpD file contains a non key fields ( LkpD.Col1). After performing step 3, we will get Col1 from LkpD which is used to perform looup in LkpE( Col1=LkpE.col1). Perform LkpE in those records which passed condition 3 above. If LkpE fails, write it to an exception file (ExceptionFile4).
I thought of 3 ways to do it:
1. Perform all lookups in one job. Separate transformer Stages will be used to do each lookup and write an exception file in each stage.
2. Perform all lookups in one job. only One transformer stage will be used to perform all lookups and write out exceptions.
3. Design separate jobs for each lookup.
My questions are:
1. Is it possible to do option 2. Can I use just 1 transformer to implement whole logic particularly rule#2and 4?
2. which design is good for optimum performance. Particularly I would like to know between option 1 and 2. I know design option 3 is not optimal as it requires landing of data at each job. In option 1, we can take advantage of inprocess row buffering between stages. But I am not sure if we can do that in option2.
Please share your experience and suggestions.
I will really appreciate them.
Thanks in advance!
I have a scenario in which we need to do a series of lookups on a source file. Some of them are conditional and some of them are straightforward.
Here is more details of scenario:
Source file name: SrcFile1 ( 40 Million records)
Lookup File Names: LkpA, LkpB, LkpC, LkpD, LkpE ( each contains between 100,000 to 500000 records)
Rules:
1. Perform LkupA on each record of SrcFile1. if Lkup A fails on certain records, write those records to an exception file ( ExceptionFile1)
2. Perform LkpB on those record of SrcFile1 which passed LkpA. If LkpB fails on certain records, perform Lookup LkpC on failed source records.If LkpC also fails, write out all failed source records in an exception file ( ExceptionFile2).
3. Perform LkpD in those records which passed condition 2 above. If LkpD fails, write it to an exception file ( ExceptionFile3)
4. LkpD file contains a non key fields ( LkpD.Col1). After performing step 3, we will get Col1 from LkpD which is used to perform looup in LkpE( Col1=LkpE.col1). Perform LkpE in those records which passed condition 3 above. If LkpE fails, write it to an exception file (ExceptionFile4).
I thought of 3 ways to do it:
1. Perform all lookups in one job. Separate transformer Stages will be used to do each lookup and write an exception file in each stage.
2. Perform all lookups in one job. only One transformer stage will be used to perform all lookups and write out exceptions.
3. Design separate jobs for each lookup.
My questions are:
1. Is it possible to do option 2. Can I use just 1 transformer to implement whole logic particularly rule#2and 4?
2. which design is good for optimum performance. Particularly I would like to know between option 1 and 2. I know design option 3 is not optimal as it requires landing of data at each job. In option 1, we can take advantage of inprocess row buffering between stages. But I am not sure if we can do that in option2.
Please share your experience and suggestions.
I will really appreciate them.
Thanks in advance!