
Hi Performance job design

Posted: Thu Jul 09, 2009 10:30 am
by dsuser_cai
Hi

I have a requirement: I need to create jobs to populate a table (insert only). Here are the table details.

The source table has nearly 50 columns, and the target table looks like this: col1_pk, col2, col3_fk, col4_fk, ..., col59 (I have indicated the primary key and foreign keys with PK and FK). The target has one primary key and two foreign keys.

I designed a parallel job that extracts data from the source (ODBC stage), uses Lookup stages against the foreign-key reference tables on the target side, combines the results with a Merge stage, then uses another Lookup against the target to check whether each row already exists, generates the keys, and loads the target (insert only). My manager said this is a poor design and that I need to remove the Lookup stages, or use far fewer of them (ultimately reduce the number of stages), and redesign the job.

The source table also does not have a creationdate or updatedate column to drive a delta extract, so my design does a full extract on every run (on the assumption that the source is updated between runs). One thought I have for removing the Lookup stages is to write a SQL query that resolves the foreign keys during extraction; a rough sketch of that idea is below. The data volume is going to be huge (millions of rows). Can anybody suggest a design approach, or some points where I can make improvements? Any help would be appreciated.
I used auto partition everywhere.
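
Roughly, the SQL idea I mentioned would look something like this. The table names (src_table, ref_table1, ref_table2) and the natural-key columns are just placeholders, not the real names:

[code]
-- Sketch only: resolve the two foreign keys in the extraction query itself
-- instead of in separate Lookup stages. src_table, ref_table1, ref_table2
-- and the *_natural_key columns are placeholder names.
SELECT
    s.col2,
    r1.surrogate_key AS col3_fk,    -- FK resolved from the first reference table
    r2.surrogate_key AS col4_fk,    -- FK resolved from the second reference table
    s.col5                          -- remaining source columns omitted for brevity
FROM src_table s
LEFT JOIN ref_table1 r1 ON r1.natural_key = s.ref1_natural_key
LEFT JOIN ref_table2 r2 ON r2.natural_key = s.ref2_natural_key;
[/code]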


Posted: Thu Jul 09, 2009 12:14 pm
by jcthornton
FYI - Feel free to use returns and whitespace to make your explanation and question easier to read.

Your design raises a number of questions that you asked indirectly. I am going to sketch out the job the way I understand your explanation, then put down some things to think about.

Unfortunately, ETL is not simply a recipe. There are too many variables and conflicting design parameters. For the best performance, you will need to test different adjustments to find what offers the best balance of performance measurements within your environment.

ODBC -> Copy => Multiple Lookups (using ODBC reference link) => Merge -> Lookup (ODBC reference link) -> Keygen -> Ins ODBC

1. What do you do with update records? If you drop them, move the lookup that checks whether a record already exists to the beginning of the job. Best practice is to retrieve as few records as you can and to do as little processing as possible before discarding a record. In other words: as few records as possible, as early as possible. (A rough SQL sketch of this appears after this list.)

2. On all the lookups, you have to choose the strategy you will use. Your options include the Merge, Join or Lookup stages. You also have to decide among a database lookup (sparse or normal), a Lookup File Set, a Sequential File, or a Data Set for your reference/update link(s).

The choices on lookups are significant and have been discussed many times here. Do a search on the forums and talk with anybody local who has some DataStage experience for pointers. If you have a specific question about strategy and cannot find an answer in the forums, post a specific follow-up question. The answer will probably be "It depends", but a specific question with more details has a better chance of getting a usable answer.

3. As I understand your explanation, you can do the same job without the Merge stage. Move the lookups in-line.

4. If you have not done it already, you can also combine the lookups into a single Lookup stage with multiple reference links. The potential downside, which can be worked around, is that if you use a reject link you cannot tell which lookup failed.
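
To make point 1 a little more concrete, here is a rough SQL equivalent of doing the existence check up front so that rows already in the target are discarded before any further processing. All table and column names are placeholders; in DataStage this would be the first lookup/filter in the job rather than literal SQL.

[code]
-- Illustration only (placeholder names): keep only the rows that do not
-- already exist in the target, before any other lookups or transforms run.
SELECT s.*
FROM   src_table s
WHERE  NOT EXISTS (
           SELECT 1
           FROM   target_table t
           WHERE  t.natural_key = s.natural_key   -- match on the business key
       );
[/code]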

Posted: Thu Jul 09, 2009 1:47 pm
by dsuser_cai
Hi jcthornton

Thank you so much. Your suggestions were very useful.

1) What do you do with update records? This really made me think. I will move the target lookup to the beginning, so I will be pulling and processing far less data. Also, we are going to add creationdate and updatedate columns to the source, so I will be able to pull only the differential (delta) data, roughly as sketched below.
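
Once those columns are in place, the delta extract would be something along these lines (the column and parameter names are just what I am assuming for now):

[code]
-- Sketch of the planned delta extraction; :last_run_timestamp would come
-- from a job parameter holding the previous successful run time.
SELECT *
FROM   src_table
WHERE  creationdate > :last_run_timestamp
   OR  updatedate   > :last_run_timestamp;
[/code]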

Posted: Thu Jul 09, 2009 3:39 pm
by mansoor_nb
Instead of a lookup, you can use the Change Capture stage to identify the insert and update records; it is also efficient. (A rough SQL analogue of the comparison is sketched below.)
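
In rough SQL terms, the comparison the stage performs for you looks something like this, tagging each incoming row as an insert or an update (all names here are placeholders; the stage itself emits a change code column):

[code]
-- SQL analogue of the change-detection idea (placeholder names): compare
-- the incoming rows ("after") to the target ("before") on the business key.
SELECT
    s.natural_key,
    CASE
        WHEN t.natural_key IS NULL THEN 'insert'   -- key not in target yet
        ELSE 'update'                              -- key exists; values may differ
    END AS change_type
FROM src_table s
LEFT JOIN target_table t ON t.natural_key = s.natural_key;
[/code]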

Posted: Thu Jul 09, 2009 4:53 pm
by ray.wurlod
Does "Hi" performance mean your job generates a "Hello" message?

Please strive for a professional standard of written English on DSXchange. It helps those whose first language is not English.

Posted: Thu Jul 09, 2009 5:36 pm
by chulett
I guess so, because if you speed the job up you can say "Hello performance!". :wink: