Best practice for triggering data load

stiantok · Post by **stiantok** » Thu Oct 23, 2014 1:01 pm

I guess my scenario is pretty common: I have a source database which has a log table that indicates that table x, y, z (and so on) is ready to be loaded into my DWH. The tables get ready at different times during the night, and I would like to be able to 1) load them one-by-one as they get ready, or 2) load all of them when all tables are ready.

Does anyone have any best practice/smart solutions for doing this kind of "triggering"?

PS: I use one generic job (multiple instance) for staging all the source tables.

PaulVL · Post by **PaulVL** » Thu Oct 23, 2014 2:59 pm

That really depends on the quantity of tables you are talking about.

As an administrator, I would say trickle load.

If data is ready to load, do it.
Don't wait for ALL tables to be ready.

As a developper or subject matter expert of your data flow, you have to understand what the impact is IF one table never gets that Ready to Load flag. You could have updated 99/100 tables... and never get that last one.

You now have a potential big problem, and might have to backtrack those 99 to their prior state (only you and your data architect know that).

Now, I would create a job that runs every 20 mins (or timeframe that is up to you), to poll that status table and kick off any jobs for tables that are now ready. You'll need to flag an entry as "being loaded" so you don't duplicate your job runs. Then flag it as Loaded once done.

You might also want that table to contain a datestamp for your successful load time. You'll at least be able to see when the last time that table was loaded via batch.

Ultimately.... You have to understand the flow of your data to be able to answer the all at once, or trickle load.

(All at once puts a lot of burden on the ETL DataStage servers vs trickle loading.)