Splitting of transformation and loading jobs.

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

gsherry1
Charter Member
Posts: 173
Joined: Fri Jun 17, 2005 8:31 am
Location: Canada

Splitting of transformation and loading jobs.

Post by gsherry1 »

Hello Forum,

I've seen, in DataStage and other ETL tools, developers explicitly split the transformation that produces load files and the loading of those files into tables into separate jobs.

The main argument I've heard in favor of this is that it simplifies recovery from transformation errors by leaving the DBMS out of the picture, and it also eliminates timeout-type errors that occur when a loader is spawned but does not receive data for several minutes because of an upstream bottleneck.

In my opinion there are a couple of downsides: there is usually a performance drop due to reduced pipeline parallelism, and there is increased development effort from creating and maintaining a separate job.

I am interested in whether splitting transformation and load jobs is justified if the transformation logic is light. What are the DataStage best practices surrounding this topic?

Thanks,

Greg
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The vendor's "Best Practices" class used to teach that staging prior to loading was best practice. Primarily, as you say, for restartability, and because you can do all the transformation logic even if the target database is unavailable. I have not seen this class for some time, so do not know if that's still the advice they give.

To my mind, it's now a matter of how you want to build your recovery logic, and how likely it is that your hardware/software will fail. If you have dual (or better) everything, then a strategy that relies on the system being 100% available seems a more sensible choice.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Jim Grover
Charter Member
Posts: 6
Joined: Tue May 09, 2006 1:57 pm

Post by Jim Grover »

I'm curious what others use as a "rule of thumb". For instance, we have lots of ETL that sources fewer than 100,000 records (expected to reach maybe 700,000 at some point). If there are 10 lookups and simple transformations, would you have one job to create lookup datasets, another to do the lookups and transformations, and a third to load the data? Having it all in one job runs in less than two minutes on a development system with 2 CPUs; PROD has 4 CPUs. I know practices differ everywhere, but I'm curious about general rules/guidelines and a summary of reasons. We are using Oracle 10g with 30 GB of RAM, so the number of tables in use is not really a big factor.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Depends....

If you have a table that is used as a reference in many different sets of transformations, do you want each job to pull that same table from the database every time or would you prefer a predecessor job pull it into a lookup dataset?

What about the situation where your average input is a daily delta of 100K rows and your target table is 100M rows? Would you pull all 100M rows into a Merge stage just to throw away 99.9% of them to determine what's an insert and what's an update? Would it make more sense to first scan the 100K rows for natural keys, load those keys into a work table on the target side, and then inner-join that work table to the target table, pulling at most 100K rows into the Merge operation?
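
To make that concrete, here is a minimal sketch of the work-table approach, assuming a DB-API connection such as cx_Oracle and purely illustrative table and column names (cust_delta_keys, dim_customer, cust_nk):

Code:

def stage_delta_candidates(conn, delta_rows):
    """Load the daily delta's natural keys into a work table, then pull only
    the matching target rows (at most the size of the delta) for the Merge."""
    cur = conn.cursor()

    # 1. Put the ~100K natural keys from the delta into the work table.
    cur.execute("TRUNCATE TABLE cust_delta_keys")
    cur.executemany(
        "INSERT INTO cust_delta_keys (cust_nk) VALUES (:1)",
        [(row["cust_nk"],) for row in delta_rows],
    )

    # 2. Inner-join the work table to the 100M-row target: the database
    #    returns at most ~100K existing rows instead of the whole table.
    cur.execute("""
        SELECT t.cust_sk, t.cust_nk, t.cust_name
          FROM dim_customer t
          JOIN cust_delta_keys k ON k.cust_nk = t.cust_nk
    """)
    existing = {row[1]: row for row in cur.fetchall()}  # keyed by natural key
    conn.commit()

    # Delta rows whose key is in `existing` become updates; the rest inserts.
    return existing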

How about the situation where you didn't stage your load data into insert-ready and update-ready load files per table? During transformation you're inserting and updating within the execution stream. If the job aborts for some easily rectified reason, say running out of extent or temp space, and you're past your commit points, you'll have to remove the already loaded rows from the database with a delete operation; otherwise restarting the failed job is complicated by the partially loaded data from the failed run.
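
To make that restart headache concrete, here is a minimal sketch of the cleanup that approach forces before a rerun, assuming (purely for illustration) that every loaded row carries a run_id column identifying the batch; many designs have no such handle:

Code:

def clean_up_failed_run(conn, run_id, tables):
    """Delete the rows committed by a failed transform-and-load run.
    With staged insert/update load files you would simply re-run the
    transform instead of touching the target tables at all."""
    cur = conn.cursor()
    for table in tables:  # e.g. ["dim_customer", "fact_sales"]
        # Only works because of the assumed run_id column on every target table.
        cur.execute("DELETE FROM " + table + " WHERE run_id = :1", (run_id,))
    conn.commit()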

I think the most important architectural reason not to immediately load data during transformation, and instead use a "working" area to hold transformed results, is the case where transformation errors aren't visible until subsequent tables are transformed. Consider a three-table relationship: a dimension, its parent dimension, and a fact table. During transformation of the parent dimension, something goes awry and an abnormal number of the parents go into an exception file. The child dimension also fails to transform the related rows, because the foreign key lookup is not satisfied. Now, when the fact transforms, even if the source data is perfect, the dimensions are not found in the surrogate foreign key substitution. If the transformation of a series of tables is part of a single larger integrated ETL process, there's the opportunity to measure the quality of the transformation and stop prior to affecting the target tables. If you "load as you go" through the dimensions and on to the facts, you've already modified your target tables on your way to the facts.

Deferred load files, or even a scratch copy of the target tables within an ETL schema, are ways to first assemble all of the target data, transforming as much as possible, and then, when tolerances are met, mass-load the target tables. Sequential files are preferred because they can be archived as well as easily viewed, whereas a database scratch schema is only good until the next run. Ralph Kimball's ETL Toolkit talks a lot about these concepts.
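
A rough sketch of that "measure the quality, then mass-load" control point, assuming the transform jobs write per-table load and reject files under a staging directory (all paths, names and thresholds are illustrative):

Code:

import os

# Hypothetical staging layout: <STAGE_DIR>/<table>.load and <STAGE_DIR>/<table>.rej
STAGE_DIR = "/etl/stage/current_run"
TOLERANCE = 0.01  # stop the load if more than 1% of rows were rejected

def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

def within_tolerance(table):
    loaded = count_lines(os.path.join(STAGE_DIR, table + ".load"))
    rejected = count_lines(os.path.join(STAGE_DIR, table + ".rej"))
    total = loaded + rejected
    return total == 0 or rejected / total <= TOLERANCE

tables = ["dim_parent", "dim_child", "fact_sales"]
if all(within_tolerance(t) for t in tables):
    print("All tables within tolerance - safe to kick off the load jobs")
else:
    print("Tolerance breached - the target tables have not been touched")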

One last thing, as an interesting anecdote to impart, since I'm a firm believer in staging methods. A couple of years ago a customer of mine had a 2.4+ TB warehouse crash on Christmas day. The DB was recovered, but the last full backup was two weeks old. To catch up the DB, we could either reprocess every day (at 5 hours per daily run) and catch up in a couple of weeks, or simply take each day's archived load set and apply the inserts and updates (part of our design anyway, the controls already existed) until the database was caught up. The next day it was up and current as of two days prior, and then only two days of daily processing was required.

Hope this helps.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Jim Grover
Charter Member
Posts: 6
Joined: Tue May 09, 2006 1:57 pm

Post by Jim Grover »

Thanks Kenneth, lots of good points. It would be great to hear from others. I've been leaning towards an approach like the one below for some time:
- process common lookups used in multiple places into a lookup dataset once
- for dimensions with very few records (under 10,000), process and load in one job; especially if there are only a couple of lookups for descriptions that are "nice to have" but not required
- for dimensions with "critical" lookups, transform in one job and load in a separate job, with the load conditional on the number of errors kicked out by the transform job
- for smaller delta runs for facts (maybe up to 100,000 or 250,000 rows), look up against tables where the first bullet above does not apply and output to an update dataset and an insert dataset, each loaded by a separate job (a rough sketch of this split follows the list)
- for larger, slower-running delta runs for facts, have a separate job create lookup datasets for all the lookups, then transform and load via jobs as in the previous bullet
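
A minimal sketch of the insert/update split from the fourth bullet, standing in for the PX lookup with plain Python and illustrative file and column names; rows whose natural key already exists in the target go to the update dataset, the rest to the insert dataset:

Code:

import csv

def split_delta(delta_path, existing_keys, insert_path, update_path):
    """Route each delta row to an insert or an update file, based on a
    lookup of natural keys already present in the target table."""
    with open(delta_path) as src, \
         open(insert_path, "w", newline="") as ins, \
         open(update_path, "w", newline="") as upd:
        reader = csv.DictReader(src)
        ins_writer = csv.DictWriter(ins, fieldnames=reader.fieldnames)
        upd_writer = csv.DictWriter(upd, fieldnames=reader.fieldnames)
        ins_writer.writeheader()
        upd_writer.writeheader()
        for row in reader:
            # natural key already in the target -> update, otherwise insert
            writer = upd_writer if row["cust_nk"] in existing_keys else ins_writer
            writer.writerow(row)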

This, of course, is only a general guideline, but any comments or suggestions?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

On my last project we took the approach of loading smaller volumes directly and staging the larger volumes to load-ready parallel datasets. You might think that splitting a job into two (prepare and load) or three (prepare, insert and update) would result in more development time, but it becomes easier to debug problems and to optimise the jobs, and the load/insert/delete/update jobs are often reusable across multiple prepare jobs. So it can actually lead to shorter development times, especially as the load jobs are extremely easy to write and test.
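
As an illustration of that reuse, a single generic load job can be driven from a control script with different dataset/table parameters; a sketch assuming the standard dsjob command line, with the job name and parameter names made up for the example:

Code:

import subprocess

def run_load_job(project, dataset, table):
    """Invoke one generic parallel 'load' job for a given dataset/table pair.
    'GenericLoad' and its parameter names are illustrative only."""
    subprocess.run(
        ["dsjob", "-run", "-wait",
         "-param", "pSourceDataset=" + dataset,
         "-param", "pTargetTable=" + table,
         project, "GenericLoad"],
        check=True,
    )

# The same load job serves several different prepare jobs:
for ds, tbl in [("/etl/ds/dim_customer.ds", "DIM_CUSTOMER"),
                ("/etl/ds/fact_sales_ins.ds", "FACT_SALES")]:
    run_load_job("DWH_PROJECT", ds, tbl)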

Any development overhead from the extra load jobs is returned with interest in production, where administration and support become significantly easier, and in end-user satisfaction, as rollback and recovery times become a lot shorter.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

kcbland wrote:What about the situation where your average input is a daily delta of 100K rows and your target table is 100M rows? Would you pull all 100M rows into a Merge stage just to throw away 99.9% of them to determine what's an insert and what's an update? Would it make more sense to first scan the 100K rows for natural keys, load those keys into a work table on the target side, and then inner-join that work table to the target table, pulling at most 100K rows into the Merge operation?
Not to quote myself, but...

This is where ELT versus ETL seems so powerful. Because the source data (E) is loaded (L) into a work table in the target database and then transformed (T), the massive Merge operation is naturally highly efficient because it's simply a database inner join. We can effect the same result in "ETL" using the method I quoted. Thus the ELT folks lose their compelling performance argument and fall back on database parallel processing. We defeat that argument in Server with multiple job instances, and crush it in PX with massively scalable distributed processing across nodes that aren't even within the database environment, whereas Sunopsis is stuck with its generated SQL/DML inside the database engine.
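
For comparison, once the delta sits in a work table, the database-side merge described above is just an inner-join-driven MERGE; a sketch with illustrative Oracle table, column and sequence names:

Code:

# The "massive" merge reduces to a join between the 100K-row work table
# and the target; all names below are illustrative.
MERGE_SQL = """
    MERGE INTO dim_customer t
    USING cust_delta_work d
       ON (t.cust_nk = d.cust_nk)
     WHEN MATCHED THEN
          UPDATE SET t.cust_name = d.cust_name
     WHEN NOT MATCHED THEN
          INSERT (cust_sk, cust_nk, cust_name)
          VALUES (cust_sk_seq.NEXTVAL, d.cust_nk, d.cust_name)
"""

def merge_delta(conn):
    cur = conn.cursor()
    cur.execute(MERGE_SQL)
    conn.commit()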
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle