Loading multiple tables - Looking for Best Practices

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

tacitus
Participant
Posts: 13
Joined: Fri Feb 25, 2005 5:46 pm
Location: Boston, MA

Loading multiple tables - Looking for Best Practices

Post by tacitus »

Hi folks,

I did a pass through the forum and couldn't find much on this.

I want to insert/update multiple DB2 tables with related data. I would very much like to avoid having incomplete data. What are the DS best practices for keeping this data in sync if one of the loads fails?

So far, beyond a complicated rollback strategy, my only obvious solution is to have a table that tracks the status of the related records and then only allows querying once the status is complete. This solution does not get rid of the issue of having incomplete data...it just covers it up.

Any thoughts?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If your rule is to not commit any data unless all input criteria are satisfied, just playing with the built-in COMMIT options won't help you. It might make sense in your case to write all your computed information to a sequential staging file, with all related records written together on one record, and then write a DS job to read this file and discard "incomplete" rows.
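
For illustration only, here's a rough sketch of that filter step outside of DataStage, assuming a pipe-delimited staging file in which each record carries a parent row plus its related detail fields; the file names, delimiter and field count are made up:

Code:
# Discard "incomplete" records from a pipe-delimited staging file.
# File names and field count are hypothetical.
import csv

REQUIRED_FIELDS = 9  # parent + detail columns expected on every complete record

with open("staging.txt", newline="") as src, \
     open("load_ready.txt", "w", newline="") as ok, \
     open("rejects.txt", "w", newline="") as bad:
    reader = csv.reader(src, delimiter="|")
    ok_writer = csv.writer(ok, delimiter="|")
    bad_writer = csv.writer(bad, delimiter="|")
    for record in reader:
        # A record is "complete" only if every related field is present and non-empty.
        if len(record) == REQUIRED_FIELDS and all(f.strip() for f in record):
            ok_writer.writerow(record)
        else:
            bad_writer.writerow(record)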
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Are you talking about maintaining foreign key relationships? Meaning parent and child records are being processed today, and if any of the related rows fail to load, all of them need to fail?

If you have a parent fail during processing, you can easily put those parents into an "avoid" lookup so that children for that parent can be pulled out of processing. However, during the load, as in Oracle, a lot of the high-performance loading techniques bypass the RI checks until after the data is loaded. You can then run into an after-load failure as indexes and constraints are re-enabled.
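
As a rough illustration of the "avoid" lookup idea (in a real job this would be a Lookup stage; the file names and key positions below are made up):

Code:
# Children whose parent key appears in the parent reject set are pulled out of processing.
failed_parents = set()
with open("parent_rejects.txt") as rejects:
    for line in rejects:
        failed_parents.add(line.split("|")[0])   # parent key in the first field

with open("children.txt") as src, \
     open("children_to_load.txt", "w") as ok, \
     open("children_held.txt", "w") as held:
    for line in src:
        parent_key = line.split("|")[0]          # foreign key back to the parent
        (held if parent_key in failed_parents else ok).write(line)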

Now, what if the parent was an update, but the update fails? Should that mean the children don't load, or can the children load anyway? Are you going to process the data in the order of RI, or do you need to process transactionally, meaning sort master and detail by master key, attempt to handle the individual master and then its detail, and if they're okay move them to a load-ready set, then move on to the next master row and handle its detail, and so on? Note I differentiate master and detail from parent and children.

My point is we need a specification so that we can give a better recommendation.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
tacitus
Participant
Posts: 13
Joined: Fri Feb 25, 2005 5:46 pm
Location: Boston, MA

Post by tacitus »

Hi kcbland,

Don't worry about RI. I am confident we can handle issues around key violations prior to loading. And my concern is not with preparing the data and generating the load files. My concern is that we need to insert/update 3-5 tables with related data and we cannot allow partial data across the tables. For example...if we cannot load one of the detail tables...the parent records should not exist either as far as the upstream data consumers are concerned. So far the only decent solution I can come up with is a status column or table that releases records to the upstream apps.

So what happens when the insert/update to table 3 fails due to a connectivity issue...or something like that? If we keep each load in a separate job, then the preceding loads are committed; and if we keep the loads all in the same job, I am pretty certain that even with a commit (transaction size) of zero, the commit restriction is limited to only that stage. So if the failure happens with the 3rd table load, but the first table load is complete (and therefore automatically committed), we have committed some data...but not to all tables.

Is my assumption around committing correct? Or is there some global "all ok" in the job that will tell all stages to commit? This would resolve most possible issues...unless some quirk happened right at the point of commit.

Thanks!
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Your requirement is that many rows of data, maybe millions, across multiple tables, have to appear all at once.

Given a large volume of data, there is no database mechanism available to do such a thing, so no ETL tool can accomplish it either. For small volumes, you can group a small set of rows across tables and issue a single commit. But for high-performance and high-volume loading purposes, this method is poor. Let me enhance that: terrible.

So, either you break your data down into packaged rows across tables and transactionally process them, or you come up with a different method that is also scalable.
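
To make the first option concrete, here is a hedged sketch of transactional packaging, assuming a PEP 249 (DB-API) connection to DB2 (for example via ibm_db_dbi) with autocommit off; the table and column names are invented:

Code:
# Load one master row and its detail rows as a single transaction:
# both appear together, or neither does. Then move on to the next package.
def load_package(conn, master_row, detail_rows):
    cur = conn.cursor()
    try:
        cur.execute(
            "INSERT INTO MASTER (MASTER_KEY, MASTER_NAME) VALUES (?, ?)",
            master_row,
        )
        cur.executemany(
            "INSERT INTO DETAIL (MASTER_KEY, DETAIL_SEQ, AMOUNT) VALUES (?, ?, ?)",
            detail_rows,
        )
        conn.commit()        # master and its detail land together...
        return True
    except Exception:
        conn.rollback()      # ...or not at all
        return False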

One popular method is to add a metadata tag column to the rows to indicate the batch cycle they were loaded under. That batch number has a row in an audit table with a status: GOOD, BAD, or LOADING. Whether your cycle succeeds or fails, there's only one row to update when it finishes. This allows you to commit periodically during a high-performance, high-volume load where a commit-at-the-end is not possible. A rollback doesn't get you far once you've committed periodically, and deletes don't reclaim DB space, so leaving the rows there has the same effect.

The only issue with this approach is that all tables have to be joined to this batch audit table to filter to only successfully loaded rows. But the technique works equally well for 100M rows, 100K rows, or 100 rows.
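
A sketch of the batch audit pattern, again assuming a DB-API connection and invented table/column names; every target table carries a BATCH_NBR column, and one audit row per run flips from LOADING to GOOD or BAD:

Code:
def run_batch(conn, batch_nbr, load_all_tables):
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO BATCH_AUDIT (BATCH_NBR, STATUS) VALUES (?, 'LOADING')",
        (batch_nbr,),
    )
    conn.commit()
    try:
        load_all_tables(conn, batch_nbr)   # high-volume loads, periodic commits
        final_status = 'GOOD'
    except Exception:
        final_status = 'BAD'               # rows stay in place but are filtered out
    cur.execute(
        "UPDATE BATCH_AUDIT SET STATUS = ? WHERE BATCH_NBR = ?",
        (final_status, batch_nbr),
    )
    conn.commit()

# Upstream consumers see only completed batches, e.g.:
#   SELECT f.* FROM FACT_TABLE f
#   JOIN BATCH_AUDIT b ON b.BATCH_NBR = f.BATCH_NBR
#   WHERE b.STATUS = 'GOOD'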
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle