Page 1 of 1

Load Parent and child tables in a single datastage ETL job

Posted: Sun May 31, 2009 12:31 pm
by osuru74
Hi,
Can anyone help what is the best practice to Load Parent and child data in a single job ? So far we have done in two jobs, first loading master data in one job and then loading detail data in second job.But client needs to load both master and detail data in one job ..and is there any better way of doing it ? Your help is much appreciated ?

Thanks & regards//Subra

Posted: Mon Jun 01, 2009 3:35 am
by miwinter
Are there constraints set up within the DB so that you need to do one before the other (i.e. parent before child)? Give us more information about the requirement and it will be easier to try and assist.

Posted: Mon Jun 01, 2009 9:01 am
by vmcburney
There is a distributed transaction stage or distributed transaction support for Oracle in DataStage 8 where you can save the updates/inserts to more than one table as a single transaction - with the same rollback point. Not really sure how it would work with relational source data but it may be a way to load the parent and child tables simultaneously.

Posted: Mon Jun 01, 2009 9:41 am
by chulett
Any idea where one would find this, Vincent? All I've found (so far) in the 8.1 docs is DB2 support and it requires MQ.

Posted: Tue Jun 02, 2009 12:47 pm
by osuru74
Here are the requirements ...thanks for your response..
(1)We receive source data in TXT format and it contains both master and detail data
(2)We need to load this file into two oracle tables tables, Master_Cust and Detail_Cust. But these two tables have referential integrity defined.
(3) How to write a single datastage ETL job to load the above Txt file into the above tables in Oracle. We are using 7.5.1 Parallel extender and now moving to Version 8.1

Thanks & regards//subra

Posted: Tue Jun 02, 2009 1:25 pm
by mfavero
In my experience, the best practice has always been to have the ETL resolve referential integrity. The number of jobs is not relevant. You could have 3 jobs for each table, depending on complexity and modularity.

The load images for the related tables would be generated as seq files or datasets and then loaded in the proper order.

Often scripts are used to bulk load tables and often RI is dropped for performance while the load is taking place. Then RI is turned back on when the load is complete. Since the ETL has guaranteed referential integrity, this is safe and sure. You can substitute DS jobs for load scripts which would be functionally equivalent.

If there is a new Oracle utility that loads to parent/child tables simultaneously, you could use a script to call that. I have not heard anything about a new Oracle plug-in which accomplishes the same.

Posted: Wed Jun 03, 2009 10:50 am
by chulett
As noted, this should be feasible to do in one job as long as you order the links properly to support the RI. It's one place where you seem to have less control over how it all works in a PX job over Server, in Server you can use a single DB stage to keep it all in one transaction and even use 'transaction grouping' across multiple tables if desired.

I've used two OE stages, parent and child, with standard generated Upsert sql in both stages to load both tables at the same time. I didn't have to do anything 'special' for this other than order the links properly, and used the default (non-specified) array size for the inserts, whatever that turns out to be. I also added $APT_ORAUPSERT_COMMIT_ROW_INTERVAL = 0 to (hopefully) get an 'all or nothing' load. Need to really build a test case or two to ensure that is, in fact, happening.

So... what had you tried before posting, anything? What issues did you have?

Posted: Wed Jun 03, 2009 12:43 pm
by chulett
Funny. I made the previous post after having built a small test job to load 10 records into related work tables and then running it once. That was the only time it ran without issue. :lol:

As I manipulated the data, or truncated the tables for a reload, the parent table would load fine but various (and seemingly ever changing) child records would be rejected. Keep in mind this was a pretty plain vanilla load (stage defaults) with our default four node config and auto partitioning... which contributed to the issue.

I found I had to make two changes to get this to load reliably - A) set partioning in both OE stages to 'hash' based on the parent id key I had so parent/child records stayed together regardless of node and B) set the Insert Array Size to 1. Not ideal from a 'load performance' standpoint but that's what made it work for me. IMHO, you should stick with the two job approach if tuning each part of this load to its utmost bestest performance (whatever that means in an ETL context) is your goal.

Posted: Wed Jun 03, 2009 1:31 pm
by chulett
Wow... take all this with a grain of salt. What I thought was working reliably is just more reliable but does not work 100% of the time. Going to keep experimenting to see what else it takes, suspect I may need to knock it down to running on one node but even that may not be the ultimate answer since each OE stage is a discrete connection. The 'ultimate' answer might very well be to use two PX jobs. Or a Server job. :wink:

Posted: Wed Jun 03, 2009 1:55 pm
by JRodriguez
Guys,

Unfortunatelly for Oracle shops, the Distributed transaction stage only support DB2, Oracle is not supported yet, even the documentation clearly mention that Oracle is supported in more than one place ........


vmcburney wrote:There is a distributed transaction stage or distributed transaction support for Oracle in DataStage 8 where you can save the updates/inserts to more than one table as a single transaction - with the same rollback point. Not really sure how it would work with relational source data but it may be a way to load the parent and child tables simultaneously.

Posted: Wed Jun 03, 2009 3:26 pm
by chulett
For the record, I've officially thrown in the towel on this. Nothing I can think of - including running on a single node and setting the commit row interval to 1 - makes this work. Nothing. :(

Keep in mind this is a 7.5.2 install on Linux with Oracle 10g, no idea if 8.x or other changes would make this work. All I can say for certain here is that I can't do this in one PX job in this environment. If I had to guess why it would be the simple fact that you are required to use two stages and thus two discrete and unrelated as far as Oracle knows connections.

Farg.

FYI: I've done this in Server, so I know it is possible there.

Posted: Wed Jun 03, 2009 4:03 pm
by chulett
Just for grins, I tried one last thing but it falls squarely into what I would call a "cheese" solution. I added a Sort stage before the child OE stage in such a way as to delay their arrival until after the parents are home. And that works. Sure. I still need to run on one node, etc etc etc so I may as well have two jobs at that point but, hey... it "works". :roll:

Posted: Wed Jun 03, 2009 4:09 pm
by ray.wurlod
The only stage I've seen that supports transaction grouping is the ODBC stage in server jobs.

Posted: Wed Jun 03, 2009 4:16 pm
by chulett
OCI in Server does as well.