Load parent and child tables in a single DataStage ETL job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

osuru74
Premium Member
Posts: 5
Joined: Wed Feb 16, 2005 9:45 pm

Load parent and child tables in a single DataStage ETL job

Post by osuru74 »

Hi,
Can anyone advise on the best practice for loading parent and child data in a single job? So far we have done it in two jobs: the first loads the master data and the second loads the detail data. But the client needs both master and detail data loaded in one job. Is there a better way of doing this? Your help is much appreciated.

Thanks & regards//Subra
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

Are there constraints set up within the DB so that you need to do one before the other (i.e. parent before child)? Give us more information about the requirement and it will be easier to try and assist.
Mark Winter
Nothing appeases a troubled mind more than good music
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

There is a distributed transaction stage or distributed transaction support for Oracle in DataStage 8 where you can save the updates/inserts to more than one table as a single transaction - with the same rollback point. Not really sure how it would work with relational source data but it may be a way to load the parent and child tables simultaneously.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Any idea where one would find this, Vincent? All I've found (so far) in the 8.1 docs is DB2 support and it requires MQ.
-craig

"You can never have too many knives" -- Logan Nine Fingers
osuru74
Premium Member
Posts: 5
Joined: Wed Feb 16, 2005 9:45 pm

Post by osuru74 »

Thanks for your response. Here are the requirements:
(1) We receive the source data as a TXT file, and it contains both master and detail data.
(2) We need to load this file into two Oracle tables, Master_Cust and Detail_Cust, which have referential integrity defined between them (see the sketch below).
(3) How do we write a single DataStage ETL job to load this TXT file into those two Oracle tables? We are using 7.5.1 Parallel Extender and are now moving to version 8.1.
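
For reference, the relationship looks roughly like the following; the column names and types here are illustrative guesses, not our actual DDL:

    -- Illustrative DDL only: cust_id, cust_name, detail_id and amount are
    -- made-up columns; the parent/child foreign key is the point.
    CREATE TABLE Master_Cust (
        cust_id    NUMBER        NOT NULL,
        cust_name  VARCHAR2(100),
        CONSTRAINT pk_master_cust PRIMARY KEY (cust_id)
    );

    CREATE TABLE Detail_Cust (
        detail_id  NUMBER        NOT NULL,
        cust_id    NUMBER        NOT NULL,
        amount     NUMBER(12,2),
        CONSTRAINT pk_detail_cust PRIMARY KEY (detail_id),
        CONSTRAINT fk_detail_master FOREIGN KEY (cust_id)
            REFERENCES Master_Cust (cust_id)
    );

So any Detail_Cust row must have a matching Master_Cust parent before it can be inserted.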

Thanks & regards//subra
mfavero
Premium Member
Posts: 45
Joined: Thu Jan 16, 2003 1:20 pm
Location: Minneapolis

Post by mfavero »

In my experience, the best practice has always been to have the ETL resolve referential integrity. The number of jobs is not relevant. You could have 3 jobs for each table, depending on complexity and modularity.

The load images for the related tables would be generated as seq files or datasets and then loaded in the proper order.

Scripts are often used to bulk load the tables, and RI is often disabled for performance while the load takes place, then re-enabled when the load is complete. Since the ETL has already guaranteed referential integrity, this is safe. You can substitute DataStage jobs for the load scripts; they are functionally equivalent.
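
In Oracle, that step is usually just a matter of toggling the foreign key around the load; the table and constraint names below are the illustrative ones from earlier in the thread, not a real schema:

    -- Disable the child-side FK before the bulk load.
    ALTER TABLE Detail_Cust DISABLE CONSTRAINT fk_detail_master;

    -- ... bulk load Master_Cust and Detail_Cust here ...

    -- Re-enable and validate once both tables are loaded; this will fail
    -- if the ETL let any orphan child rows through.
    ALTER TABLE Detail_Cust ENABLE VALIDATE CONSTRAINT fk_detail_master;

Disabling the constraint just takes the database check out of the critical path; the ETL still has to produce consistent parent and child load images.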

If there is a new Oracle utility that loads parent/child tables simultaneously, you could call it from a script. I have not heard of a new Oracle plug-in that accomplishes the same thing.
Michael Favero

2852 Humboldt Ave So
Minneapolis, MN 55408
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

As noted, this should be feasible to do in one job as long as you order the links properly to support the RI. It's one place where you seem to have less control over how it all works in a PX job than in a Server job: in Server you can use a single DB stage to keep it all in one transaction and even use 'transaction grouping' across multiple tables if desired.

I've used two OE stages, parent and child, with the standard generated Upsert SQL in both stages to load both tables at the same time. I didn't have to do anything 'special' other than order the links properly, and I used the default (unspecified) array size for the inserts, whatever that turns out to be. I also added $APT_ORAUPSERT_COMMIT_ROW_INTERVAL = 0 to (hopefully) get an 'all or nothing' load. I still need to build a test case or two to confirm that is, in fact, happening.
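
For what it's worth, the auto-generated upsert boils down to an update/insert pair along these lines for each table. The columns are the made-up Master_Cust ones from earlier in the thread, and ORCHESTRATE.column_name stands in for the incoming link columns, so treat this as a sketch rather than the exact statements the stage emits:

    -- Roughly what the generated upsert amounts to for the parent table.
    UPDATE Master_Cust
       SET cust_name = ORCHESTRATE.cust_name
     WHERE cust_id   = ORCHESTRATE.cust_id;

    -- Rows that update nothing are inserted instead.
    INSERT INTO Master_Cust (cust_id, cust_name)
    VALUES (ORCHESTRATE.cust_id, ORCHESTRATE.cust_name);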

So... what had you tried before posting, anything? What issues did you have?
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Funny. I made the previous post after having built a small test job to load 10 records into related work tables and then running it once. That was the only time it ran without issue. :lol:

As I manipulated the data, or truncated the tables for a reload, the parent table would load fine but various (and seemingly ever-changing) child records would be rejected. Keep in mind this was a pretty plain-vanilla load (stage defaults) with our default four-node config and auto partitioning... which contributed to the issue.

I found I had to make two changes to get this to load reliably: A) set the partitioning in both OE stages to 'hash' on the parent ID key so parent/child records stayed together regardless of node, and B) set the Insert Array Size to 1. Not ideal from a 'load performance' standpoint, but that's what made it work for me. IMHO, you should stick with the two-job approach if tuning each part of this load to its utmost bestest performance (whatever that means in an ETL context) is your goal.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Wow... take all this with a grain of salt. What I thought was working reliably is just more reliable; it does not work 100% of the time. I'm going to keep experimenting to see what else it takes. I suspect I may need to knock it down to running on one node, but even that may not be the ultimate answer since each OE stage is a discrete connection. The 'ultimate' answer might very well be to use two PX jobs. Or a Server job. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City

Post by JRodriguez »

Guys,

Unfortunately for Oracle shops, the Distributed Transaction stage only supports DB2; Oracle is not supported yet, even though the documentation clearly mentions Oracle support in more than one place...


vmcburney wrote:There is a distributed transaction stage or distributed transaction support for Oracle in DataStage 8 where you can save the updates/inserts to more than one table as a single transaction - with the same rollback point. Not really sure how it would work with relational source data but it may be a way to load the parent and child tables simultaneously.
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

For the record, I've officially thrown in the towel on this. Nothing I can think of - including running on a single node and setting the commit row interval to 1 - makes this work. Nothing. :(

Keep in mind this is a 7.5.2 install on Linux with Oracle 10g; I have no idea if 8.x or other changes would make this work. All I can say for certain is that I can't do this in one PX job in this environment. If I had to guess why, it would be the simple fact that you are required to use two stages, and thus two discrete connections that are, as far as Oracle knows, unrelated.

Farg.

FYI: I've done this in Server, so I know it is possible there.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Just for grins, I tried one last thing, but it falls squarely into what I would call a "cheese" solution. I added a Sort stage before the child OE stage in such a way as to delay the child rows' arrival until after the parents are home. And that works. Sure. I still need to run on one node, etc. etc., so I may as well have two jobs at that point, but hey... it "works". :roll:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The only stage I've seen that supports transaction grouping is the ODBC stage in server jobs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

OCI in Server does as well.
-craig

"You can never have too many knives" -- Logan Nine Fingers