Approach in Filling a Relational Database

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Pavan_Yelugula
Premium Member
Posts: 133
Joined: Tue Nov 23, 2004 11:24 pm
Location: India

Approach in Filling a Relational Database

Post by Pavan_Yelugula »

Hi All
I am working on an ETL project that takes data from different sources and loads it into a relational database.
We are looking at two approaches for handling the foreign key constraints on the target relational database.
1st Approach
--------------
Take the data from the source table, apply your business rules and dump the result into the corresponding target table. The target database, which is Oracle, will reject the records that have foreign key violations. Capture the DBMS code and reject those records.

2nd Approach
--------------
Take the data from the source table, look up the foreign key constraints on the ETL side, and only carry forward into the loading step the rows that pass, so the load is clean with no rejections.

I personally prefer the 2nd approach because I think I will have a better grip on the rejects: imagine the target table has 3 foreign keys - if I let the database handle the rejects, I suspect I will get the same ORA-XXX code no matter which key is violated. On the other hand, if there are 60 tables with different relationships, letting the database handle the foreign key constraints would cut out a huge amount of ETL work.
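To illustrate what I mean, here is a rough sketch in plain Python (outside DataStage, with invented table and column names) of checking each foreign key separately so that a reject carries the exact key that failed:

Code:
# Rough sketch of the 2nd approach: check every foreign key in the ETL layer
# and tag each reject with the exact key(s) that failed.
# The key sets would really come from (local copies of) the target parent
# tables; everything below is invented for illustration only.

customer_keys = {101, 102}          # keys present in CUSTOMER
product_keys  = {"A", "B"}          # keys present in PRODUCT
region_keys   = {"EU", "US"}        # keys present in REGION

source_rows = [
    {"order_id": 1, "customer_id": 101, "product_id": "A", "region_code": "EU"},
    {"order_id": 2, "customer_id": 999, "product_id": "C", "region_code": "EU"},
]

clean, rejects = [], []
for row in source_rows:
    reasons = []
    if row["customer_id"] not in customer_keys:
        reasons.append("FK_CUSTOMER missing key %s" % row["customer_id"])
    if row["product_id"] not in product_keys:
        reasons.append("FK_PRODUCT missing key %s" % row["product_id"])
    if row["region_code"] not in region_keys:
        reasons.append("FK_REGION missing key %s" % row["region_code"])
    if reasons:
        rejects.append(dict(row, reject_reason="; ".join(reasons)))
    else:
        clean.append(row)

# 'clean' goes forward to the load step; 'rejects' says exactly which
# constraint(s) each bad row violated.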

Any help will be really appreciated.

Thanks
Pavan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Perform lookups against (local copies of) the target table. Dump the output into a staging area. That way you can run and re-run the referential integrity checking and surrogate key generation phase over and over until you get it right, and you have a good basis for building restart-recovery into your design. Have separate jobs to do the actual loading. These can be fairly "brain-dead" and therefore fast.
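A minimal sketch of that shape (plain Python rather than a job design; file and column names are made up): the referential integrity phase only ever writes to a staging file, so it can be re-run until it is right, and the load phase does nothing but read the staging file and insert.

Code:
import csv

# Phase 1: referential integrity checks - re-runnable, writes only to staging.
def stage_rows(source_rows, customer_keys, staging_path="staged_orders.csv"):
    with open(staging_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "customer_id", "amount"])
        writer.writeheader()
        for row in source_rows:
            if row["customer_id"] in customer_keys:   # lookup against local copy of target
                writer.writerow(row)
    return staging_path

# Phase 2: "brain-dead" load job - no logic, just read staging and insert.
def load_rows(staging_path, insert_fn):
    with open(staging_path, newline="") as f:
        for row in csv.DictReader(f):
            insert_fn(row)    # e.g. a parameterised OCI insert, supplied by the caller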
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bkumar103
Participant
Posts: 214
Joined: Wed Jul 25, 2007 2:29 am
Location: Chennai

Post by bkumar103 »

Hey Pavan

I would prefer option 1. The records can be rejected using the reject link and collected in a sequential file, which can be loaded to the database afterwards. In the second approach, as the number of tables increases the job complexity increases, which will make the jobs clumsy and poor performers.

Thanks,
Birendra
Pavan_Yelugula
Premium Member
Posts: 133
Joined: Tue Nov 23, 2004 11:24 pm
Location: India

Post by Pavan_Yelugula »

Hi Ray
Is there a particular reason for advocating the 2nd approach? Isn't it a huge overhead on the ETL side to do all that referential integrity checking for a complex system?

Birendra
With the 1st approach, imagine we have 3 foreign key constraints on the table. If the client wants specific information, such as which particular key is missing, the records from the reject link won't have that information.

Thanks
Pavan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's part of what ETL is about. I gave the main reason in my earlier post - you retain complete control, and don't update the target table until you're ready to do so.

If your data volumes are large, you can use a "bucket file" to reduce the volume of rows needed from the target. Essentially this involves capturing the keys that occur in the source, saving them to a temporary ("bucket") table and then, when it is time to select rows from the target, doing an inner join with the "bucket" table.
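A sketch of the "bucket" idea (Python over a generic DB-API cursor; the table and column names are invented, and the cursor and the bucket_keys temporary table are assumed to already exist):

Code:
def build_customer_lookup(cur, source_customer_ids):
    """Reduce the reference selection to just the keys seen in today's source."""
    # 1. Capture the distinct source keys into the temporary "bucket" table
    #    (bucket_keys is assumed to be an existing global temporary table).
    cur.executemany(
        "INSERT INTO bucket_keys (customer_id) VALUES (:1)",
        [(k,) for k in set(source_customer_ids)],
    )
    # 2. Select only the matching target rows via an inner join with the bucket.
    cur.execute("""
        SELECT t.customer_id, t.customer_sk
          FROM customer_dim t
          JOIN bucket_keys b ON b.customer_id = t.customer_id
    """)
    return dict(cur.fetchall())   # business key -> surrogate key, keys in play only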
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
psdhali
Participant
Posts: 7
Joined: Tue Nov 21, 2006 10:50 am

Post by psdhali »

Hello Ray,

Imagine there are 100 tables, with each table linked to, say, 10 or more others by referential integrity constraints. If we do it in DataStage and in future we add one more table that links to those 10+ tables, we will need to recode/enhance all of the existing interfaces.

On the other hand, if we let the database do it, we just need to sequence things properly. In any case, since we cannot use the bulk loader - other front ends will be using the database, so we cannot disable constraints, triggers etc. - we will be using the OCI stage. So, assuming data quality is not too bad, the time difference between the two approaches (database load time only) should be negligible.

However, with the ETL-side approach, development, maintenance and future changes/additions to the database will all be really costly.

Regards
Preetinder
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you are systematic about it, it's not that great an obstacle.
I've been doing exactly that for 12 years with DataStage.
And it means that, where the transformation logic differs even slightly, you don't have a problem.
Or if the rules change later, which they have a nasty habit of doing.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dsexplorer
Premium Member
Posts: 23
Joined: Wed Apr 16, 2008 5:34 am
Location: Prague,Cz

Post by dsexplorer »

Hi Ray
Correct me if I am wrong: you are saying to push the referential integrity onto the database, be systematic in the way you fill the primary (parent) tables, and then go ahead with filling the child tables?

Same pitfalls
Our POC on pushing the referential integrity checks onto the database doesn't seem to be working when there are multiple foreign keys on the table. The DBMS code coming out of the Oracle table loading link is just ORA-02291, without really specifying which foreign key was violated.
I can get more information from the DS logs, which do contain the Oracle constraint names, but it would be tedious to dig through the log and attach that to the reject-reason column.
Even digging much deeper into the log doesn't give me more information when there are multiple foreign keys.
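For what it's worth, the standard Oracle message for a missing parent key does name the constraint, e.g. "ORA-02291: integrity constraint (MYSCHEMA.FK_ORDER_CUSTOMER) violated - parent key not found", so a small parse of the error text could feed the reject-reason column - a sketch:

Code:
import re

# Pull the violated constraint name out of an Oracle FK error message.
FK_ERR = re.compile(r"ORA-02291: integrity constraint \(([^)]+)\) violated")

def fk_constraint_from_error(message):
    match = FK_ERR.search(message)
    return match.group(1) if match else None   # e.g. "MYSCHEMA.FK_ORDER_CUSTOMER"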

2nd main concern
We have a requirement that the client wants to process the reject data before we go forward. With the 2nd approach I will be capturing the rejects in the target format, which I can't reuse for the next day's processing.

Thanks
Pavan
Regards
DSExplorer
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard. I can only suggest that your POC is ill-designed.

Grab three hashed files from the target tables. One contains a single record, with a dummy key and the highest current key value. The second is a reverse lookup from the business key (however many columns there are in it) to the surrogate key. This hashed file can also contain the columns that will be used in change detection for slowly changing dimension management. The third is an audit utility - again keyed by the table's business key, it is initialized from the target table and added to as the ETL goes through.

You also need lookup tables (in hashed files is good) so that every possible referential integrity constraint can be checked - and rows that fail these checks can be output in any format you desire.
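In plain terms, the three structures look something like this (a Python sketch using dicts in place of hashed files; all names and columns are invented):

Code:
# Sketch of the three lookup structures, with plain dicts standing in for
# hashed files. Names, keys and columns are made up for illustration.

# 1. Key source: a single "record" with a dummy key and the highest surrogate key so far.
key_source = {"CURRENT_MAX": 1040}

# 2. Reverse lookup: business key -> surrogate key, plus change-detection columns.
customer_xref = {
    ("ACME", "UK"): {"customer_sk": 17, "name": "Acme Ltd", "status": "A"},
}

# 3. Audit: keyed by business key, initialised from the target, added to as the ETL runs.
audit = {("ACME", "UK"): "already in target"}

def surrogate_for(business_key):
    """Return the existing surrogate key, or mint a new one from the key source."""
    hit = customer_xref.get(business_key)
    if hit:
        return hit["customer_sk"]
    key_source["CURRENT_MAX"] += 1
    sk = key_source["CURRENT_MAX"]
    customer_xref[business_key] = {"customer_sk": sk}
    audit[business_key] = "inserted this run"
    return sk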
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

dsexplorer wrote:Our POC on pushing the referential integrity checks onto the database doesn't seem to be working
So are you saying that you are allowing the database to police your RI constraint issues without you making any attempt to ensure you are not violating them before the actual database operation? Bad Idea. :?

As noted, hashed reference lookups for each RI constraint are in order, such that any issues can be logically rejected / handled properly in the job before the database gets involved and upset.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dsexplorer
Premium Member
Posts: 23
Joined: Wed Apr 16, 2008 5:34 am
Location: Prague,Cz

Post by dsexplorer »

Hi Chulett
But isn't that exactly what Ray is suggesting I do?

dsexplorer wrote:Hi Ray
Correct me if I am wrong: you are saying to push the referential integrity onto the database, be systematic in the way you fill the primary (parent) tables, and then go ahead with filling the child tables?
Thanks
Pavan
Regards
DSExplorer
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, I don't believe so. I'm sure he'll clarify when he comes back around to these parts.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I definitely advocate guaranteeing RI within the ETL tool.

Do not rely upon database errors to check RI. For one thing, it's slow.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dsexplorer
Premium Member
Premium Member
Posts: 23
Joined: Wed Apr 16, 2008 5:34 am
Location: Prague,Cz

Post by dsexplorer »

Hi Ray,

Can you explain how to make it generic? If, for example, a table has to check five other tables' data on some key combinations before the load, should we go to the DB dictionary to get the keys, dynamically create the SQL, then use the UniVerse stage ... and reject the missing rows?

Or do you suggest actually implementing lookups for every key, and then, when another key is added, changing the ETL jobs to include that too?
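Not to answer for Ray, but on the dictionary-driven side: Oracle's ALL_CONSTRAINTS and ALL_CONS_COLUMNS views do list every foreign key and its columns, so something along these lines (a sketch; 'cur' is assumed to be an open cx_Oracle-style cursor, and owner/table names are passed in upper case) could drive generated lookup SQL per table:

Code:
FK_SQL = """
SELECT c.constraint_name, col.column_name,
       rcol.table_name AS parent_table, rcol.column_name AS parent_column
  FROM all_constraints c
  JOIN all_cons_columns col
    ON col.owner = c.owner AND col.constraint_name = c.constraint_name
  JOIN all_cons_columns rcol
    ON rcol.owner = c.r_owner AND rcol.constraint_name = c.r_constraint_name
   AND rcol.position = col.position
 WHERE c.constraint_type = 'R'
   AND c.owner = :owner AND c.table_name = :table_name
 ORDER BY c.constraint_name, col.position
"""

def foreign_keys(cur, owner, table_name):
    """Return {constraint_name: [(child_col, parent_table, parent_col), ...]}."""
    cur.execute(FK_SQL, {"owner": owner, "table_name": table_name})
    fks = {}
    for name, col, ptable, pcol in cur.fetchall():
        fks.setdefault(name, []).append((col, ptable, pcol))
    return fks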

Regards
DSExplorer
Post Reply