
Approach to Filling a Relational Database

Posted: Wed Apr 16, 2008 2:04 am
by Pavan_Yelugula
Hi All,
I am working on an ETL project which takes data from different sources and loads it into a relational database.
We are looking at 2 approaches for handling the foreign key constraints on the target relational database.
1st Approach
--------------
Take the data from the source table, apply your business rules, and load it into the corresponding target table. The target database, which is Oracle, will reject the records that have foreign key violations. Capture the DBMS code and reject those records.

2nd Approach
--------------
Take the data from the source table, look up the foreign key constraints on the ETL side, and only take forward into the loading step data that will load cleanly with no rejections.

I personally prefer the 2nd approach, as I think I will have a better grip on the rejects: if I have 3 foreign keys on the target table, I suspect I will get the same ORA-XXX code whichever one is violated if I let the database handle the rejects. On the other hand, if there are 60 tables with different relations, letting the database handle the foreign key constraints would cut out a hell of a lot of ETL work.

Any help will be really appreciated.

Thanks
Pavan

Posted: Wed Apr 16, 2008 2:14 am
by ray.wurlod
Perform lookups against (local copies of) the target table. Dump the output into a staging area. That way you can run and re-run the referential integrity checking and surrogate key generation phase over and over until you get it right, and you have a good basis for building restart-recovery into your design. Have separate jobs to do the actual loading. These can be fairly "brain-dead" and therefore fast.
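
As a rough SQL sketch of the staging-side check (table and column names such as STG_ORDERS and DIM_CUSTOMER_KEYS are only placeholders, not anything from your actual model):

Code: Select all

-- Rows whose foreign key has no parent go to a reject table with an
-- explicit reason, rather than being bounced by Oracle later.
INSERT INTO stg_orders_reject (order_id, customer_id, reject_reason)
SELECT s.order_id, s.customer_id, 'Missing parent in DIM_CUSTOMER'
FROM   stg_orders s
LEFT   JOIN dim_customer_keys k ON k.customer_id = s.customer_id
WHERE  k.customer_id IS NULL;

-- Clean rows go to a load-ready table, so the actual load job can be a
-- simple, fast insert with no expected rejections.
INSERT INTO stg_orders_clean
SELECT s.*
FROM   stg_orders s
JOIN   dim_customer_keys k ON k.customer_id = s.customer_id;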

Posted: Wed Apr 16, 2008 2:28 am
by bkumar103
Hey Pavan,

I would prefer option 1. The records can be rejected using the reject link and collected in a sequential file, which can then be loaded to the database. In the second approach, as the number of tables increases the job complexity increases, which will make the jobs clumsy and poor performers.

Thanks,
Birendra

Posted: Wed Apr 16, 2008 3:31 am
by Pavan_Yelugula
Hi Ray
Is there some particular reason for advocating the 2nd approach? Isn't it a huge overhead on the ETL side to do all that referential integrity checking for a complex system?

Birendra
With the 1st approach, imagine we have 3 foreign key constraints on the table. If the client wants specific information, such as which particular key is missing, the records from the reject link won't have that information.

Thanks
Pavan

Posted: Wed Apr 16, 2008 5:37 am
by ray.wurlod
That's part of what ETL is about. I gave the main reason in my earlier post - you retain complete control, and don't update the target table until you're ready to do so.

If your data volumes are large, you can use the "bucket file" technique to reduce the volume of rows needed from the target. Essentially this involves capturing the keys that occur in the source, saving those to a temporary ("bucket") table and then, when it is time to select rows from the target, doing an inner join with the "bucket" table.
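
A rough sketch of that, again with placeholder names (STG_ORDERS, BUCKET_CUSTOMER_KEYS, DIM_CUSTOMER):

Code: Select all

-- 1. Capture the distinct keys that actually occur in today's source feed.
INSERT INTO bucket_customer_keys (customer_id)
SELECT DISTINCT customer_id FROM stg_orders;

-- 2. Select only the matching target rows for the reference lookup,
--    instead of extracting the whole target table.
SELECT d.*
FROM   dim_customer d
JOIN   bucket_customer_keys b ON b.customer_id = d.customer_id;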

Posted: Thu Apr 17, 2008 12:56 am
by psdhali
Hello Ray,

Imagine there are 100 tables, each table linked to say 10 or more others by referential integrity constraints. If we do it in DataStage and in the future we add one more table which links to those 10+ tables, we will need to recode / enhance all the existing interfaces.

On the other hand, if we let the database do it, we just need to sequence things properly. In any case, since we cannot use the bulk loader (other front ends will be using the database, and we cannot disable constraints, triggers etc.), we will be using the OCI stage, so assuming the data quality is not too bad, the time difference between the two approaches (database load time only) should be negligible.

With the ETL-side checking, however, development, maintenance and future changes/additions to the database will all be really costly.

Regards
Preetinder

Posted: Thu Apr 17, 2008 3:10 am
by ray.wurlod
If you are systematic about it, it's not that great an obstacle.
I've been doing exactly that for 12 years with DataStage.
And it means that, where the transformation logic differs even slightly, you don't have a problem.
Or if the rules change later, which they have a nasty habit of doing.

Posted: Thu Apr 17, 2008 3:54 am
by dsexplorer
Hi Ray,
Correct me if I am wrong: you are saying to push the referential integrity onto the database, be systematic in the way you fill the primary tables, and then go ahead with filling the child tables?

Same pitfalls:
Our POC on pushing the referential integrity checks onto the database doesn't seem to work when there are multiple foreign keys on the table. The DBMS code coming out of the Oracle table loading link is just ORA-229, without really specifying which foreign key was violated.
I can get more information from the DataStage logs, which contain the Oracle constraint names, but it would be tedious to dig through the log and attach that to the reject reason column.
Even digging much deeper into the log doesn't give me more information when there are multiple foreign keys.

2nd main concern:
We have a requirement where the client wants to process the reject data before we go forward. With our 2nd approach I will be capturing the rejects in the target format, which I can't use for the next day's processing.

Thanks
Pavan

Posted: Thu Apr 17, 2008 4:34 am
by ray.wurlod
Welcome aboard. I can only suggest that your POC is ill-designed.

Grab three hashed files from the target tables. One contains a single record, with a dummy key and the highest current key value. The second is a reverse lookup from the business key (however many columns there are in it) to the surrogate key. This hashed file can also contain the columns that will be used in change detection for slowly changing dimension management. The third is an audit utility - again keyed by the table's business key, it is initialized from the target table and added to as the ETL goes through.

You also need lookup tables (in hashed files is good) so that every possible referential integrity constraint can be checked - and rows that fail these checks can be output in any format you desire.
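
As a rough sketch of the extracts that would seed those lookups (DIM_CUSTOMER and its column names are only placeholders):

Code: Select all

-- 1. Single-row lookup: a dummy key plus the highest current surrogate key,
--    used to seed surrogate key generation.
SELECT 'MAX' AS dummy_key, NVL(MAX(cust_skey), 0) AS max_skey
FROM   dim_customer;

-- 2. Reverse lookup: business key -> surrogate key, plus the columns that
--    are compared for slowly changing dimension handling.
SELECT cust_bkey, cust_skey, cust_name, cust_segment
FROM   dim_customer;

-- 3. The audit lookup is initialized the same way, keyed by the business
--    key, and added to as the ETL run proceeds.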

Posted: Thu Apr 17, 2008 6:37 am
by chulett
dsexplorer wrote: Our POC on pushing the referential integrity checks onto the database doesn't seem to work
So are you saying that you are allowing the database to police your RI constraint issues without you making any attempt to ensure you are not violating them before the actual database operation? Bad Idea. :?

As noted, hashed reference lookups for each RI constraint are in order, such that any issues can be logically rejected / handled properly in the job before the database gets involved and upset.
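
For example (purely illustrative names again), one pass over the staged data can tag each reject with exactly which parent key is missing, which is the very information the database-side reject was not giving you:

Code: Select all

-- Each failed lookup contributes its own reason, so a row missing more
-- than one parent key reports all of them.
INSERT INTO stg_sales_reject (sale_id, reject_reason)
SELECT s.sale_id,
       CASE WHEN c.customer_id IS NULL THEN 'missing customer; ' ELSE '' END ||
       CASE WHEN p.product_id  IS NULL THEN 'missing product; '  ELSE '' END ||
       CASE WHEN t.store_id    IS NULL THEN 'missing store; '    ELSE '' END
FROM   stg_sales s
LEFT   JOIN dim_customer_keys c ON c.customer_id = s.customer_id
LEFT   JOIN dim_product_keys  p ON p.product_id  = s.product_id
LEFT   JOIN dim_store_keys    t ON t.store_id    = s.store_id
WHERE  c.customer_id IS NULL
OR     p.product_id  IS NULL
OR     t.store_id    IS NULL;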

Posted: Thu Apr 17, 2008 9:28 am
by dsexplorer
Hi Chulett,
But isn't that exactly what Ray is suggesting I do?

Code: Select all

Hi Ray,
Correct me if I am wrong: you are saying to push the referential integrity onto the database, be systematic in the way you fill the primary tables, and then go ahead with filling the child tables?
Thanks
Pavan

Posted: Thu Apr 17, 2008 10:09 am
by chulett
No, I don't believe so. I'm sure he'll clarify when he comes back around to these parts.

Posted: Thu Apr 17, 2008 2:29 pm
by ray.wurlod
I definitely advocate guaranteeing RI within the ETL tool.

Do not rely upon database errors to check RI. For one thing, it's slow.

Posted: Fri Apr 18, 2008 2:32 am
by dsexplorer
Hi Ray,

Can you explain how to make it generic? If, for example, the table has to check data in 5 other tables before loading on some key combinations, should we go to the DB dictionary to get the keys, dynamically create the SQL, then use the UniVerse stage ... and reject the missing rows?

Or do you suggest actually implementing the logic with explicit lookups for every key, and then, when we have another key, changing the ETL jobs to add that too?
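
(For illustration, this is roughly the dictionary query I imagine the generic approach would be driven by; TARGET_TABLE_NAME is just a placeholder.)

Code: Select all

-- List each foreign key on a table, the column(s) it uses and the parent
-- table it points at, from the standard Oracle dictionary views.
SELECT c.constraint_name,
       cc.column_name,
       cc.position,
       rc.table_name AS parent_table
FROM   user_constraints  c
JOIN   user_cons_columns cc ON cc.constraint_name = c.constraint_name
JOIN   user_constraints  rc ON rc.constraint_name = c.r_constraint_name
WHERE  c.constraint_type = 'R'
AND    c.table_name = 'TARGET_TABLE_NAME'
ORDER  BY c.constraint_name, cc.position;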

Regards