Approach in Filling a Relational Database
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 133
- Joined: Tue Nov 23, 2004 11:24 pm
- Location: India
Approach in Filling a Relational Database
Hi All
I am working on a ETL Project which is taking data from different sources and filling in a relational Database.
We are looking at 2 Approaches in Handling the Foreign Key constraints on the Target realtional database
1st Approach
--------------
Take the Data from the Source Table and do your business rules and dump in to the corresponding target database table. The Target Database which is oracle will reject the records which have foreign key violations. Get the DBMS Code and reject the records
2nd Approach
--------------
Take the data from the SOurce Table and Look up for the foreign key constraints on the ETL Side and only take the data forward in to the loading step which will be a clean load with no rejections.
I personally pefer the 2nd Approach as i was wondering i will have a better grip on the rejects because imagine if i have 3 foreign keys on the target table and i was wondering i will get the same code ORA-XXX if i let the database handle the rejects. on the other hand if there are 60 tables with different relations if i let database handle the foreign key constraints i will cutting up a hell lot of ETL Work.
ANy help will be really appreciated.
Thanks
Pavan
I am working on a ETL Project which is taking data from different sources and filling in a relational Database.
We are looking at 2 Approaches in Handling the Foreign Key constraints on the Target realtional database
1st Approach
--------------
Take the Data from the Source Table and do your business rules and dump in to the corresponding target database table. The Target Database which is oracle will reject the records which have foreign key violations. Get the DBMS Code and reject the records
2nd Approach
--------------
Take the data from the SOurce Table and Look up for the foreign key constraints on the ETL Side and only take the data forward in to the loading step which will be a clean load with no rejections.
I personally pefer the 2nd Approach as i was wondering i will have a better grip on the rejects because imagine if i have 3 foreign keys on the target table and i was wondering i will get the same code ORA-XXX if i let the database handle the rejects. on the other hand if there are 60 tables with different relations if i let database handle the foreign key constraints i will cutting up a hell lot of ETL Work.
ANy help will be really appreciated.
Thanks
Pavan
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Perform lookups against (local copies of) the target table. Dump the output into a staging area. That way you can run and re-run the referential integrity checking and surrogate key generation phase over and over until you get it right, and you have a good basis for building restart-recovery into your design. Have separate jobs to do the actual loading. These can be fairly "brain-dead" and therefore fast.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hey Pawan
I would prefer option 1. The records can be rejected using the reject link and that can be collected in some sequential file. which can be loaded to the database. In the second approach as the no of table increases then the job complexity increases. which will make jobs clumsy and bad performer.
Thanks,
Birendra
I would prefer option 1. The records can be rejected using the reject link and that can be collected in some sequential file. which can be loaded to the database. In the second approach as the no of table increases then the job complexity increases. which will make jobs clumsy and bad performer.
Thanks,
Birendra
Birendra
-
- Premium Member
- Posts: 133
- Joined: Tue Nov 23, 2004 11:24 pm
- Location: India
Hi Ray
Is there some reason for advocating the 2nd Approach? Isn't it a huge over head on the ETL To do all those referential integrity checking for a complex system on the ETL Side
Birendra
With the 1st Approach imagine we have 3 foreign key constraints on the table. If the client is looking for specific information like which particular key is missing? the records from the reject link won't be having that information.
Thanks
Pavan
Is there some reason for advocating the 2nd Approach? Isn't it a huge over head on the ETL To do all those referential integrity checking for a complex system on the ETL Side
Birendra
With the 1st Approach imagine we have 3 foreign key constraints on the table. If the client is looking for specific information like which particular key is missing? the records from the reject link won't be having that information.
Thanks
Pavan
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
That's part of what ETL is about. I gave the main reason in my earlier post - you retain complete control, and don't update the target table until you're ready to do so.
If your data volumes are large, you can use the "bucket file" to reduce the volume of rows needed from the target. Essentially this involves capturing the keys that occur in source and save those to a temporary ("bucket") table then, when it becomes time to select rows from the target, you do an inner join with the "bucket" table.
If your data volumes are large, you can use the "bucket file" to reduce the volume of rows needed from the target. Essentially this involves capturing the keys that occur in source and save those to a temporary ("bucket") table then, when it becomes time to select rows from the target, you do an inner join with the "bucket" table.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hello Ray,
Imagine if there are 100 tables with each table linked to say 10 or more with refrential integrity constraints. In this case if we do it in Datastage, and if in future we add one more table, which will link to those 10 + tables, we will need to recode / enhance all existing interfaces.
On other end, if we let Database do it, we just need to sequence things properly. Anyway , since we cannot use bulk loader as other front ends will be using the database (and we cannot disable contraints, triggers etc) we will be using OCI stage, so assuming data quality will not be too bad, time difference between loads with approach 1 and approach 2( only database load time) should be negligible.
However in approach 1, both development, maintenance and future changes/additions to database will be really costly.
Regards
Preetinder
Imagine if there are 100 tables with each table linked to say 10 or more with refrential integrity constraints. In this case if we do it in Datastage, and if in future we add one more table, which will link to those 10 + tables, we will need to recode / enhance all existing interfaces.
On other end, if we let Database do it, we just need to sequence things properly. Anyway , since we cannot use bulk loader as other front ends will be using the database (and we cannot disable contraints, triggers etc) we will be using OCI stage, so assuming data quality will not be too bad, time difference between loads with approach 1 and approach 2( only database load time) should be negligible.
However in approach 1, both development, maintenance and future changes/additions to database will be really costly.
Regards
Preetinder
Hello Ray,
Imagine if there are 100 tables with each table linked to say 10 or more with refrential integrity constraints. In this case if we do it in Datastage, and if in future we add one more table, which will link to those 10 + tables, we will need to recode / enhance all existing interfaces.
On other end, if we let Database do it, we just need to sequence things properly. Anyway , since we cannot use bulk loader as other front ends will be using the database (and we cannot disable contraints, triggers etc) we will be using OCI stage, so assuming data quality will not be too bad, time difference between loads with approach 1 and approach 2( only database load time) should be negligible.
However in approach 1, both development, maintenance and future changes/additions to database will be really costly.
Regards
Preetinder
Imagine if there are 100 tables with each table linked to say 10 or more with refrential integrity constraints. In this case if we do it in Datastage, and if in future we add one more table, which will link to those 10 + tables, we will need to recode / enhance all existing interfaces.
On other end, if we let Database do it, we just need to sequence things properly. Anyway , since we cannot use bulk loader as other front ends will be using the database (and we cannot disable contraints, triggers etc) we will be using OCI stage, so assuming data quality will not be too bad, time difference between loads with approach 1 and approach 2( only database load time) should be negligible.
However in approach 1, both development, maintenance and future changes/additions to database will be really costly.
Regards
Preetinder
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
If you are systematic about it, it's not that great an obstacle.
I've been doing exactly that for 12 years with DataStage.
And it means that, where the transformation logic differs even slightly, you don't have a problem.
Or if the rules change later, which they have a nasty habit of doing.
I've been doing exactly that for 12 years with DataStage.
And it means that, where the transformation logic differs even slightly, you don't have a problem.
Or if the rules change later, which they have a nasty habit of doing.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 23
- Joined: Wed Apr 16, 2008 5:34 am
- Location: Prague,Cz
Hi Ray
Correct me if i am wrong. you are saying that push the Referential Integrity on the Database and be systematic in the way you fill the Primary Tables and go ahead with filling the child tables
Same Pit falls
Our POC on pushing referential integrity checks on the database doesn't seems to be working when we are having multiple foreign keys on the table. The DBMS Code coming out from the Oracle Table Loading Link is just coming out with ORA-229 but not really specifying which foreign key is violated
I am able to get more information from the DS Logs which are having the oracle constraint names but it will be a tedious to dig the log and attach it to the reject reason column
Even me going much deeper in to the Log isn't giving me more information when there are multiple foreign keys.
2nd main concern
we have a requirment where the client wants to process the reject data before we go forward i am wondering our 2nd approach i will be capturing the rejects in the target format which i cann't use it for the next day processing
Thanks
Pavan
Correct me if i am wrong. you are saying that push the Referential Integrity on the Database and be systematic in the way you fill the Primary Tables and go ahead with filling the child tables
Same Pit falls
Our POC on pushing referential integrity checks on the database doesn't seems to be working when we are having multiple foreign keys on the table. The DBMS Code coming out from the Oracle Table Loading Link is just coming out with ORA-229 but not really specifying which foreign key is violated
I am able to get more information from the DS Logs which are having the oracle constraint names but it will be a tedious to dig the log and attach it to the reject reason column
Even me going much deeper in to the Log isn't giving me more information when there are multiple foreign keys.
2nd main concern
we have a requirment where the client wants to process the reject data before we go forward i am wondering our 2nd approach i will be capturing the rejects in the target format which i cann't use it for the next day processing
Thanks
Pavan
Regards
DSExplorer
DSExplorer
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Welcome aboard. I can only suggest that your POC is ill-designed.
Grab three hashed files from the target tables. One contains a single record, with a dummy key and the highest current key value. The second is a reverse lookup from the business key (however many columns there are in it) to the surrogate key. This hashed file can also contain the columns that will be used in change detection for slowly changing dimension management. The third is an audit utility - again keyed by the table's business key, it is initialized from the target table and added to as the ETL goes through.
You also need lookup tables (in hashed files is good) so that every possible referential integrity constraint can be checked - and rows that fail these checks can be output in any format you desire.
Grab three hashed files from the target tables. One contains a single record, with a dummy key and the highest current key value. The second is a reverse lookup from the business key (however many columns there are in it) to the surrogate key. This hashed file can also contain the columns that will be used in change detection for slowly changing dimension management. The third is an audit utility - again keyed by the table's business key, it is initialized from the target table and added to as the ETL goes through.
You also need lookup tables (in hashed files is good) so that every possible referential integrity constraint can be checked - and rows that fail these checks can be output in any format you desire.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
So are you saying that you are allowing the database to police your RI constraint issues without you making any attempt to ensure you are not violating them before the actual database operation? Bad Idea.dsexplorer wrote:Our POC on pushing referential integrity checks on the database doesn't seems to be working
As noted, hashed reference lookups for each RI constraint is in order, such that any issues can be logically rejected / handled properly in the job before the database gets involved and upset.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 23
- Joined: Wed Apr 16, 2008 5:34 am
- Location: Prague,Cz
Hi Chulet
But isn't that exactly what Ray is suggesting me to do
Thanks
Pavan
But isn't that exactly what Ray is suggesting me to do
Code: Select all
Hi Ray
Correct me if i am wrong. you are saying that push the Referential Integrity on the Database and be systematic in the way you fill the Primary Tables and go ahead with filling the child tables
Pavan
Regards
DSExplorer
DSExplorer
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Premium Member
- Posts: 23
- Joined: Wed Apr 16, 2008 5:34 am
- Location: Prague,Cz
Hi Ray,
Can you explain how to make it generic? If for eg the table has to check 5 other tables data before load on some key combinations, should we go to DB dictionary to get the keys, dynamically create sql, then use Universe stage .... and reject missing .
Or do you suggest to actually implement all logic with actually implementing lookups for every key and then when we have another key to change ETL jobs to add that too?
Regards
Can you explain how to make it generic? If for eg the table has to check 5 other tables data before load on some key combinations, should we go to DB dictionary to get the keys, dynamically create sql, then use Universe stage .... and reject missing .
Or do you suggest to actually implement all logic with actually implementing lookups for every key and then when we have another key to change ETL jobs to add that too?
Regards
Regards
DSExplorer
DSExplorer