ETL auditing
Moderators: chulett, rschirm, roy
ETL auditing
Hi,
Just wondering how you implement ETL auditing in your company? In our company we have some high-level auditing, like how many input rows, how many output rows and how many rejected rows. Each table also has DW_UPDATE_BATCH_KEY and DW_CREATE_BATCH_KEY auditing fields. But the business wants more detailed auditing: they want to know where each individual input row ends up.
Can someone shed some light on that?
Thanks a lot,
Carol
Re: ETL auditing
yiminghu wrote:But the business wants more detailed auditing: they want to know where each individual input row ends up.
Can you shed some light on exactly what this might mean - "where each individual row ends up"?
-craig
"You can never have too many knives" -- Logan Nine Fingers
Usually you have the originating key information, or at least a unique identifier, stored with the data. In the case of surrogate key assignment, this is an absolute minimum. As far as auditing goes, you can perform reconciliations to show that all source rows are loaded, via count comparisons of natural keys in the target against the source keys, but that can be tricky if there are sophisticated rules for which rows are allowed to load into the target. You'd have to state more precisely what your company is trying to determine for a more focused response.
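A minimal sketch of that kind of natural-key reconciliation, using SQLite and made-up table names (src_customers / customer_dim are illustrative, not from the thread): an anti-join lists every source key that never arrived in the target.

```python
import sqlite3

# Hypothetical source and target tables, both keyed by a natural key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_dim  (customer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src_customers VALUES (1,'Ann'),(2,'Bob'),(3,'Cho');
    INSERT INTO customer_dim  VALUES (1,'Ann'),(2,'Bob');  -- row 3 never loaded
""")

# Reconciliation: source natural keys with no matching target row.
missing = con.execute("""
    SELECT s.customer_id
    FROM src_customers s
    LEFT JOIN customer_dim t ON t.customer_id = s.customer_id
    WHERE t.customer_id IS NULL
""").fetchall()
print(missing)  # -> [(3,)]
```

If the target applies filter rules, the same anti-join would need those rules folded into the WHERE clause, which is where this gets tricky.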
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
A simple audit should compare numbers from the source to the target warehouse: count the rows, or sum some numeric column, and compare the result back to the source.
AGGREGATE TOTALS:
1. Did all the rows in the source end up in the target?
2. Do the columns in the fact table total the same to the source?
3. In aggregate do you match?
Do simple counts of rows in the target compared to row counts of source tables. This is valid for dimension tables. The sums should validate the fact tables.
DETAIL TOTALS:
1. Sum last month's transactions for a single customer. Do they match?
2. If you are loading on a daily or incremental basis, then total all transactions in the source for yesterday and compare to today's loaded records.
3. Do the same for one dimension record for each dimension table.
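The detail totals above can be sketched like this, assuming SQLite and hypothetical src_invoices / invoice_fact tables: sum yesterday's source transactions and compare against what actually landed in the fact table.

```python
import sqlite3

# Hypothetical source and target; column names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_invoices (invoice_date TEXT, amount REAL);
    CREATE TABLE invoice_fact (invoice_date_key TEXT, invoice_amount REAL);
    INSERT INTO src_invoices VALUES ('2006-01-01', 100.0), ('2006-01-01', 50.0),
                                    ('2006-01-02', 75.0);
    INSERT INTO invoice_fact VALUES ('2006-01-01', 100.0), ('2006-01-01', 50.0);
""")

day = '2006-01-01'   # "yesterday" for an incremental load
src_total = con.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM src_invoices WHERE invoice_date = ?",
    (day,)).fetchone()[0]
tgt_total = con.execute(
    "SELECT COALESCE(SUM(invoice_amount), 0) FROM invoice_fact"
    " WHERE invoice_date_key = ?", (day,)).fetchone()[0]
print(src_total, tgt_total)  # -> 150.0 150.0
```

A mismatch between the two totals flags the load for investigation before anyone looks at individual rows.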
If you do not know that your data is accurate at an aggregate level, then any detail-level validation is worthless. Joining a target table to a source table and seeing that a specific column matches proves nothing except that one column is accurate for that specific set of records. If you are accurate at the global or aggregate level, then the detail level is probably accurate as well.
DO NOT USE THE ETL TO VALIDATE THE ETL.
Extracting data into a staging table and then comparing the target to the staging table proves nothing. The original extract which built the staging table could itself be wrong.
Simple checks:
1. Select count(*) from Customers
(write the result into:
ETL_AUDIT.Source_Target = "S"
ETL_AUDIT.Target_Table = "CUSTOMER_DIM"
ETL_AUDIT.MEASURE = count(*))
2. Select count(*) from Customer_Dim
(write the result into:
ETL_AUDIT.Source_Target = "T"
ETL_AUDIT.Target_Table = "CUSTOMER_DIM"
ETL_AUDIT.MEASURE = count(*))
3. Do this for every dimension.
Calculate the completeness or row count of target divided by row count for source for each table.
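A runnable sketch of the ETL_AUDIT pattern described above, using SQLite (table and column names follow the post's pseudocode, lowercased; they are illustrative, not the actual EtlStats DDL): record the "S" and "T" counts, then derive completeness as target count divided by source count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers    (id INTEGER PRIMARY KEY);   -- source
    CREATE TABLE customer_dim (id INTEGER PRIMARY KEY);   -- target
    CREATE TABLE etl_audit (source_target TEXT, target_table TEXT, measure REAL);
    INSERT INTO customers    VALUES (1),(2),(3),(4);
    INSERT INTO customer_dim VALUES (1),(2),(3);          -- one row short
""")

# Write one audit row per side, exactly as the pseudocode describes.
for flag, table in (("S", "customers"), ("T", "customer_dim")):
    n = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    con.execute("INSERT INTO etl_audit VALUES (?, 'CUSTOMER_DIM', ?)", (flag, n))

# Completeness = target rows / source rows for that table.
s, t = (con.execute(
    "SELECT measure FROM etl_audit"
    " WHERE source_target = ? AND target_table = 'CUSTOMER_DIM'",
    (flag,)).fetchone()[0] for flag in ("S", "T"))
completeness = t / s
print(completeness)  # -> 0.75
```

Anything below 1.0 (or above it, for duplicated loads) is a table that needs a closer look.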
Fact tables are more complicated to validate or QA.
1.
Code: Select all
Select sum(Invoice_Amt) from Invoices
(write the result into:
ETL_AUDIT.Source_Target = "S"
ETL_AUDIT.Target_Table = "INVOICE_FACT"
ETL_AUDIT.MEASURE = sum(Invoice_Amt))
2.
Code: Select all
Select sum(Invoice_Amount) from Invoice_Fact
(write the result into:
ETL_AUDIT.Source_Target = "T"
ETL_AUDIT.Target_Table = "INVOICE_FACT"
ETL_AUDIT.MEASURE = sum(Invoice_Amount))
All of this is a part of EtlStats. ETL_AUDIT is called ETL_QA_RESULT. The SQL to calculate each measure is stored in ETL_QA_SQL.SQL_TO_RUN. The completeness report is stored in ETL_QA_RESULT_MAP. The jobs are included. The DDL for each table is included. This is all functional and working at my last customer and will soon be working at my current customer.
Last edited by kduke on Wed May 10, 2006 4:52 pm, edited 1 time in total.
Mamu Kim
I think what not to do here is more important than what to do.
This will only validate the data that actually landed in the target. If you have other validation scripts then this is useful.
NEVER VALIDATE AGAINST A STAGING TABLE:
This might be valid in the middle of an ETL run only. Never use this to validate the data warehouse to end users. You cannot validate the ETL using something created by the ETL.
You cannot use yourself to validate yourself. I always agree with myself.
Using the ETL, or tables created by the ETL, to validate the data warehouse is validating yourself with yourself. Treat QA as a black box: what if there were no ETL? How would you prove that the fact table reports match the source reports? I think it is absolutely valid to duplicate most source reports in the data warehouse, especially in an ODS. Most of us create ODS tables as well as data warehouse tables. Why not offload all reporting? Obsolete all those source reports by being proactive and end-user sensitive.
Always separate these deliverables. Reports should not start until target tables are created. Target tables should be created from report requirements. Source-to-target documents should be created from these requirements. All of these need to work together, and all of them should be designed together, but the starting point of one is the ending point of another: you need data in the target in order to run a report, so stagger the deliverables accordingly. Make sure that if a report needs a column then it gets loaded in the ETL. A data warehouse which starts out with no reporting requirements is usually forced on end users, and its success rate goes way down. It is a solution looking for a problem to solve; it does not meet a business need. "Every business needs a data warehouse, so we built one" - I don't spend my own money on something I do not need, so why should you spend your company's money on something it does not need? I am sure there really is a need. Go find it, make sure the need gets met, and keep the end users involved.
Code: Select all
-- Validate the fact table directly against the source system over a
-- database link (Oracle-style table@dblink syntax assumed here).
Select
InvoiceDim.InvoiceId
From
InvoiceFact,
InvoiceDim,
Invoices@SourceLink Invoices
Where
InvoiceFact.InvoiceKey = InvoiceDim.InvoiceKey
and InvoiceDim.InvoiceId = Invoices.InvoiceId
and InvoiceFact.InvoiceAmount <> Invoices.InvoiceAmt
and InvoiceFact.InvoiceDateKey = '20060101'
NEVER VALIDATE AGAINST A STAGING TABLE:
Code: Select all
-- The anti-pattern: comparing the target to a staging table the ETL
-- itself populated proves nothing about the original extract.
Select
InvoiceDim.InvoiceId
From
InvoiceFact,
InvoiceDim,
Staging.Invoices Invoices
Where
InvoiceFact.InvoiceKey = InvoiceDim.InvoiceKey
and InvoiceDim.InvoiceId = Invoices.InvoiceId
and InvoiceFact.InvoiceAmount <> Invoices.InvoiceAmt
and InvoiceFact.InvoiceDateKey = '20060101'
Mamu Kim
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Think MetaStage. It captures process metadata automatically and comes (these days) with some canned reports. It can also do Lineage and Where Used analyses across the entire enterprise.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Short of carrying source system, source file/table and source key/line information through to the target, I can't think of any convenient way to provide the source detail that Carol's end users appear to want.
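That idea - carrying source lineage columns through to the target - might look like this, sketched with SQLite and hypothetical column names (src_system, src_file, src_line are illustrative): every target row records exactly where it came from, so any row can be traced back to its originating extract.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,
        customer_id  TEXT,
        src_system   TEXT,    -- originating source system, e.g. 'CRM'
        src_file     TEXT,    -- extract file or source table name
        src_line     INTEGER  -- row number within that extract
    )
""")
# The ETL would populate the lineage columns as it loads each row.
con.execute("INSERT INTO customer_dim VALUES"
            " (1, 'C-100', 'CRM', 'cust_20060110.dat', 42)")

# Tracing any target row back to its input row is then a simple lookup.
row = con.execute(
    "SELECT src_system, src_file, src_line FROM customer_dim"
    " WHERE customer_id = 'C-100'").fetchone()
print(row)  # -> ('CRM', 'cust_20060110.dat', 42)
```

The cost is a few extra columns per table and a little extra work in each job, which is usually a fair trade when the business asks "where did this row come from?".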
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.