ETL auditing
Moderators: chulett, rschirm, roy
ETL auditing
Hi,
Just wondering how you implement ETL auditing in your company? In our company we have some high-level auditing, like how many input rows, how many output rows and how many rejected rows. Each table also has DW_UPDATE_BATCH_KEY and DW_CREATE_BATCH_KEY auditing fields. But the business wants more detailed auditing: they want to know where each individual input row ends up.
Can someone shed some light on that?
Thanks a lot,
Carol
Re: ETL auditing
yiminghu wrote:But the business wants more detailed auditing: they want to know where each individual input row ends up.
Can you shed some light on exactly what this might mean - "where each individual row ends up"?
-craig
"You can never have too many knives" -- Logan Nine Fingers
Usually you have the originating key information, or at least a unique identifier, stored with the data. In the case of surrogate key assignment, this is an absolute minimum. As far as auditing goes, you can perform reconciliations to show that all source rows are loaded, via count comparisons of natural keys in the target against the source keys, but that can be tricky if there are sophisticated rules for which rows are allowed to load into the target. You'd have to state more precisely what your company is trying to determine for a more focused response.
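A minimal sketch of that kind of natural-key reconciliation, using SQLite and made-up table names (src_customers / customer_dim are illustrative, not from the thread): an anti-join lists every source key that never arrived in the target.

```python
import sqlite3

# Hypothetical source and target tables, both keyed by a natural key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_dim  (customer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src_customers VALUES (1,'Ann'),(2,'Bob'),(3,'Cho');
    INSERT INTO customer_dim  VALUES (1,'Ann'),(2,'Bob');  -- row 3 never loaded
""")

# Reconciliation: source natural keys with no matching target row.
missing = con.execute("""
    SELECT s.customer_id
    FROM src_customers s
    LEFT JOIN customer_dim t ON t.customer_id = s.customer_id
    WHERE t.customer_id IS NULL
""").fetchall()
print(missing)  # -> [(3,)]
```

If the target applies filter rules, the same anti-join would need those rules folded into the WHERE clause, which is where this gets tricky.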
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
A simple audit should compare numbers from the source to the target warehouse: count the rows, or sum some numeric column, and compare the result back to the source.
AGGREGATE TOTALS:
1. Did all the rows in the source end up in the target?
2. Do the columns in the fact table total the same to the source?
3. In aggregate do you match?
Do simple counts of rows in the target compared to row counts of source tables. This is valid for dimension tables. The sums should validate the fact tables.
DETAIL TOTALS:
1. Sum last month's transactions for a single customer. Do they match?
2. If you are loading on a daily or incremental basis, then total all transactions in the source for yesterday and compare to today's loaded records.
3. Do the same for one dimension record for each dimension table.
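The detail totals above can be sketched like this, assuming SQLite and hypothetical src_invoices / invoice_fact tables: sum yesterday's source transactions and compare against what actually landed in the fact table.

```python
import sqlite3

# Hypothetical source and target; column names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_invoices (invoice_date TEXT, amount REAL);
    CREATE TABLE invoice_fact (invoice_date_key TEXT, invoice_amount REAL);
    INSERT INTO src_invoices VALUES ('2006-01-01', 100.0), ('2006-01-01', 50.0),
                                    ('2006-01-02', 75.0);
    INSERT INTO invoice_fact VALUES ('2006-01-01', 100.0), ('2006-01-01', 50.0);
""")

day = '2006-01-01'   # "yesterday" for an incremental load
src_total = con.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM src_invoices WHERE invoice_date = ?",
    (day,)).fetchone()[0]
tgt_total = con.execute(
    "SELECT COALESCE(SUM(invoice_amount), 0) FROM invoice_fact"
    " WHERE invoice_date_key = ?", (day,)).fetchone()[0]
print(src_total, tgt_total)  # -> 150.0 150.0
```

A mismatch between the two totals flags the load for investigation before anyone looks at individual rows.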
If you do not know that your data is accurate at an aggregate level, then any detail-level validation is worthless. Joining a target table to a source table and seeing that a specific column matches proves nothing except that one column is accurate for that specific set of records. If you are accurate at the global or aggregate level, then the detail level is probably accurate as well.
DO NOT USE THE ETL TO VALIDATE THE ETL.
Extracting data into a staging table and then comparing the target to the staging table proves nothing. The original extract which built the staging table could itself be wrong.
Simple checks:
1. Select count(*) from Customers
(write the result into:
ETL_AUDIT.Source_Target = "S"
ETL_AUDIT.Target_Table = "CUSTOMER_DIM"
ETL_AUDIT.MEASURE = count(*))
2. Select count(*) from Customer_Dim
(write the result into:
ETL_AUDIT.Source_Target = "T"
ETL_AUDIT.Target_Table = "CUSTOMER_DIM"
ETL_AUDIT.MEASURE = count(*))
3. Do this for every dimension.
Calculate the completeness or row count of target divided by row count for source for each table.
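A runnable sketch of the ETL_AUDIT pattern described above, using SQLite (table and column names follow the post's pseudocode, lowercased; they are illustrative, not the actual EtlStats DDL): record the "S" and "T" counts, then derive completeness as target count divided by source count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers    (id INTEGER PRIMARY KEY);   -- source
    CREATE TABLE customer_dim (id INTEGER PRIMARY KEY);   -- target
    CREATE TABLE etl_audit (source_target TEXT, target_table TEXT, measure REAL);
    INSERT INTO customers    VALUES (1),(2),(3),(4);
    INSERT INTO customer_dim VALUES (1),(2),(3);          -- one row short
""")

# Write one audit row per side, exactly as the pseudocode describes.
for flag, table in (("S", "customers"), ("T", "customer_dim")):
    n = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    con.execute("INSERT INTO etl_audit VALUES (?, 'CUSTOMER_DIM', ?)", (flag, n))

# Completeness = target rows / source rows for that table.
s, t = (con.execute(
    "SELECT measure FROM etl_audit"
    " WHERE source_target = ? AND target_table = 'CUSTOMER_DIM'",
    (flag,)).fetchone()[0] for flag in ("S", "T"))
completeness = t / s
print(completeness)  # -> 0.75
```

Anything below 1.0 (or above it, for duplicated loads) is a table that needs a closer look.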
Fact tables are more complicated to validate or QA.
1.
Code: Select all
Select sum(Invoice_Amt) from Invoices
(write the result into:
ETL_AUDIT.Source_Target = "S"
ETL_AUDIT.Target_Table = "INVOICE_FACT"
ETL_AUDIT.MEASURE = sum(Invoice_Amt))
2.
Code: Select all
Select sum(Invoice_Amount) from Invoice_Fact
(write the result into:
ETL_AUDIT.Source_Target = "T"
ETL_AUDIT.Target_Table = "INVOICE_FACT"
ETL_AUDIT.MEASURE = sum(Invoice_Amount))
All of this is a part of EtlStats. ETL_AUDIT is called ETL_QA_RESULT. The SQL to calculate each measure is stored in ETL_QA_SQL.SQL_TO_RUN. The completeness report is stored in ETL_QA_RESULT_MAP. The jobs are included. The DDL for each table is included. This is all functional and working at my last customer and will soon be working at my current customer.
Last edited by kduke on Wed May 10, 2006 4:52 pm, edited 1 time in total.
Mamu Kim
I think what not to do here is more important than what to do.
This will only validate the data that actually landed in the target. If you have other validation scripts then this is useful.
NEVER VALIDATE AGAINST A STAGING TABLE:
This might be valid in the middle of an ETL run only. Never use this to validate the data warehouse to end users. You cannot validate the ETL using something created by the ETL.
You cannot use yourself to validate yourself. I always agree with myself.
Using the ETL, or tables created by the ETL, to validate the data warehouse is validating yourself with yourself. Treat QA as a black box: what if there were no ETL? How would you prove that the fact table reports match the source reports? I think it is absolutely valid to duplicate most source reports in the data warehouse, especially in an ODS. Most of us create ODS tables as well as data warehouse tables. Why not offload all reporting? Obsolete all those source reports by being proactive and end-user sensitive.
Always separate these deliverables. Reports should not start until target tables are created. Target tables should be created from report requirements. Source-to-target documents should be created from these requirements. All of these need to work together, and all of them should be designed together, but the starting point of one is the ending point of another: you need data in the target in order to run a report, so stagger the deliverables accordingly. Make sure that if a report needs a column then it gets loaded in the ETL. A data warehouse which starts out with no reporting requirements is usually forced on end users, and its success rate goes way down. It is a solution looking for a problem to solve; it does not meet a business need. "Every business needs a data warehouse, so we built one" - I don't spend my own money on something I do not need, so why should you spend your company's money on something it does not need? I am sure there really is a need. Go find it, make sure the need gets met, and keep the end users involved.
Code: Select all
-- Validate the fact table directly against the source system over a
-- database link (Oracle-style table@dblink syntax assumed here).
Select
InvoiceDim.InvoiceId
From
InvoiceFact,
InvoiceDim,
Invoices@SourceLink Invoices
Where
InvoiceFact.InvoiceKey = InvoiceDim.InvoiceKey
and InvoiceDim.InvoiceId = Invoices.InvoiceId
and InvoiceFact.InvoiceAmount <> Invoices.InvoiceAmt
and InvoiceFact.InvoiceDateKey = '20060101'
NEVER VALIDATE AGAINST A STAGING TABLE:
Code: Select all
-- The anti-pattern: comparing the target to a staging table the ETL
-- itself populated proves nothing about the original extract.
Select
InvoiceDim.InvoiceId
From
InvoiceFact,
InvoiceDim,
Staging.Invoices Invoices
Where
InvoiceFact.InvoiceKey = InvoiceDim.InvoiceKey
and InvoiceDim.InvoiceId = Invoices.InvoiceId
and InvoiceFact.InvoiceAmount <> Invoices.InvoiceAmt
and InvoiceFact.InvoiceDateKey = '20060101'
Mamu Kim
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Think MetaStage. It captures process metadata automatically and comes (these days) with some canned reports. It can also do Lineage and Where Used analyses across the entire enterprise.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Short of carrying source system, source file/table and source key/line information through to the target, I can't think of any convenient way to provide the source detail that Carol's end users appear to want.
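That idea - carrying source lineage columns through to the target - might look like this, sketched with SQLite and hypothetical column names (src_system, src_file, src_line are illustrative): every target row records exactly where it came from, so any row can be traced back to its originating extract.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,
        customer_id  TEXT,
        src_system   TEXT,    -- originating source system, e.g. 'CRM'
        src_file     TEXT,    -- extract file or source table name
        src_line     INTEGER  -- row number within that extract
    )
""")
# The ETL would populate the lineage columns as it loads each row.
con.execute("INSERT INTO customer_dim VALUES"
            " (1, 'C-100', 'CRM', 'cust_20060110.dat', 42)")

# Tracing any target row back to its input row is then a simple lookup.
row = con.execute(
    "SELECT src_system, src_file, src_line FROM customer_dim"
    " WHERE customer_id = 'C-100'").fetchone()
print(row)  # -> ('CRM', 'cust_20060110.dat', 42)
```

The cost is a few extra columns per table and a little extra work in each job, which is usually a fair trade when the business asks "where did this row come from?".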
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.