Before or after??
What are the advantages or consequences of addressing the data quality issues after loading the data into the warehouse?
What would be the ideal approach for a banking domain?
Thanks.
Regards,
Madhu Dharmapuri
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
After has no real advantages for the ETL process. But that should not preclude ongoing data quality audits of the DW to ensure that processes are not out of control.
It is much better to address data quality issues before or during ETL. That way you are much more likely to load "good" data and handle fewer rejects.
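To illustrate the "before or during" approach in general terms, here is a minimal sketch (not from this thread, and not DataStage-specific; the field names and rules are hypothetical) of validating rows before the load step, so bad rows are trapped as rejects rather than cleansed in the warehouse afterwards:

```python
# Minimal pre-load validation sketch: reject bad rows before they
# reach the warehouse load step. Field names and rules are hypothetical.

def validate(row):
    """Return a list of rule violations for one source row."""
    errors = []
    if not row.get("account_id"):
        errors.append("missing account_id")
    if row.get("amount") is None or row["amount"] < 0:
        errors.append("amount is null or negative")
    return errors

def split_rows(rows):
    """Partition rows into loadable rows and rejects with reasons."""
    good, rejects = [], []
    for row in rows:
        errors = validate(row)
        if errors:
            rejects.append((row, errors))
        else:
            good.append(row)
    return good, rejects

rows = [
    {"account_id": "A1", "amount": 100.0},
    {"account_id": "", "amount": 50.0},
    {"account_id": "A3", "amount": -5.0},
]
good, rejects = split_rows(rows)
print(len(good), len(rejects))  # 1 good row, 2 rejects
```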
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hi,
If your checking looks at whether the beneficiary's name has changed and updates on that basis, then cleansing the names after the ETL will cause them to pop up as updates when in fact they may not be (the same will happen on every field you cleanse that gets changed).
So I think doing it before the ETL is a better option.
IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.
Search before posting:)
Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
You can do data quality checking before, during and after ETL load. The most effective data quality checking up front is profiling, with ProfileStage being the obvious choice for a DataStage project and the upcoming releases making it even better.
In a banking domain, where your amounts must tally, profiling tells you which data you are going to have trouble loading, which amount fields have negative or null values, and so on. A little profiling up front can save you a lot of pain later.
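The kind of up-front profiling described here can be sketched in plain Python (ProfileStage does far more than this; the column names below are hypothetical):

```python
# Tiny profiling sketch: for each column, count nulls/empties and
# (for numeric columns) negative values, before any load is attempted.
# Column names are hypothetical; a real profiler covers much more.

def profile(rows, columns):
    stats = {c: {"nulls": 0, "negatives": 0} for c in columns}
    for row in rows:
        for c in columns:
            v = row.get(c)
            if v is None or v == "":
                stats[c]["nulls"] += 1
            elif isinstance(v, (int, float)) and v < 0:
                stats[c]["negatives"] += 1
    return stats

rows = [
    {"balance": 120.0, "branch": "SYD"},
    {"balance": None,  "branch": "MEL"},
    {"balance": -30.0, "branch": ""},
]
print(profile(rows, ["balance", "branch"]))
```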
In Dune the motto was "the spice must flow"; in BI loads the motto is "the data must flow", so you are often letting through data that is missing referential integrity and needs to be repaired (via augmentation), or you are defaulting empty fields, making up descriptions, and so on.
I place each row of data in four categories:
- Processed without issue.
- Processed but failing one or more business rules.
- Rejected failing one or more business rules.
- Rejected by the architecture.
In-transit data quality checking can count these four categories and turn them into metrics. The last category can lead to row leakage, where transformer rejects or database rejects are silently dropped; you need good reject handling to make sure you trap them. Our business rules are usually written in a transformer and can be set to reject or continue on a rule-by-rule basis.
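The four-category tally can be sketched as follows (plain Python, not DataStage; the rules and their reject/continue flags are hypothetical):

```python
from collections import Counter

# Sketch of the four-category row tally: each rule is a predicate that
# returns True when it FAILS, plus a flag saying whether a failure
# rejects the row or lets it continue. Rules here are hypothetical.

RULES = [
    # (name, fails_predicate, reject_on_failure)
    ("negative_amount", lambda r: r.get("amount", 0) < 0, False),
    ("missing_customer", lambda r: not r.get("customer_id"), True),
]

def categorise(row):
    failed = [(name, reject) for name, pred, reject in RULES if pred(row)]
    if not failed:
        return "processed_ok"
    if any(reject for _, reject in failed):
        return "rejected_business_rule"
    return "processed_with_issues"

def tally(rows):
    # Rows dropped by the architecture (e.g. database load errors)
    # would be captured separately as "rejected_architecture".
    return Counter(categorise(r) for r in rows)

rows = [
    {"customer_id": "C1", "amount": 10},
    {"customer_id": "C2", "amount": -5},
    {"customer_id": "",   "amount": 20},
]
print(tally(rows))
```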
After-load data quality checking can measure what percentage of your data has known data quality issues such as augmentation and defaulted fields. A good data analyst can also, as an ongoing exercise, proactively find issues that were missed during ETL testing.
Another form of after-load data quality is quality assurance auditing on row counts. Kim Duke has done some work in this area, and his QA table approach is included in the etlstats download.
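The row-count audit idea can be sketched like this (a generic illustration in the spirit of that approach, not the actual etlstats code):

```python
# Row-count QA audit sketch: every source row should be accounted for
# as either loaded or rejected; anything else is row leakage.

def audit(source_count, loaded_count, rejected_count, tolerance=0):
    """Check that loaded + rejected rows reconcile with the source."""
    leakage = source_count - (loaded_count + rejected_count)
    status = "OK" if abs(leakage) <= tolerance else "LEAKAGE"
    return {"source": source_count, "loaded": loaded_count,
            "rejected": rejected_count, "leakage": leakage,
            "status": status}

print(audit(1000, 990, 10))   # balanced: status OK
print(audit(1000, 985, 10))   # 5 rows unaccounted for: LEAKAGE
```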
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
How many DWs can tell you the quality of the quantity?
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
I really should add to Vincent's more serious post that a tool such as QualityStage can be an immense asset in standardising the form of data, which is part of cleansing. You can develop separately and then integrate into DataStage job streams, or not, depending on how you want to process. Madhu, you will remember that some of this was done at your gig in Australia late last year and early this year.
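QualityStage is a proprietary tool, but the standardisation concept it applies can be sketched in plain Python (the abbreviation map is a made-up toy; this only illustrates the idea, not how QualityStage works internally):

```python
import re

# Toy name-standardisation sketch: upper-case, strip punctuation,
# collapse whitespace, expand a few common abbreviations. The
# abbreviation map is hypothetical and purely illustrative.

ABBREVIATIONS = {"ST": "STREET", "RD": "ROAD", "PTY": "PROPRIETARY"}

def standardise(value):
    value = re.sub(r"[^\w\s]", " ", value.upper())
    tokens = [ABBREVIATIONS.get(t, t) for t in value.split()]
    return " ".join(tokens)

print(standardise("Smith & Co. Pty, 12 High St"))
# SMITH CO PROPRIETARY 12 HIGH STREET
```

Standardising both sides to a common form like this is what lets matching and change detection work on the content of a value rather than its formatting.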
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thanks to all for the inputs.
From my side, I always believe that prevention is better than cure, and that is what most of you have suggested. I also think that junk data should reside in a system for the least amount of time: the longer it stays, the costlier it becomes to handle. Ray, I do remember those DQ issues that were hogging the show for exactly these reasons. And as part of the cure, I believe a post-load DQ operation of much lower intensity also has to be in place, at least for a while, until things settle down.
I could not say much here about what post-load cleansing involves, as I am not clear on how, or why, anyone would choose to handle the whole lot of quality issues after the load in the first place.
However, my views have also been conveyed, and the discussion has generated quite an interest in developing new insight into how to approach data quality issues.
Regards,
Madhu Dharmapuri