Page 1 of 1

data cleansing

Posted: Wed Jan 19, 2005 11:35 am
by harithay
i am new to datastage ,

i know defintion of data cleansing, but how we perfrom this in real time datastage projects, can anybody explain with some examples?

Posted: Wed Jan 19, 2005 11:50 am
by Alexandre
You don't have data cleaning facility in Datastage, for this, take a look at QualityStage...

But, in Datastage you can do "Data Quality", ie. analysis of your data in order to produce reports (unicity, count, etc. ...) and apply business rules (in transformer, based on routines or using reference) and eventually take decision based on those results...

Anyway, it is just data transformation, like the T in ETL...

Posted: Wed Jan 19, 2005 1:43 pm
by kommven
I can give you a simple definition for Data Cleansing. Its like Querying your DBMS with select col names so that you wont get unwanted data.

Data Cleansing it at various levels at the least grain size is at row which can be achvd. by using simple Transformer job.
At the highest level is at an aggregated grain size which is possible by using a aggregator.

Even in Source or target stage also you can specific the specific columns you are interested in.

I appreciate someone, comment on this.

Kommu

Posted: Wed Jan 19, 2005 3:59 pm
by vmcburney
Most data cleansing in DataStage is done through a transformer or a QualityStage plugin.

In a transformer cleansing is performed through BASIC code, constraints or lookups.
- BASIC code lets you convert the data, for example trimming blanks, checking a text field for a valid date and converting it to internal date format, removing invalid characters etc. Almost any cleansing business rule can be written in BASIC code.
- Constraints let you filter invalid data and send it to a reject file. If BASIC cleansing code finds a problem a constraint can turn it into a reject row.
- Lookups can be used to validate codes or foreign key relationships.
- QualityStage is the best option for text parsing such as breaking down a single address string, or breaking names into first name and surname. There are a huge number of cleansing functions in this product. By using the DataStage plugin you can retrieve clean or rejected records out of the QualityStage job back into the DataStage job for further processing.

Posted: Thu Jan 20, 2005 3:15 am
by Sainath.Srinivasan
To keep it short, DataStage 'transforms' (i.e. changes) the data as per your need. So if you already know a data quality problem, you can leave, rectify or escalate it using DataStage.

But it is not a tool to 'identify' any unknown quality problem. To do that, you need some other tool such as QualityStage or plain SQL.

In the presence of low quality data, DataStage will either return garbage or fail, depending on how the jobs are designed. Hence you may be better off performing simple quality checks in the source system periodically to analyse any possible quality defects and confirm that meets the quality standards.

That is why taking periodic backups are important in case you miss something.

Posted: Thu Jan 20, 2005 3:53 am
by ray.wurlod
Even better for identifying data problems are ProfileStage (formerly known as MetaRecon) or AuditStage (formerly known as Quality Manager).

This is exactly what those tools do.

Posted: Thu Jan 20, 2005 4:17 am
by roy
Hi,
Well Profile Stage only gives you a data statistics of actual data present in your data sources with no change options.
Quality Stage on the other hand will let you Identify look alike, soundex and pattern matching of data which is not equal but might represent the same data and standardise then, in the process you might even deduplicate multiple entries of the same actual data even if not written in the same mannor.
they say Quality Stage has around 75-80% (If my memory serves me)standardisation out of the box or in a short time of development (around 2 weeks) depending on if (if you'll need localization) and how extensive will the localization process take if not implemented already.

IHTH,

Posted: Thu Jan 20, 2005 4:45 am
by Sainath.Srinivasan
Yep. I mistyped quality manager into quality stage. So to rephrase - you need quality manager or metarecon.

The tool integrity is more to reformat any flat file to bring to a general understanding and pattern.