Data Cleansing/Business Rules


Post by ray.wurlod »

It's very hard to provide generic answers without knowing something about what you are trying to do. One thing to consider is using Quality Manager to perform an initial audit of data quality so that, even though you can't actually change the database, you at least get a "scientific" measure of how bad the data quality is.
That said, for the rest the answer is to do things as efficiently as possible. Make as much use as possible of in-line expressions (in Transformer stages, or in Transforms), and use optimally efficient coding techniques when you are forced to create Routines. In the main, this means not doing anything you don't have to do (such as extraneous file opens), using more efficient rather than less efficient BASIC statements, and keeping as much as possible in memory for as long as possible (see COMMON in the BASIC manual, for example).
Let me pre-empt your next question. There is no published list of "more efficient rather than less efficient BASIC statements", mainly because what is most efficient will depend to some extent on the context in which it is used.
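To illustrate the COMMON technique, here is a minimal sketch of a routine that caches a file handle between calls. The hashed file name CUSTOMER_LOOKUP, the argument name Arg1 and the named COMMON block are hypothetical, so adapt them to your own job:

      * Keep the file handle in named COMMON so the file is opened only once
      * per session, rather than on every call to the routine.
      COMMON /LookupCache/ Initialized, LookupFile

      Ans = ""
      If NOT(Initialized) Then
         Open "CUSTOMER_LOOKUP" To LookupFile Then
            Initialized = @TRUE
         End Else
            Ans = -1            ;* could not open the lookup file
         End
      End

      If Initialized Then
         Read Rec From LookupFile, Arg1 Then
            Ans = Rec<1>        ;* return the first field of the record
         End Else
            Ans = ""            ;* key not found
         End
      End

The same idea applies to anything else that is expensive to set up (database connections, reference data loaded into memory): initialise it once, keep it in COMMON, and reuse it for the life of the job.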

Post by vmcburney »

Have you looked at the Ascential Integrity product? While Quality Manager is good at locating and reporting on data quality problems, the Integrity tool can both locate and clean them.

Since you are using text files as a source, you will also benefit from the Integrity product's ability to process text fields such as addresses and phone numbers.

If you process your files sequentially, you could save time by running an Integrity cleanse in parallel with DataStage processing, e.g. cleansing the second file while the first is being loaded.

Post by WoMaWil »

One way to keep your routines and still make the process a bit faster is to write a dedicated job, or part of a job, that takes a flat file as input and writes a flat file as output.

And don't forget to re-engineer your routines the way Ray Wurlod suggests.