Improving Performance

Post questions here related to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

yiminghu
Participant
Posts: 91
Joined: Fri Nov 14, 2003 10:09 am

Improving Performance

Post by yiminghu »

I was assigned a task to improve the performance of some of the DataStage jobs running on our production servers. Before I start working on this task, I want to ask for some general rules regarding performance.

The rules I have collected so far are listed below:

1. Using hashed files for lookups
2. Using the Unix sort utility instead of the DS sort plug-in (for example, the sketch below)
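
As an illustration of rule 2, this is the kind of thing I mean (the file name, delimiter and key column are just examples):

# Pre-sort the extract with the Unix sort utility before the job reads it;
# the path, delimiter and key position here are illustrative only.
sort -t'|' -k1,1 /data/staging/customers.dat > /data/staging/customers_sorted.dat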

What else?

Besides these rules, I also have several questions.

1. If my lookup is based on a range rather than an exact match (so apparently a hashed file is not an option), is it better to export the lookup table to a hashed file and then use a UniVerse stage to do the lookup? In that case, does it mean I have to create the hashed file under the project account instead of in a designated folder? And if my lookup table contains millions of rows, is the UniVerse stage still a better solution?

2. What is the optimal number of stages/transforms in one job?

For example, suppose I have an input file and I need to reference lookup tables (done through 3-4 Transformer stages) before writing to the final table. Do you think it would be faster if I split the job into two jobs, writing the result to a sequential file in the first job, then populating the final table in the second job?

I would really appreciate it if somebody could shed some light on this.

Thanks,

Carol
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Don't do anything you don't have to. This includes:
  • Be selective when extracting data to process; only select the rows you need from the source, rather than grabbing them all and then discarding some in DataStage. (This is not applicable to sequential files, which you necessarily have to read sequentially.)

  • Only load columns into hashed files that are actually needed. If there's no line coming out of a column in the Transformer stage, that column's not needed. Don't load it in the first place.

  • Only load rows into hashed files that are actually needed. If there are out-of-date code values, they won't be required to process current data. Don't load these rows. (This is a small part of a Type 2 SCD strategy.) A sketch of such a selective load follows this list.
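
Purely as a sketch (the table, columns and connection are invented), a selective hashed file load amounts to something like this, whether you prototype it from the shell as shown or paste the SELECT into the stage's user-defined SQL:

# Sketch only: table, column and connection names are hypothetical.
# Select only the columns the Transformer uses and only current rows.
sqlplus -s scott/tiger@proddb <<'EOF'
SELECT cust_id,
       cust_name
FROM   customer_dim
WHERE  current_flag = 'Y';
EOF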
You can perform range and other non-equality lookups directly against the database table if the database is local. If it is not, then pre-load the required rows into a UV table. Use an ODBC stage for the former, or a UV stage for the latter, if you need multiple rows returned for a particular search key value.
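
For example (a sketch only; the table, columns and hard-coded date are hypothetical, and in an ODBC stage the date literal would be replaced by a ? parameter fed from the key column on the reference link):

# Range lookup prototyped from the shell; in the job this SELECT lives in
# the ODBC or UV stage, with the literal date replaced by a parameter marker.
sqlplus -s scott/tiger@proddb <<'EOF'
SELECT rate_code,
       rate_value
FROM   fx_rates
WHERE  TO_DATE('2006-01-15','YYYY-MM-DD') BETWEEN valid_from AND valid_to;
EOF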

There is no optimal figure for number of stages in a job. It depends on too many factors to be able to quote a figure. As a general rule I like to keep jobs as simple as possible, preferring more jobs to complexity. But this is more about maintainability than about throughput.

More throughput can be had, provided you have the hardware resources (particularly multiple CPUs and lots of memory), by enabling parallelism. Read Chapter 2 of the Parallel Job Developer's Guide (parjdev.pdf) for a discussion of parallelism. In server jobs you can implement pipeline parallelism through row buffering and the IPC, Link Collector and Link Partitioner stages; you implement partition parallelism through either parallel independent streams in one job or multi-instance jobs running simultaneously.
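
To illustrate partition parallelism with multi-instance jobs, a sketch using the dsjob command line (the project, job, invocation ids and parameter are invented):

# Run two invocations of a multi-instance job simultaneously, each processing
# its own slice of the source data. All names here are illustrative only.
dsjob -run -mode NORMAL -wait -param SourceFile=/data/in/part1.dat MyProject LoadCustomers.Inv1 &
dsjob -run -mode NORMAL -wait -param SourceFile=/data/in/part2.dat MyProject LoadCustomers.Inv2 &
wait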

Prefer a Sequential File stage and a pre-written, tuned control script to the scripts and files produced by the bulk loader stages, particularly ORABULK.
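
As a sketch of what such a pre-written control script can look like for Oracle SQL*Loader (the file, table and column names are invented, and the options need tuning for your own environment):

#!/bin/sh
# Hand-written, tuned bulk load; everything named here is illustrative only.
cat > /tmp/load_customers.ctl <<'EOF'
LOAD DATA
INFILE '/data/out/customers.dat'
APPEND
INTO TABLE customers
FIELDS TERMINATED BY '|'
(cust_id, cust_name, cust_city)
EOF

sqlldr userid=scott/tiger control=/tmp/load_customers.ctl \
       log=/tmp/load_customers.log direct=true errors=0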

Use statistics collection techniques (see the Tracing tab in Director when requesting a job run); you can record statistics to determine where the "hot spots" in a job are, then concentrate on those as the areas needing most attention.

Don't ever react to data gathered from small samples. Make sure that you have a sufficiently large volume of data being processed so that the overheads of opening files, connecting to databases, preparing SQL and so on become a minuscule proportion of the overall resources consumed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sjhouse
Premium Member
Posts: 43
Joined: Tue Nov 02, 2004 12:11 pm
Location: Minneapolis, MN

Post by sjhouse »

One of the biggest issues I found when I arrived at my current position was that hashed files were not sized correctly and were just using the default sizing. We improved performance substantially when we sized the dynamic hashed files properly.

Use the DataStage Hashed File Calculator (HFC.exe, located on the DataStage media as an unsupported utility) to assist you with sizing.
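
As a rough sketch only (the file name and figures are placeholders; take the real numbers from HFC, and on your release the engine shell may be dssh rather than uvsh), pre-creating a dynamic hashed file with a sensible minimum modulus looks something like:

# From the project directory, pre-create the hashed file so it doesn't start
# at the default modulus; the figures should come from the Hashed File Calculator.
cd /path/to/your/project
echo 'CREATE.FILE LKP_CUSTOMER DYNAMIC MINIMUM.MODULUS 40009 GROUP.SIZE 2' | $DSHOME/bin/uvsh

The same minimum modulus can also be set in the Hashed File stage's create-file options.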

Hope this helps.

Stephen
sudheer05
Participant
Posts: 30
Joined: Sun Oct 02, 2005 1:36 pm

HI

Post by sudheer05 »

sjhouse wrote: Use the DataStage Hashed File Calculator (HFC.exe, located on the DataStage media as an unsupported utility) to assist you with sizing.
Hi Stephen,
Can you please tell me where I can find the DataStage media?
Thanks and regards.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sudheer,

By media, Stephen was referring to the install CDs.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hashed File Calculator is to be found on the Windows client CD, in a folder called Utilities\Unsupported.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheer05
Participant
Posts: 30
Joined: Sun Oct 02, 2005 1:36 pm

hi

Post by sudheer05 »

Thanks Arndw and Ray.
RAJEEV KATTA
Participant
Posts: 103
Joined: Wed Jul 06, 2005 12:29 am

Re: hi

Post by RAJEEV KATTA »

Hey Carol,
One of the ways of improving the performance of DataStage jobs is to point the different nodes to different paths in the configuration file, provided you have enough disk space in those locations.

Cheers,
Rajeev.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Rajeev

This is a server question. Nodes are only on Parallel Extender jobs.
Mamu Kim
RAJEEV KATTA
Participant
Posts: 103
Joined: Wed Jul 06, 2005 12:29 am

Post by RAJEEV KATTA »

Hey Carol,
I am sorry, I did not read the post properly and missed that it was about server jobs.

Cheers,
Rajeev.