Improving Performance

Post questions here related to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

yiminghu
Participant
Posts: 91
Joined: Fri Nov 14, 2003 10:09 am

Improving Performance

Post by yiminghu »

I was assigned a task to improve the performance of some of the DataStage jobs running on our production servers. Before I start working on this task, I want to ask for some general rules regarding performance.

The rules I have collected so far are listed below:

1. Using hashed files for lookups
2. Using the Unix sort utility instead of the DS sort plug-in (for example, the sketch below)
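
As an illustration of rule 2, this is the kind of thing I mean (the file name, delimiter and key column are just examples):

# Pre-sort the extract with the Unix sort utility before the job reads it;
# the path, delimiter and key position here are illustrative only.
sort -t'|' -k1,1 /data/staging/customers.dat > /data/staging/customers_sorted.dat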

What else?

Besides these rules, I also have several questions.

1. If my lookup is based on a range rather than an exact match (so apparently a hashed file is not an option), is it better to export the lookup table to a hashed file and then use a UniVerse stage to do the lookup? In that case, does it mean I have to create the hashed file under the project account instead of in a designated folder? And if my lookup table contains millions of rows, is the UniVerse stage still a better solution?

2. What is the optimal number of stages/transforms in one job?

For example, suppose I have an input file and I need to reference lookup tables (done through 3-4 Transformer stages) before writing to the final table. Do you think it would be faster if I split the job into two jobs, writing the result to a sequential file in the first job, then populating the final table in the second job?

I would really appreciate it if somebody could shed some light on this.

Thanks,

Carol
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Don't do anything you don't have to. This includes:
  • Be selective when extracting data to process; only select the rows you need from the source, rather than grabbing them all and then discarding some in DataStage. (This is not applicable to sequential files, which you necessarily have to read sequentially.)

  • Only load columns into hashed files that are actually needed. If there's no line coming out of a column in the Transformer stage, that column's not needed. Don't load it in the first place.

  • Only load rows into hashed files that are actually needed. If there are out-of-date code values, they won't be required to process current data. Don't load these rows. (This is a small part of a Type 2 SCD strategy.) A sketch of such a selective load follows this list.
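
Purely as a sketch (the table, columns and connection are invented), a selective hashed file load amounts to something like this, whether you prototype it from the shell as shown or paste the SELECT into the stage's user-defined SQL:

# Sketch only: table, column and connection names are hypothetical.
# Select only the columns the Transformer uses and only current rows.
sqlplus -s scott/tiger@proddb <<'EOF'
SELECT cust_id,
       cust_name
FROM   customer_dim
WHERE  current_flag = 'Y';
EOF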
You can perform range and other non-equality lookups directly against the database table if the database is local. If it is not, then pre-load the required rows into a UV table. Use an ODBC stage for the former, or a UV stage for the latter, if you need multiple rows returned for a particular search key value.
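
For example (a sketch only; the table, columns and hard-coded date are hypothetical, and in an ODBC stage the date literal would be replaced by a ? parameter fed from the key column on the reference link):

# Range lookup prototyped from the shell; in the job this SELECT lives in
# the ODBC or UV stage, with the literal date replaced by a parameter marker.
sqlplus -s scott/tiger@proddb <<'EOF'
SELECT rate_code,
       rate_value
FROM   fx_rates
WHERE  TO_DATE('2006-01-15','YYYY-MM-DD') BETWEEN valid_from AND valid_to;
EOF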

There is no optimal figure for number of stages in a job. It depends on too many factors to be able to quote a figure. As a general rule I like to keep jobs as simple as possible, preferring more jobs to complexity. But this is more about maintainability than about throughput.

More throughput can be had, provided you have the hardware resources (particularly multiple CPUs and lots of memory), by enabling parallelism. Read Chapter 2 of the Parallel Job Developer's Guide (parjdev.pdf) for a discussion of parallelism. In server jobs you can implement pipeline parallelism through row buffering and the IPC, Link Collector and Link Partitioner stages; you implement partition parallelism through either parallel independent streams in one job or multi-instance jobs running simultaneously.
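
To illustrate partition parallelism with multi-instance jobs, a sketch using the dsjob command line (the project, job, invocation ids and parameter are invented):

# Run two invocations of a multi-instance job simultaneously, each processing
# its own slice of the source data. All names here are illustrative only.
dsjob -run -mode NORMAL -wait -param SourceFile=/data/in/part1.dat MyProject LoadCustomers.Inv1 &
dsjob -run -mode NORMAL -wait -param SourceFile=/data/in/part2.dat MyProject LoadCustomers.Inv2 &
wait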

Prefer a Sequential File stage and a pre-written, tuned control script to the scripts and files produced by the bulk loader stages, particularly ORABULK.
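
As a sketch of what such a pre-written control script can look like for Oracle SQL*Loader (the file, table and column names are invented, and the options need tuning for your own environment):

#!/bin/sh
# Hand-written, tuned bulk load; everything named here is illustrative only.
cat > /tmp/load_customers.ctl <<'EOF'
LOAD DATA
INFILE '/data/out/customers.dat'
APPEND
INTO TABLE customers
FIELDS TERMINATED BY '|'
(cust_id, cust_name, cust_city)
EOF

sqlldr userid=scott/tiger control=/tmp/load_customers.ctl \
       log=/tmp/load_customers.log direct=true errors=0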

Use statistics collection techniques (see the Tracing tab in Director when requesting a job run); you can record statistics to determine where the "hot spots" in a job are, then concentrate on those as the areas needing most attention.

Don't ever react to data gathered from small samples. Make sure that you have a sufficiently large volume of data being processed so that the overheads of opening files, connecting to databases, preparing SQL and so on become a minuscule proportion of the overall resources consumed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sjhouse
Premium Member
Posts: 43
Joined: Tue Nov 02, 2004 12:11 pm
Location: Minneapolis, MN

Post by sjhouse »

One of the biggest issues I found when I arrived at my current position was that hashed files were not sized correctly and were just using the default sizing. We improved performance substantially when we sized the dynamic hashed files properly.

Use the DataStage Hashed File Calculator (HFC.exe, located on the DataStage media as an unsupported utility) to assist you with sizing.
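
As a rough sketch only (the file name and figures are placeholders; take the real numbers from HFC, and on your release the engine shell may be dssh rather than uvsh), pre-creating a dynamic hashed file with a sensible minimum modulus looks something like:

# From the project directory, pre-create the hashed file so it doesn't start
# at the default modulus; the figures should come from the Hashed File Calculator.
cd /path/to/your/project
echo 'CREATE.FILE LKP_CUSTOMER DYNAMIC MINIMUM.MODULUS 40009 GROUP.SIZE 2' | $DSHOME/bin/uvsh

The same minimum modulus can also be set in the Hashed File stage's create-file options.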

Hope this helps.

Stephen
sudheer05
Participant
Posts: 30
Joined: Sun Oct 02, 2005 1:36 pm

HI

Post by sudheer05 »

sjhouse wrote: Use the DataStage Hashed File Calculator (HFC.exe, located on the DataStage media as an unsupported utility) to assist you with sizing.
Hi Stephen,
Can you please tell me where I can find the DataStage media?
Thanks and regards.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sudheer,

By media, Stephen was referring to the install CDs.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hashed File Calculator is to be found on the Windows client CD, in a folder called Utilities\Unsupported.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheer05
Participant
Posts: 30
Joined: Sun Oct 02, 2005 1:36 pm

hi

Post by sudheer05 »

Thanks Arndw and Ray.
RAJEEV KATTA
Participant
Posts: 103
Joined: Wed Jul 06, 2005 12:29 am

Re: hi

Post by RAJEEV KATTA »

Hey Carol,
One of the ways of improving the performance of DataStage jobs is to point the different nodes to different paths in the configuration file, provided you have enough disk space in those locations.

Cheers,
Rajeev.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Rajeev

This is a server question. Nodes are only on Parallel Extender jobs.
Mamu Kim
RAJEEV KATTA
Participant
Posts: 103
Joined: Wed Jul 06, 2005 12:29 am

Post by RAJEEV KATTA »

Hey Carol,
I am sorry, I did not read the post properly and missed that it was about server jobs.

Cheers,
Rajeev.