Performance for Fact Jobs

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
him121
Premium Member
Premium Member
Posts: 55
Joined: Sat Aug 07, 2004 1:50 am

Performance for Fact Jobs

Post by him121 »

Hi..
All
We are Developing Fact Job...
I have some Questions

1) Suppose if we have 15 dimension in one Fact Job
--> we are using 7 Dimension Hash Lookup in to one Single Transformer..
and another 8 in another transformer...
so which design will give better performance...
is it ok..if we give all the 15 dimension in one single trasformer...?
or is there any limitations for lookup numbers for 1 transformer

2) case-1 --> i am loading E_XYZ Error table 3 times..through DRS
case-2 --> i am using pivot..and transformer..then loading all the data in E_XYZ..
here i am loading E_XYZ only once..Through DRS

question-> which design will give me better performance....
loading same table 3 times or only once...


i think there is performance..issue here so if any one know ..please explain..me
in detail..related to time & memory performance..

thanx in advance

himanshu
ketfos
Participant
Posts: 562
Joined: Mon May 03, 2004 8:58 pm
Location: san francisco
Contact:

Post by ketfos »

Hi,
I donot think datastage puts any limit to the number of lookups in the transformer stage.
Yes performance would be an issue, if the look ups on the transformer exceed a certain limit.
Again this would vary with amount of data in the lookups.


Ketfos
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

When you route data from one transform to another transform then that will slow a job down more than 15 lookups on one transform. If you cannot combine these into one transform then let us help you.

Build it both ways and let us know which is faster.
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is a maximum number of links supported by each stage type. For the Transformer stage it is 127, with an additional limit of a maximum of one stream input link.

Don't even think of trying to hit this limit "just because you can"! It's an engineering limit set that high because no-one would ever conceivably get there. As a rule of thumb I try to avoid more than four lookups in any one Transformer stage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
amsh76
Charter Member
Charter Member
Posts: 118
Joined: Wed Mar 10, 2004 10:58 pm

Post by amsh76 »

But Ray, doesn't that mean that one will end up using more number of transformer..or will end up creating multiple jobs, with job dependency ?? :?:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Exactly. Usually more Transformer stages. Separate them using IPC stages (or just use row buffering) to achieve pipeline parallelism.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
him121
Premium Member
Premium Member
Posts: 55
Joined: Sat Aug 07, 2004 1:50 am

again doubt

Post by him121 »

hi..
ray..
i like to use...IPC stage..
i think it will increase the performance..
if any other.........method..to increse the performance..
please ..tell me..
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There are lots of ways to increase "performance". Just look at all that spam you're getting for internet Viagra! :lol:

You have to start by defining performance. Rows/seconds is not a reliable indicator, because row sizes vary. If you want a rate measure, settle on something like MB/minute. Since the key performance indicator in ETL is the ability to fit with a time window, an elapsed time may be a preferable "performance" measure (smaller is better).

The next thing to do is to establish baseline measures - what's the best that the machine, the databases, the network, and so on, can deliver? Use these data to manage expectations. Sample jobs for measuring these things have been posted in the past.

Read Chapter 2 of the Parallel Job Developer's Guide (parjdev.pdf), which is about pipeline parallelism (which is what row buffering/IPC stage gives you), and partition parallelism (which you can get via multi-instance jobs, parallel streams within jobs, or parallel extender).

Other tips revolve around not doing unnecessary processing; don't select any rows or columns you don't need to process, especially when loading hashed files.

And, of course, you can get more processing power. Even unto mainframe, if that's what you need. There are sites running DataStage on UNIX servers with 64 CPUs, and on MPP clusters. Search for "benchmark" on Ascential Softwareweb site.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
him121
Premium Member
Premium Member
Posts: 55
Joined: Sat Aug 07, 2004 1:50 am

Post by him121 »

But.
Ray
we have Win 2000..as Server
though there is chances of Parallism...
how can i achieve this in Win System..
only using IPC..or there are still some other Methods..

waiting for reply..
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There are at least two ways to get partition parallelism in server jobs.

But the description of what it is is to be found in the Parallel Job Developer's Guide. Please do read this, even though you aren't in a position to develop parallel jobs. It is merely an introduction to the concepts.

You should investigate jobs that are capable of running multiple instances. Another possibility is multiple streams within the one server job, possibly fed by link partitioner and retrieved by link collector stages. Or row splitter and row merger stages, if you want to work column-by-column for your parallelism.
Finally, there's nothing to stop you running N jobs at once using a Job Sequence or job control code that starts N jobs instead of 1.

In all of the above beware that your computer, whatever it is, is a finite resource; overloading it can have more deleterious effects on throughput than underloading it. In particular, if you only have one CPU, don't even bother with parallelism.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply