Page 1 of 1

Performance for Fact Jobs

Posted: Wed Sep 08, 2004 2:43 am
by him121
Hi..
All
We are Developing Fact Job...
I have some Questions

1) Suppose if we have 15 dimension in one Fact Job
--> we are using 7 Dimension Hash Lookup in to one Single Transformer..
and another 8 in another transformer...
so which design will give better performance...
is it ok..if we give all the 15 dimension in one single trasformer...?
or is there any limitations for lookup numbers for 1 transformer

2) case-1 --> i am loading E_XYZ Error table 3 times..through DRS
case-2 --> i am using pivot..and transformer..then loading all the data in E_XYZ..
here i am loading E_XYZ only once..Through DRS

question-> which design will give me better performance....
loading same table 3 times or only once...


i think there is performance..issue here so if any one know ..please explain..me
in detail..related to time & memory performance..

thanx in advance

himanshu

Posted: Wed Sep 08, 2004 11:31 am
by ketfos
Hi,
I donot think datastage puts any limit to the number of lookups in the transformer stage.
Yes performance would be an issue, if the look ups on the transformer exceed a certain limit.
Again this would vary with amount of data in the lookups.


Ketfos

Posted: Wed Sep 08, 2004 2:51 pm
by kduke
When you route data from one transform to another transform then that will slow a job down more than 15 lookups on one transform. If you cannot combine these into one transform then let us help you.

Build it both ways and let us know which is faster.

Posted: Wed Sep 08, 2004 5:00 pm
by ray.wurlod
There is a maximum number of links supported by each stage type. For the Transformer stage it is 127, with an additional limit of a maximum of one stream input link.

Don't even think of trying to hit this limit "just because you can"! It's an engineering limit set that high because no-one would ever conceivably get there. As a rule of thumb I try to avoid more than four lookups in any one Transformer stage.

Posted: Wed Sep 08, 2004 5:13 pm
by amsh76
But Ray, doesn't that mean that one will end up using more number of transformer..or will end up creating multiple jobs, with job dependency ?? :?:

Posted: Wed Sep 08, 2004 7:49 pm
by ray.wurlod
Exactly. Usually more Transformer stages. Separate them using IPC stages (or just use row buffering) to achieve pipeline parallelism.

again doubt

Posted: Fri Sep 10, 2004 8:15 am
by him121
hi..
ray..
i like to use...IPC stage..
i think it will increase the performance..
if any other.........method..to increse the performance..
please ..tell me..

Posted: Fri Sep 10, 2004 5:28 pm
by ray.wurlod
There are lots of ways to increase "performance". Just look at all that spam you're getting for internet Viagra! :lol:

You have to start by defining performance. Rows/seconds is not a reliable indicator, because row sizes vary. If you want a rate measure, settle on something like MB/minute. Since the key performance indicator in ETL is the ability to fit with a time window, an elapsed time may be a preferable "performance" measure (smaller is better).

The next thing to do is to establish baseline measures - what's the best that the machine, the databases, the network, and so on, can deliver? Use these data to manage expectations. Sample jobs for measuring these things have been posted in the past.

Read Chapter 2 of the Parallel Job Developer's Guide (parjdev.pdf), which is about pipeline parallelism (which is what row buffering/IPC stage gives you), and partition parallelism (which you can get via multi-instance jobs, parallel streams within jobs, or parallel extender).

Other tips revolve around not doing unnecessary processing; don't select any rows or columns you don't need to process, especially when loading hashed files.

And, of course, you can get more processing power. Even unto mainframe, if that's what you need. There are sites running DataStage on UNIX servers with 64 CPUs, and on MPP clusters. Search for "benchmark" on Ascential Softwareweb site.

Posted: Sun Sep 12, 2004 10:51 pm
by him121
But.
Ray
we have Win 2000..as Server
though there is chances of Parallism...
how can i achieve this in Win System..
only using IPC..or there are still some other Methods..

waiting for reply..

Posted: Sun Sep 12, 2004 11:38 pm
by ray.wurlod
There are at least two ways to get partition parallelism in server jobs.

But the description of what it is is to be found in the Parallel Job Developer's Guide. Please do read this, even though you aren't in a position to develop parallel jobs. It is merely an introduction to the concepts.

You should investigate jobs that are capable of running multiple instances. Another possibility is multiple streams within the one server job, possibly fed by link partitioner and retrieved by link collector stages. Or row splitter and row merger stages, if you want to work column-by-column for your parallelism.
Finally, there's nothing to stop you running N jobs at once using a Job Sequence or job control code that starts N jobs instead of 1.

In all of the above beware that your computer, whatever it is, is a finite resource; overloading it can have more deleterious effects on throughput than underloading it. In particular, if you only have one CPU, don't even bother with parallelism.