Performance for Fact Jobs
Moderators: chulett, rschirm, roy
Performance for Fact Jobs
Hi..
All
We are Developing Fact Job...
I have some Questions
1) Suppose if we have 15 dimension in one Fact Job
--> we are using 7 Dimension Hash Lookup in to one Single Transformer..
and another 8 in another transformer...
so which design will give better performance...
is it ok..if we give all the 15 dimension in one single trasformer...?
or is there any limitations for lookup numbers for 1 transformer
2) case-1 --> i am loading E_XYZ Error table 3 times..through DRS
case-2 --> i am using pivot..and transformer..then loading all the data in E_XYZ..
here i am loading E_XYZ only once..Through DRS
question-> which design will give me better performance....
loading same table 3 times or only once...
i think there is performance..issue here so if any one know ..please explain..me
in detail..related to time & memory performance..
thanx in advance
himanshu
All
We are Developing Fact Job...
I have some Questions
1) Suppose if we have 15 dimension in one Fact Job
--> we are using 7 Dimension Hash Lookup in to one Single Transformer..
and another 8 in another transformer...
so which design will give better performance...
is it ok..if we give all the 15 dimension in one single trasformer...?
or is there any limitations for lookup numbers for 1 transformer
2) case-1 --> i am loading E_XYZ Error table 3 times..through DRS
case-2 --> i am using pivot..and transformer..then loading all the data in E_XYZ..
here i am loading E_XYZ only once..Through DRS
question-> which design will give me better performance....
loading same table 3 times or only once...
i think there is performance..issue here so if any one know ..please explain..me
in detail..related to time & memory performance..
thanx in advance
himanshu
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
There is a maximum number of links supported by each stage type. For the Transformer stage it is 127, with an additional limit of a maximum of one stream input link.
Don't even think of trying to hit this limit "just because you can"! It's an engineering limit set that high because no-one would ever conceivably get there. As a rule of thumb I try to avoid more than four lookups in any one Transformer stage.
Don't even think of trying to hit this limit "just because you can"! It's an engineering limit set that high because no-one would ever conceivably get there. As a rule of thumb I try to avoid more than four lookups in any one Transformer stage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
again doubt
hi..
ray..
i like to use...IPC stage..
i think it will increase the performance..
if any other.........method..to increse the performance..
please ..tell me..
ray..
i like to use...IPC stage..
i think it will increase the performance..
if any other.........method..to increse the performance..
please ..tell me..
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
There are lots of ways to increase "performance". Just look at all that spam you're getting for internet Viagra!
You have to start by defining performance. Rows/seconds is not a reliable indicator, because row sizes vary. If you want a rate measure, settle on something like MB/minute. Since the key performance indicator in ETL is the ability to fit with a time window, an elapsed time may be a preferable "performance" measure (smaller is better).
The next thing to do is to establish baseline measures - what's the best that the machine, the databases, the network, and so on, can deliver? Use these data to manage expectations. Sample jobs for measuring these things have been posted in the past.
Read Chapter 2 of the Parallel Job Developer's Guide (parjdev.pdf), which is about pipeline parallelism (which is what row buffering/IPC stage gives you), and partition parallelism (which you can get via multi-instance jobs, parallel streams within jobs, or parallel extender).
Other tips revolve around not doing unnecessary processing; don't select any rows or columns you don't need to process, especially when loading hashed files.
And, of course, you can get more processing power. Even unto mainframe, if that's what you need. There are sites running DataStage on UNIX servers with 64 CPUs, and on MPP clusters. Search for "benchmark" on Ascential Softwareweb site.
You have to start by defining performance. Rows/seconds is not a reliable indicator, because row sizes vary. If you want a rate measure, settle on something like MB/minute. Since the key performance indicator in ETL is the ability to fit with a time window, an elapsed time may be a preferable "performance" measure (smaller is better).
The next thing to do is to establish baseline measures - what's the best that the machine, the databases, the network, and so on, can deliver? Use these data to manage expectations. Sample jobs for measuring these things have been posted in the past.
Read Chapter 2 of the Parallel Job Developer's Guide (parjdev.pdf), which is about pipeline parallelism (which is what row buffering/IPC stage gives you), and partition parallelism (which you can get via multi-instance jobs, parallel streams within jobs, or parallel extender).
Other tips revolve around not doing unnecessary processing; don't select any rows or columns you don't need to process, especially when loading hashed files.
And, of course, you can get more processing power. Even unto mainframe, if that's what you need. There are sites running DataStage on UNIX servers with 64 CPUs, and on MPP clusters. Search for "benchmark" on Ascential Softwareweb site.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
There are at least two ways to get partition parallelism in server jobs.
But the description of what it is is to be found in the Parallel Job Developer's Guide. Please do read this, even though you aren't in a position to develop parallel jobs. It is merely an introduction to the concepts.
You should investigate jobs that are capable of running multiple instances. Another possibility is multiple streams within the one server job, possibly fed by link partitioner and retrieved by link collector stages. Or row splitter and row merger stages, if you want to work column-by-column for your parallelism.
Finally, there's nothing to stop you running N jobs at once using a Job Sequence or job control code that starts N jobs instead of 1.
In all of the above beware that your computer, whatever it is, is a finite resource; overloading it can have more deleterious effects on throughput than underloading it. In particular, if you only have one CPU, don't even bother with parallelism.
But the description of what it is is to be found in the Parallel Job Developer's Guide. Please do read this, even though you aren't in a position to develop parallel jobs. It is merely an introduction to the concepts.
You should investigate jobs that are capable of running multiple instances. Another possibility is multiple streams within the one server job, possibly fed by link partitioner and retrieved by link collector stages. Or row splitter and row merger stages, if you want to work column-by-column for your parallelism.
Finally, there's nothing to stop you running N jobs at once using a Job Sequence or job control code that starts N jobs instead of 1.
In all of the above beware that your computer, whatever it is, is a finite resource; overloading it can have more deleterious effects on throughput than underloading it. In particular, if you only have one CPU, don't even bother with parallelism.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.