Max number of Hashed files
Hi,
I have a few questions with respect to references in Server jobs.
1. Is there any limit to the number of hashed files used for references/lookups in a single job?
2. If there isn't, what about performance? Would it be better to split the job into several jobs so that the lookups (assuming there are many) are spread among them?
Thanks in advance.
gateleys
There is always going to be a limit somewhere, but I'm not aware of any limits that you will hit when designing your job - if they fit on the canvas they'll compile and run.
If you have Transformer-to-Transformer links in your job, activate inter-process row buffering or put in IPC stages, and limit yourself to a couple of lookups per Transformer; then you should do a good job of distributing the load on a multi-CPU system. Splitting the load across several jobs would have the same effect on performance, and it might make things easier to maintain over the long haul than one monster job.
There may be a limit somewhere, but long before you reach it you will no longer understand your own job.
You won't get a job perfect on the first attempt, and the more complicated the job, the more points there are to check when hunting for errors.
So, for the sake of understanding, build several simple jobs rather than one complex one. It will help you now and later.
Wolfgang Hürter
Amsterdam
Re: Max number of Hashed files
Hi,
Regarding question 2
If you are using different reference tables, splitting the job would not help much. I have a single job with 9 lookups and 9 transformers.
Regards
Sreeni
naveendronavalli wrote: Also, as suggested by Ray, it will certainly help to design the job using multiple Transformers and sandwiching IPC stages between these active stages to make the most of a multi-CPU server box.
Thanks,
Naveen.
Hi Naveen,
When using the IPC stage, how do I determine the optimal buffer size for my job? I have gone through the Server guide, and I have used this stage a number of times before for a similar purpose. However, I have always resorted to the default of 128K (each for Read and Write). Under what circumstances would a bigger or smaller buffer size give better performance? Also, are 'enabling in-process row buffering' and using the IPC stage one and the same thing?
You should enable inter-process row buffering; that is almost the same as using an explicit IPC stage. The buffer size does not normally affect speed, so it rarely needs changing: it just needs to be large enough to hold enough rows of data to smooth over temporary differences in speed between the two sides.
Think of the data flow as water and the buffer as a bathtub. If the water filling the tub doesn't flow at a constant rate, or doesn't drain at a constant rate, the bathtub keeps everything moving: if one side stops or blocks, it doesn't (immediately) affect the other side. As long as the tub holds enough water that a temporary slowdown in draining doesn't fill it, and a temporary slowdown in filling doesn't empty it, its size doesn't matter; making it the size of a swimming pool won't make anything go faster.
128KB holds a lot of data. Even if your row size were 1KB you could still buffer 128 rows, more than enough to absorb temporary speed differences. In almost all practical applications one side of the buffer will always be faster than the other, so the buffer will almost always be either 100% full or 100% empty; the slower process always has something to do, while the faster process spends most of its time waiting for the buffer. Usually a buffer of just a couple of rows is sufficient, so the default of 128KB is almost ridiculously oversized.
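To see why a modest buffer is enough, here is a hypothetical single-threaded simulation of the bathtub (an illustration with invented rates, not DataStage internals): a producer that is consistently faster than the consumer, run once with a tiny buffer and once with a huge one. Throughput is capped by the slower side either way.

```python
# Simulate a bounded buffer between a fast producer and a slow consumer.
# The producer can only add a row when the buffer has room; the consumer
# can only remove a row when the buffer is non-empty.
from collections import deque

def simulate(capacity, producer_rate, consumer_rate, ticks):
    """Return how many rows the consumer processed after `ticks` steps."""
    buf = deque()
    produced = consumed = 0
    for _ in range(ticks):
        for _ in range(producer_rate):      # producer side of the tick
            if len(buf) < capacity:
                buf.append(produced)
                produced += 1
        for _ in range(consumer_rate):      # consumer side of the tick
            if buf:
                buf.popleft()
                consumed += 1
    return consumed

# Producer offers 3 rows/tick, consumer handles only 2 rows/tick.
small = simulate(capacity=4, producer_rate=3, consumer_rate=2, ticks=1000)
huge = simulate(capacity=100_000, producer_rate=3, consumer_rate=2, ticks=1000)
print(small, huge)   # 2000 2000 - the slower side sets the throughput
```

With a steady mismatch in speeds, the tiny buffer and the swimming-pool buffer deliver identical throughput: once the buffer is full (or empty), only the slower side's rate matters.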