
Fine-tuning a parallel job

Posted: Mon Dec 14, 2009 6:22 am
by bhasannew
Hi,

When I run the job on a single node it takes around 10 minutes. If I run it using 4 nodes, it takes around 30 minutes to process all the records. I am hash partitioning the incoming records based on key fields.
Please throw some light on this. I would like to know why performance is decreasing even after increasing the number of nodes.

Thanks,
bhasan.

Posted: Mon Dec 14, 2009 6:39 am
by BugFree
hi,

We need to know your exact job design. What is the source, which stages have you used, target details, etc.?

Posted: Mon Dec 14, 2009 6:48 am
by bhasannew
The stages used in the job are:

MQRead ---> Transformer --> column import ---> Oracle Stage

Hash partitioning is done in the MQRead stage.

Posted: Mon Dec 14, 2009 6:59 am
by priyadarshikunal
When the time increases significantly after increasing the nodes, it generally means you are running out of resources. Increasing the number of nodes increases CPU/memory/I/O utilization.

First, try running that job on different numbers of nodes while monitoring resource utilization on the server, to find the optimal number of nodes for that configuration.
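As a side note for anyone running that experiment: the node count comes from the configuration file pointed to by APT_CONFIG_FILE, so you need one file per node count you want to test. A minimal two-node file on a single SMP server might look like the sketch below (the host name and disk paths are placeholders, not real values):

```
{
    node "node1"
    {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/disk1" {pools ""}
        resource scratchdisk "/data/ds/scratch1" {pools ""}
    }
    node "node2"
    {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/disk2" {pools ""}
        resource scratchdisk "/data/ds/scratch2" {pools ""}
    }
}
```

Point each node's disk and scratchdisk at separate physical spindles where you can, otherwise the nodes just compete for the same I/O.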

Posted: Mon Dec 14, 2009 7:22 am
by bhasannew
hi,

Thanks for your replies.
Based on my observation of the statistics, when I significantly increase the nodes the time decreases, but when I run the same job on a single node it takes much less time.
May I know the reason behind this?

I think that since I am hash partitioning in the source stage, the partitioning takes much of the processing time; but on a single node there is no need for hashing, as everything runs in one partition.

Am i rite?

Thanks in advance
Bhasan

Posted: Mon Dec 14, 2009 7:28 am
by chulett
Rite? No. Right? Yes.

Posted: Mon Dec 14, 2009 7:32 am
by bhasannew
hi chulett,

Sorry for using informal words (e.g. "rite").

So based on your reply, I assume that my conclusion is correct.
Am I right now? :)

thanks,
Bhasan.

Posted: Mon Dec 14, 2009 7:35 am
by chulett
As before, yes - right.

Posted: Mon Dec 14, 2009 8:13 am
by Sainath.Srinivasan
I would like to differ from that assumption.

It is probably because your data volume is low enough that it can be processed on a single node in 10 minutes.

But when you run on 4 nodes, there is overhead in creating and spawning that many processes, and hence the delay.

Posted: Mon Dec 14, 2009 8:18 am
by chulett
That 'delay' would also be affected by the hash partitioning, which would not be happening on a single node. So while it may not be the complete answer as to why it is faster, it is correct that no hashing is needed in a single-node job run.

Posted: Mon Dec 14, 2009 8:23 am
by Sainath.Srinivasan
A single node still does the partitioning, but all rows land in the same bucket.

Partitioning is extremely fast and will take less than a nano-second (in SMP).

So the impact from partition - if any - will be less than a second.
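To illustrate the single-bucket point with a sketch (plain Python, not DataStage internals; the function and field names are made up): hash partitioning assigns each row to a partition by hashing its key modulo the number of nodes, so with one node every row falls into bucket 0 and the partitioning is effectively a no-op.

```python
# Illustrative sketch of hash partitioning (not DataStage code).
def hash_partition(rows, key, n_nodes):
    """Assign each row to a bucket by hashing its key value."""
    buckets = [[] for _ in range(n_nodes)]
    for row in rows:
        buckets[hash(row[key]) % n_nodes].append(row)
    return buckets

rows = [{"id": i} for i in range(8)]

one_node = hash_partition(rows, "id", 1)   # everything lands in bucket 0
four_node = hash_partition(rows, "id", 4)  # rows spread across 4 buckets

print([len(b) for b in one_node])   # [8]
print([len(b) for b in four_node])
```

The hashing work is done either way; only the number of destination buckets changes.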

Posted: Mon Dec 14, 2009 8:40 am
by chulett
Less than a nano-second, eh? I suppose you've actually measured that. :wink:

Of course, one should be dumping the score in both cases and comparing startup times and the work being done, single versus multiple nodes, and contrasting that with the volume of data being processed. For all we know, a Server job would be the fastest implementation. And you're making an assumption that all nodes are on the same physical server; again, for all we know they're distributed across multiple servers.

As always, lots of factors involved.

Posted: Tue Dec 15, 2009 4:47 am
by priyadarshikunal
Sainath.Srinivasan wrote: I would like to differ from that assumption.

It is probably because your data volume is low enough that it can be processed on a single node in 10 minutes.
The time is increasing from 10 minutes to 30 minutes, which is too significant a difference to attribute to startup time and time taken in partitioning.

As Sainath said, it does partition the data on a single node and then lands it all in a single bucket. So in no scenario will that add 20 minutes. I still think it's because of resource congestion.

I agree with Craig about dumping the score; also get the time taken by each operator in both cases and compare the stats.

What about monitoring the resources in both cases, too?
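For the record, the score and the per-operator timings both come from standard parallel-engine reporting variables; a sketch of what to set before the two runs (set them in Administrator or in the job's environment; the config file path is a placeholder):

```shell
# Standard parallel-engine reporting variables -- set for both the
# 1-node and 4-node runs, then compare the job logs.
export APT_DUMP_SCORE=True          # dump the score: operators, datasets
                                    # and the process-to-node mapping
export APT_PM_PLAYER_TIMING=True    # report CPU time used by each player
export APT_CONFIG_FILE=/path/to/one_node.apt   # placeholder: point at the
                                               # config file under test
```

Run once per configuration file and diff the scores and timings side by side.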