Reg. Configuration File

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Reg. Configuration File

Post by chandra.shekhar@tcs.com »

Hi,
Can anybody explain to me the following code from the configuration file?
And what difference does it make when the word "DB2" is used?
I am a bit slow in understanding the inner logic of the file. :oops:

Code: Select all

{
	node "node1_1"
	{
		fastname "brhaspati"
		pools ""
		resource disk "/resource1" {pools ""}
		resource scratchdisk "/scratch1" {pools ""}
	}
node "node1_2"
	{
		fastname "brhaspati"
		pools "DB2"
		resource disk "/resource1" {pools ""}
		resource scratchdisk "/scratch1" {pools ""}
	}
}
Thanx and Regards,
ETL User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

This configuration offers two nodes, only one of which is in the default node pool (the one with "" as its pool name). The other one is in a node pool called "DB2".

Non-DB2 stages will, unless specified otherwise, execute in the default node pool. In your configuration that means they will all run sequentially, since there's only one node in the pool.

DB2 stages will automatically seek out a node pool called "DB2" and execute in that. If there is no "DB2" node pool, they will also execute in the default node pool.

As far as I can see this configuration file is a misguided attempt to separate the DB2 processing from the other processing. The problem is that it has sacrificed all the benefits of parallelism to do so, without any gains in overall processing efficiency since all nodes are on the same machine.

If there were two or more processing (default) nodes, and maybe multiple nodes in the "DB2" node pool corresponding to the number of table partitions, then we might have a different story!
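For illustration only, here is a minimal sketch of what such a configuration might look like. The host names (serverA, serverB), node names, and resource paths are assumptions, not taken from your file; in practice the "DB2" pool nodes would be aligned with the DB2 table partitions:

Code: Select all

{
	node "node1"
	{
		fastname "serverA"
		pools ""
		resource disk "/resource1" {pools ""}
		resource scratchdisk "/scratch1" {pools ""}
	}
	node "node2"
	{
		fastname "serverB"
		pools ""
		resource disk "/resource1" {pools ""}
		resource scratchdisk "/scratch1" {pools ""}
	}
	node "db2node1"
	{
		fastname "serverA"
		pools "DB2"
		resource disk "/resource1" {pools ""}
		resource scratchdisk "/scratch1" {pools ""}
	}
	node "db2node2"
	{
		fastname "serverB"
		pools "DB2"
		resource disk "/resource1" {pools ""}
		resource scratchdisk "/scratch1" {pools ""}
	}
}
With two machines in the default pool the non-DB2 stages can genuinely run in parallel, while the "DB2" pool nodes keep the DB2 work separate without starving the rest of the job.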
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Thanx Ray.
You are correct; the actual file has 12 default nodes and 12 "DB2" nodes. Just to understand the logic, I pasted only a part of it.
So, according to you, the 12 default nodes will be assigned to non-DB2 stages and the 12 "DB2" nodes to DB2 stages, am I right?
And in this scenario, am I achieving parallelism?
Thanx and Regards,
ETL User
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

Yes, you are running on a parallel architecture, with 12 nodes for processing stages and 12 nodes for DB2 stages.
- Zulfi
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Thanx Zulfi.
So in which scenario do you think a normal job will run faster: using 24 default nodes, or a 24-node mix (12 default and 12 DB2 nodes)?
Consider a job

Code: Select all

 Seq File -->Tfr-->DB2 Connector 
The source has around 100 million records.
Only a Null/Valid Date check happens in the Tfr, and that too for only about 10% of the columns.
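Just to be concrete, the check in the Tfr is roughly of this form (a sketch only; the link name lnk_in and column order_date are made-up examples, assuming the incoming column is a string):

Code: Select all

If IsNull(lnk_in.order_date) Or Not(IsValid("date", lnk_in.order_date))
Then SetNull()
Else StringToDate(lnk_in.order_date)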
What do you say?
Thanx and Regards,
ETL User
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

chandra.shekhar@tcs.com wrote: So in which scenario do you think a normal job will run faster: using 24 default nodes, or a 24-node mix (12 default and 12 DB2 nodes)?

What do you say?
To answer the above, there is a lot to say :wink:

Increasing the number of nodes on and on won't make your job run faster; you need to understand your hardware to decide how far you can push parallelism.

Adding too many nodes increases the overhead of managing the numerous processes.

If you are not sure what lies under the hood, then perform trial-and-error runs to find the number of nodes that gives the best performance (which in your case you define as processing speed). Be aware that this node count will also depend on the varying load on the server at the time the tests are performed.
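A common way to run such trials (a sketch; the file names and path are hypothetical) is to keep several configuration files of different sizes and point each test run at one of them through the $APT_CONFIG_FILE environment variable or the equivalent job parameter:

Code: Select all

# hypothetical locations - one configuration file per node count
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt
# run the job and note the elapsed time, then repeat with 8node.apt, 12node.apt, 24node.apt

Compare the timings under a similar server load and pick the smallest configuration that meets your target.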
- Zulfi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

See if you can talk to a site that is using a comparably sized configuration, for example Target Corporation (they have offices in Minneapolis and Bangalore). One of their configurations has 10 processing nodes and 24 DB2 nodes (12 for reading, 12 for writing).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Thanx Zulfi and Ray for your responses.
I will test and let you know.
Thanx and Regards,
ETL User