node constraints

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ravi7487
Premium Member
Posts: 25
Joined: Sat Feb 05, 2011 10:03 pm

node constraints

Post by ravi7487 »

Hi,

A default configuration file on a Windows PC with one CPU usually has one node. Can we split that into two nodes with something like the file below? If I have a key column deptno with the values below, will hash partitioning send the dept 10 rows to one node and the dept 20 rows to the other?

deptno,sal
10,1000
20,300
10,300
20,300




{
    node "node1"
    {
        fastname "WINXP"
        pools ""
        resource disk "C:/Ascential/DataStage/Datasets" { pools "" }
        resource scratchdisk "C:/Ascential/DataStage/Scratch" { pools "" }
    }
    node "node2"
    {
        fastname "WINXP"
        pools ""
        resource disk "C:/Ascential/DataStage/Datasets" { pools "" }
        resource scratchdisk "C:/Ascential/DataStage/Scratch" { pools "" }
    }
}
Vidyut
Participant
Posts: 24
Joined: Wed Oct 13, 2010 12:45 am

Post by Vidyut »

As far as I know, a node can have one or more CPUs, but a CPU can have only one node...
Experts, please comment.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

There is no inherent limit to the number of logical nodes (node entries) you can specify for a single server (fastname entry) within a configuration file. CPU/core allocation for job processes is completely up to the operating system; the parallel engine has no control over that.

There is no absolute guarantee that hash partitioning will split the deptno values exactly as you describe, as the results depend on the (unpublished) hash algorithm used by the engine.
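
To illustrate the principle only, here is a toy model in Python. This is NOT the engine's actual hash function (which, again, is unpublished); it just shows that rows sharing a key always land together, while which partition they land in is entirely an artifact of the hash function used:

# Toy model of hash partitioning -- not DataStage's actual algorithm.
import hashlib

def partition_of(key, num_partitions):
    # Hash the key's bytes and map the digest onto a partition number.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_partitions

rows = [(10, 1000), (20, 300), (10, 300), (20, 300)]
for deptno, sal in rows:
    print(deptno, sal, "-> partition", partition_of(deptno, 2))
# Every deptno 10 row prints the same partition, as does every deptno 20
# row, but whether that is partition 0 or 1 depends on the hash function.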

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
devesh_ssingh
Participant
Posts: 148
Joined: Thu Apr 10, 2008 12:47 am

Post by devesh_ssingh »

@Vidyut..

A single CPU can host multiple logical nodes as well.

@ravi
Yes, you can split into two nodes by defining the configuration file as you described, but since there is only one CPU it won't improve the performance of your job.
The data would also be split across the two nodes as you mentioned: dept 10 on node1 and dept 20 on node2, or vice versa, depending on the input file.

Also, since your post is titled "node constraints": the node constraint option gives you the flexibility to run the job only on the particular node(s) you name in the constraint.
ravi7487
Premium Member
Posts: 25
Joined: Sat Feb 05, 2011 10:03 pm

Post by ravi7487 »

Thank you all for replying.

My default configuration file is as above, with two nodes. I did not select any node constraint option in my job (meaning it runs on all available nodes).

I assigned the system variable @PARTITIONNUM to an output column, and I only see partition '1' for both the dept 10 and dept 20 values. I was assuming the dept 10 values would be on partition 1 and the dept 20 values on partition 2. Please suggest.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Bad assumption. All you know is that records with the same hashing keys will end up on the same node; you have no knowledge of which will go where.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ravi7487
Premium Member
Posts: 25
Joined: Sat Feb 05, 2011 10:03 pm

Post by ravi7487 »

Could you please explain, chulett? Thank you.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

As Craig states, proper partitioning ensures that all records with the same partition key value will end up in the SAME partition. You have no real control over WHICH partition they land in. For all you know, the values 10 and 20 may end up in the same partition, or 10 in partition 2 and 20 in partition 1. Several factors affect this, including data types, the number of partitions, the number of partition keys, and the partitioning type.

Your 10s and 20s will probably be split into different partitions. The best way for you to verify this is to run a test job and see which partition your data ends up in; a simple Peek stage includes the partition number in the log (as does any stage).

You choose your partitioning keys so that the data is as evenly distributed as possible while meeting the requirements of the logic processing that data. Sometimes your data may not be as evenly distributed as you would like, but the logic overrules: you want your results to be correct.
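
A quick sketch of one of those factors, the number of partitions (plain Python, with a made-up hash value standing in for whatever the engine computes):

# The same hash value lands in different partitions as the
# partition count changes.
hash_value = 1234567  # stand-in for the engine's hash of a key
for n in (2, 3, 4):
    print(n, "partitions -> partition", hash_value % n)
# Prints 1, 1 and 3: change the configuration file's node count and
# the same key value can move to a different partition number.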

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
ravi7487
Premium Member
Posts: 25
Joined: Sat Feb 05, 2011 10:03 pm

Post by ravi7487 »

Thank you. I used the system variable @PARTITIONNUM in the Transformer stage, which gives the partition number of each record. All dept 10 and dept 20 values are assigned partition number 1, so I am not able to understand this.
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Once again, just to make this clear, you have no guarantee that 10 and 20 will be placed in DIFFERENT partitions, just that all of the records for 10 will be placed into the SAME partition and that all of the records for 20 will be placed into the SAME partition (assuming that deptno is the only partition key). 10's and 20's may or may not be placed in the same partition.

Part of your situation is that your example has very few distinct key values. Hash partitioning is most effective when there are many distinct values to spread around. You may want to try a different partitioning method, such as modulus if you have a single integer key column; it may provide better distribution for your example.
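
Since modulus partitioning is defined as key mod number-of-partitions, you can work out its behavior by hand. A minimal sketch in Python, using the sample deptno values from this thread:

# Modulus partitioning: partition = key mod number_of_partitions
num_partitions = 2
for deptno in (10, 20):
    print(deptno, "-> partition", deptno % num_partitions)
# 10 % 2 = 0 and 20 % 2 = 0, so these particular values would still
# share a partition on two nodes; values such as 10 and 21 would split.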

For more information on partitioning, read the Partitioning, repartitioning and collecting data section of Chapter 2 of the Parallel Job Developer Guide.
- james wiles


All generalizations are false, including this one - Mark Twain.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Very good response from James Wiles!

Remember - the goal is not to micro-manage the data. You aren't concerned with which partition a record is processed in, just that it is processed correctly, keeping together the records that must be kept together for data integrity purposes.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
greggknight
Premium Member
Posts: 120
Joined: Thu Oct 28, 2004 4:24 pm

Post by greggknight »

Optimizing parallelism
The degree of parallelism of a parallel job is determined by the number of nodes you define when you configure the parallel engine. Parallelism should be optimized for your hardware rather than simply maximized. Increasing parallelism distributes your workload, but it also adds to your overhead because the number of processes increases. Increased parallelism can actually hurt performance once you exceed the capacity of your hardware. Therefore you must weigh the gains of added parallelism against the potential losses in processing efficiency.

Obviously, the hardware that makes up your system influences the degree of parallelism you can establish.

SMP systems allow you to scale up the number of CPUs and to run your parallel application against more memory. In general, an SMP system can support multiple logical nodes. Some SMP systems allow scalability of disk I/O. "Configuration Options for an SMP" discusses these considerations.

In a cluster or MPP environment, you can use the multiple CPUs and their associated memory and disk resources in concert to tackle a single computing problem. In general, you have one logical node per CPU on an MPP system. "Configuration Options for an MPP System" describes these issues.

The properties of your system's hardware also determine the configuration. For example, applications with large memory requirements, such as sort operations, are best assigned to machines with a lot of memory. Applications that will access an RDBMS must run on its server nodes, and stages using other proprietary software, such as SAS, must run on nodes with licenses for that software.

Here are some additional factors that affect the optimal degree of parallelism:

- CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.
- Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
- Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes (see the configuration sketch after this list).
- For some jobs, especially those that are disk-intensive, you must sometimes configure your system to prevent the RDBMS from having either to redistribute load data or to re-partition the data from an extract operation.
- The speed of communication among stages should be optimized by your configuration. For example, jobs whose stages exchange large amounts of data should be assigned to nodes where stages communicate by either shared memory (in an SMP environment) or a high-speed link (in an MPP environment). The relative placement of jobs whose stages share small amounts of data is less important.
- For SMPs, you might want to leave some processors for the operating system, especially if your application has many stages in a job. See "Configuration Options for an SMP".
- In an MPP environment, parallelization can be expected to improve the performance of CPU-limited, memory-limited, or disk I/O-limited applications. See "Configuration Options for an MPP System".
- The most nearly equal partitioning of data contributes to the best overall performance of a job run in parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly populated. This is referred to as minimizing skew. Experience is the best teacher: start with smaller data sets and try different parallelizations while scaling up the data set sizes to collect performance statistics.
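
As a minimal sketch of the node pool idea mentioned in the spindle example above (the fastname, disk paths, and pool name "extract16" are invented for illustration, and only two of the sixteen nodes are shown):

{
    node "node1"
    {
        fastname "server1"
        pools "" "extract16"
        resource disk "/data/ds/d1" { pools "" }
        resource scratchdisk "/scratch/ds/s1" { pools "" }
    }
    node "node2"
    {
        fastname "server1"
        pools "" "extract16"
        resource disk "/data/ds/d2" { pools "" }
        resource scratchdisk "/scratch/ds/s2" { pools "" }
    }
}

node3 through node16 would follow the same pattern. A stage whose node pool constraint names "extract16" then runs only on the nodes that list that pool.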
"Don't let the bull between you and the fence"

Thanks
Gregg J Knight

"Never Never Never Quit"
Winston Churchill
srireddypunuru
Premium Member
Posts: 40
Joined: Thu Jul 10, 2008 12:45 pm

CPUs, Cores, and Nodes

Post by srireddypunuru »

Guys,

We have 2 CPUs, each with 6 cores, and 24 GB of RAM on a Windows box with DataStage 8.5. The CPUs are also hyperthreaded, so what is the best way to create a configuration file in this case?

Right now we have the configuration file below.
{
    node "node1"
    {
        fastname "ms979"
        pools ""
        resource disk "D:/IBM/InformationServer/Server/Datasets" { pools "" }
        resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" { pools "" }
    }
    node "node2"
    {
        fastname "ms979"
        pools ""
        resource disk "D:/IBM/InformationServer/Server/Datasets" { pools "" }
        resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" { pools "" }
    }
}
Can somebody advise on it?
Thanks
Srikanth Reddy
Integration Consultant