node constraints
Hi,
A default configuration file on a Windows PC with one CPU usually has one node. Can we split that into two nodes with a file like the one below? If I have a key column deptno with the values below, will hash partitioning send the dept 10 rows to one node and the dept 20 rows to the other?
deptno,sal
10,1000
20,300
10,300
20,300
{
node "node1"
{
fastname "WINXP"
pools ""
resource disk "C:/Ascential/DataStage/Datasets" { pools "" }
resource scratchdisk "C:/Ascential/DataStage/Scratch" { pools "" }
}
node "node2"
{
fastname "WINXP"
pools ""
resource disk "C:/Ascential/DataStage/Datasets" { pools "" }
resource scratchdisk "C:/Ascential/DataStage/Scratch" { pools "" }
}
}
There is no inherent limitation to the number of logical nodes (node entry) you can specify on a single server (fastname entry) within a configuration file. CPU/core allocation for job processes is completely up to the operating system...the parallel engine has no control of that.
There is no absolute guarantee that hash partitioning will split the deptno values as you describe as the results depend upon the (unpublished) hash algorithm used in the engine.
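A minimal sketch of that behavior in Python, using the built-in hash() purely as a stand-in for the engine's unpublished algorithm (the row data is from the example above):

```python
# Sketch: hash partitioning guarantees that all rows with the same key
# land in the SAME partition, but which partition a given key lands in
# is an implementation detail of the hash algorithm.

NUM_PARTITIONS = 2

rows = [(10, 1000), (20, 300), (10, 300), (20, 300)]  # (deptno, sal)

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for deptno, sal in rows:
    p = hash(deptno) % NUM_PARTITIONS   # deterministic for a given key
    partitions[p].append((deptno, sal))

for p, recs in partitions.items():
    print(f"partition {p}: {recs}")

# All dept 10 rows land together and all dept 20 rows land together,
# but dept 10 and dept 20 may or may not share a partition.
```

Running this shows each deptno confined to a single partition; whether 10 and 20 end up together depends entirely on the hash function, which is exactly the point being made above.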
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
@vidyut:
A single CPU can host multiple logical nodes.
@ravi:
Yes, you can split into two nodes by defining the config file as you described, but since there is only one CPU it won't improve the performance of your job.
The data would be split across the two nodes as you mentioned: dept 10 on node1 and dept 20 on node2, or vice versa, depending on the order of the input file.
Also, regarding your post title "node constraints": node constraints give you the flexibility to run the job only on a particular node that you specify as an option.
Thank you all for replying.
I have my default config file as above with two nodes, and I did not select any node constraint option in my job (meaning it runs on all available nodes).
I assigned the system variable @PARTITIONNUM to an output column, but I only see partition '1' for both the dept 10 and dept 20 values. I assumed the dept 10 values would be on partition 1 and the dept 20 values on partition 2. Please suggest.
As Craig states, proper partitioning ensures that all records with the same partition key value will end up in the SAME partition. You have no real control over WHICH partition they land in....for all you know, the values 10 and 20 may end up in the same partition, or 10 in partition 2 and 20 in partition 1. Several factors affect this, including datatypes, number of partitions, number of partition keys, partitioning type.
Your 10s and 20s will probably be split into different partitions. The best way for you to verify this is to run a test job to see which partition your data ends up in...a simple peek includes the partition number in the log (as does any stage).
You choose your partitioning keys so that the data is as evenly distributed as possible while meeting the requirements of the logic processing that data. Sometimes, your data may not be as evenly distributed as you would like, but the logic overrules...you want your results to be correct.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
Once again, just to make this clear, you have no guarantee that 10 and 20 will be placed in DIFFERENT partitions, just that all of the records for 10 will be placed into the SAME partition and that all of the records for 20 will be placed into the SAME partition (assuming that deptno is the only partition key). 10's and 20's may or may not be placed in the same partition.
Part of your situation is that your example has very few distinct key values. Hash partitioning is most effective when you have a large distribution of distinct values. You may want to try a different partitioning method, such as modulus if you have a single integer key column. It may provide better distribution for your example.
For more information on partitioning, read the Partitioning, repartitioning and collecting data section of Chapter 2 of the Parallel Job Developer Guide.
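A quick sketch of modulus partitioning (partition number = key value mod number of partitions), which, unlike hash, is fully predictable from an integer key. Note, though, that with exactly the values 10 and 20 on two nodes, both keys still map to partition 0:

```python
# Sketch of modulus partitioning: partition = key % number_of_partitions.
# Unlike hash partitioning, the destination partition is predictable
# from the key value alone.

NUM_PARTITIONS = 2

for deptno in (10, 20, 11, 21):
    print(f"deptno {deptno} -> partition {deptno % NUM_PARTITIONS}")

# 10 % 2 == 0 and 20 % 2 == 0, so dept 10 and dept 20 still share
# partition 0 here; odd key values (11, 21) would go to partition 1.
```

So for the two even-valued keys in this thread's example, modulus gives the same skew as the worst hash case; it only helps when the key values themselves are spread across residues.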
- james wiles
All generalizations are false, including this one - Mark Twain.
Optimizing parallelism
The degree of parallelism of a parallel job is determined by the number of nodes you define when you configure the parallel engine. Parallelism should be optimized for your hardware rather than simply maximized. Increasing parallelism distributes your work load but it also adds to your overhead because the number of processes increases. Increased parallelism can actually hurt performance once you exceed the capacity of your hardware. Therefore you must weigh the gains of added parallelism against the potential losses in processing efficiency.
Obviously, the hardware that makes up your system influences the degree of parallelism you can establish.
SMP systems allow you to scale up the number of CPUs and to run your parallel application against more memory. In general, an SMP system can support multiple logical nodes. Some SMP systems allow scalability of disk I/O. "Configuration Options for an SMP" discusses these considerations.
In a cluster or MPP environment, you can use the multiple CPUs and their associated memory and disk resources in concert to tackle a single computing problem. In general, you have one logical node per CPU on an MPP system. "Configuration Options for an MPP System" describes these issues.
The properties of your system's hardware also determine configuration. For example, applications with large memory requirements, such as sort operations, are best assigned to machines with a lot of memory. Applications that access an RDBMS must run on its server nodes, and stages using other proprietary software, such as SAS, must run on nodes with licenses for that software.
Here are some additional factors that affect the optimal degree of parallelism:
CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.
Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.
For some jobs, especially those that are disk-intensive, you must sometimes configure your system to prevent the RDBMS from having either to redistribute load data or to re-partition the data from an extract operation.
The speed of communication among stages should be optimized by your configuration. For example, jobs whose stages exchange large amounts of data should be assigned to nodes where stages communicate by either shared memory (in an SMP environment) or a high-speed link (in an MPP environment). The relative placement of jobs whose stages share small amounts of data is less important.
For SMPs, you might want to leave some processors for the operating system, especially if your application has many stages in a job. See "Configuration Options for an SMP" .
In an MPP environment, parallelization can be expected to improve the performance of CPU-limited, memory-limited, or disk I/O-limited applications. See "Configuration Options for an MPP System" .
The most nearly-equal partitioning of data contributes to the best overall performance of a job run in parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly populated. This is referred to as minimizing skew. Experience is the best teacher. Start with smaller data sets and try different parallelizations while scaling up the data set sizes to collect performance statistics.
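As an illustration of the spindle example above, a configuration file can place the I/O nodes in a shared named pool. This is only a sketch in the same style as the config files in this thread: the hostname, paths, and the pool name "io16" are placeholders, and only two of the sixteen nodes are shown.

```
{
node "node1"
{
fastname "server1"
pools "" "io16"
resource disk "/data/disk01" { pools "" }
resource scratchdisk "/scratch01" { pools "" }
}
node "node2"
{
fastname "server1"
pools "" "io16"
resource disk "/data/disk02" { pools "" }
resource scratchdisk "/scratch02" { pools "" }
}
}
```

Stages constrained to the "io16" pool would then run one player per disk, matching the number of logical nodes to the number of spindles as described above.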
"Don't let the bull between you and the fence"
Thanks
Gregg J Knight
"Never Never Never Quit"
Winston Churchill
CPU and its Core and Node
Guys,
We have 2 CPUs, each with 6 cores, and 24 GB of RAM on a Windows box with DataStage 8.5. The CPUs are also hyperthreaded, so what is the best way to create a config file in this case?
Right now we have the below Config File.
{
node "node1"
{
fastname "ms979"
pools ""
resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
}
node "node2"
{
fastname "ms979"
pools ""
resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
}
}
Can somebody advise on it?
Thanks
Srikanth Reddy
Integration Consultant