How can you divide memory among partitions?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

How can you divide memory among partitions?

Post by splayer »

Came across this in the Advanced Developer's Guide:

"Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions"

I know that data can be partitioned by using the partitioning tab but how can you divide the required memory among partitions?

Thanks to anyone who responds.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Configuration File design

Post by avi21st »

Hi

These are some pointers: you need to design the configuration file so that the data is divided into partitions (depending on the logic) and the job can exploit parallelism. You also need to design and allocate appropriate space for the resource disk and scratch disk in your configuration file.

The configuration file tells DataStage Enterprise Edition how to exploit the underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources such as databases and buffer storage. At runtime, a job first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources. The configuration file to use is specified through the environment variable $APT_CONFIG_FILE.
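
As a small, hypothetical illustration (the path and file name below are made up), the variable can simply be exported in the environment the job runs in, or overridden as a job parameter:

Code:

    export APT_CONFIG_FILE=/home/dsadm/configs/4node.apt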

The Enterprise Edition runs on systems that meet the following requirements:
- 200 MB of free disk space for product installation
- 256 MB or more of memory per processing node, depending on the application
- At least 500 MB of scratch disk space per processing node


Within a configuration file, the number of processing nodes defines the degree of parallelism and the resources that a particular job will use to run. It is up to the UNIX operating system to actually schedule and run the processes that make up a DataStage job across the physical processors. A configuration file with a larger number of nodes generates a larger number of processes, which use more memory (and perhaps more disk activity) than a configuration file with a smaller number of nodes.

While the DataStage documentation suggests creating half as many nodes as there are physical CPUs, this is a conservative starting point that is highly dependent on system configuration, resource availability, job design, and other applications sharing the server hardware. For example, if a job is highly I/O-dependent or dependent on external (e.g., database) sources or targets, it may be appropriate to have more nodes than physical CPUs.

For typical production environments, a good starting point is to set the number of nodes equal to the number of CPUs. For development environments, which are typically smaller and more resource-constrained, create smaller configuration files (e.g., 2-4 nodes).
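
As a hedged sketch only (the host name, disk paths, and node names below are placeholders, not recommendations), a 4-node configuration file for a single SMP server might look like this, with the order of the resource disks shifted for each node in an attempt to minimize I/O contention:

Code:

    {
        node "node1"
        {
            fastname "etl_server"
            pools ""
            resource disk "/data/disk1" {pools ""}
            resource disk "/data/disk2" {pools ""}
            resource disk "/data/disk3" {pools ""}
            resource disk "/data/disk4" {pools ""}
            resource scratchdisk "/scratch/temp1" {pools ""}
        }
        node "node2"
        {
            fastname "etl_server"
            pools ""
            resource disk "/data/disk2" {pools ""}
            resource disk "/data/disk3" {pools ""}
            resource disk "/data/disk4" {pools ""}
            resource disk "/data/disk1" {pools ""}
            resource scratchdisk "/scratch/temp2" {pools ""}
        }
        node "node3"
        {
            fastname "etl_server"
            pools ""
            resource disk "/data/disk3" {pools ""}
            resource disk "/data/disk4" {pools ""}
            resource disk "/data/disk1" {pools ""}
            resource disk "/data/disk2" {pools ""}
            resource scratchdisk "/scratch/temp3" {pools ""}
        }
        node "node4"
        {
            fastname "etl_server"
            pools ""
            resource disk "/data/disk4" {pools ""}
            resource disk "/data/disk1" {pools ""}
            resource disk "/data/disk2" {pools ""}
            resource disk "/data/disk3" {pools ""}
            resource scratchdisk "/scratch/temp4" {pools ""}
        }
    }

Point $APT_CONFIG_FILE at a file like this and the job will spread its processes, datasets, and scratch space across the resources it lists.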
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

Thank you Avishek for your detailed response.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

"In the 4-node example above, the order of the disks is purposely shifted for each node, in an attempt to minimize I/O contention"
Hi Avishek,
Do you mean that DS prefers to direct intensive I/O operations to the first disk specified in the list?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'