Is there a problem with just using the default APT_CONFIG file?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

whenry6000
Premium Member
Posts: 129
Joined: Thu Mar 02, 2006 8:28 am

Is there a problem with just using the default APT_CONFIG file?

Post by whenry6000 »

All,
I am using DataStage parallel edition. It was recently installed and we are still using the default configuration file. I was under the impression that the default configuration file is sufficient, though it may not process data efficiently, whereas others are saying it will lead to problems if we don't update it.
If we are in a multi-processor environment, is there any problem with using the default single-node configuration file?
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Post by fridge »

One issue you need to be aware of is that the default config file created at installation 'points' to the engine directory for storing datasets and scratch files. This is not a good idea, as the installation directory needs to be protected from filling up. As a minimum, the config should be changed to point to true 'application' directories.
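For example, in each node stanza of the config file you would point the resource lines at dedicated application directories, something like the following (the paths here are just placeholders for your own directories), instead of at paths under the engine installation:

    resource disk "/appdata/dstage/datasets" { pools "" }
    resource scratchdisk "/appdata/dstage/scratch" { pools "" }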
Maveric
Participant
Posts: 388
Joined: Tue Mar 13, 2007 1:28 am

Post by Maveric »

Using a one-node config file in a parallel job will make the job run in sequential mode (on one node), so the main advantage of parallelism in PX is lost.
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: Is there a problem with just using the default APT_CONFIG file?

Post by kwwilliams »

You should always develop on at least a two-node configuration file. The power of DataStage is that you can give your jobs more or less horsepower just by changing the configuration file. However, there are some design considerations that must be made when creating a job to run on more than one node, namely partitioning.

In single-node mode all of your jobs are essentially running sequentially. On two or more nodes, the job will break the data up into partitions as defined within the job design and send different records to different nodes; if you are going to join, aggregate or sort, you need like data on the same node. Without developing the jobs on multiple nodes, I guarantee that you will not get your partitioning set up correctly to allow use of multiple nodes.

Take an Aggregator stage. Say you have the data A, B, C, A, B, B, A, D, and your Aggregator is going to perform a count. On a single node the result will look like A-3, B-3, C-1, D-1. On two nodes using the default partitioning you would get A-2, A-1, B-2, B-1, C-1, D-1, because the data on one partition is not aware of the data on another partition. This is a very simple example of the problem you would run into. And even if you set up the partitioning correctly, you would not be testing or exercising it when running on a single node, so you would not know whether you truly had it set up correctly.
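To make that split concrete (assuming round robin deals the records out to the nodes alternately, and that hash partitioning is keyed on the column being counted; which nodes C and D land on depends on the hash):

Round robin on 2 nodes:
    node0 gets A, C, B, A -> counts A-2, B-1, C-1
    node1 gets B, A, B, D -> counts A-1, B-2, D-1

Hash on the key column:
    node0 gets A, A, A, C -> counts A-3, C-1
    node1 gets B, B, B, D -> counts B-3, D-1

With hash partitioning, every record with the same key lands on the same node, so the combined counts match the single-node results.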
whenry6000
Premium Member
Posts: 129
Joined: Thu Mar 02, 2006 8:28 am

Re: Is there a problem with just using the default APT_CONFIG file?

Post by whenry6000 »

Thank you all for the responses. I realize that you lose power and the ability to control the flow of your ETL, but if I leave the partitioning set to 'Auto', do I still have to configure the APT_CONFIG file?
kwwilliams wrote:You should always develop on at least a two-node configuration file. The power of DataStage is that you can give your jobs more or less horsepower just by changing the configuration file. However, there are some design considerations that must be made when creating a job to run on more than one node, namely partitioning.

In single-node mode all of your jobs are essentially running sequentially. On two or more nodes, the job will break the data up into partitions as defined within the job design and send different records to different nodes; if you are going to join, aggregate or sort, you need like data on the same node. Without developing the jobs on multiple nodes, I guarantee that you will not get your partitioning set up correctly to allow use of multiple nodes. Take an Aggregator stage. Say you have the data A, B, C, A, B, B, A, D, and your Aggregator is going to perform a count. On a single node the result will look like A-3, B-3, C-1, D-1. On two nodes using the default partitioning you would get A-2, A-1, B-2, B-1, C-1, D-1, because the data on one partition is not aware of the data on another partition. This is a very simple example of the problem you would run into. And even if you set up the partitioning correctly, you would not be testing or exercising it when running on a single node, so you would not know whether you truly had it set up correctly.
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: Is there a problem with just using the default APT_CONFIG file?

Post by kwwilliams »

Understand that 'Auto' is a choice; do you understand what Auto will do? For many stages it is going to pick round robin, which is the scenario I played out for you in my example. If you ever move to more than one node, your jobs are going to have to be re-examined to ensure the quality of the data. Without being able to predict the future needs of the job or the company, it is a better idea to start with two nodes rather than one. In doing code reviews and mentoring other developers, the number one issue I see is poor data quality due to bad data partitioning.

Understand that Auto is a choice and make sure you understand the implications of that choice. You are likely locking your jobs into always running on a single node. I would not be comfortable with that, but it is ultimately up to you and your company to make that choice.
whenry6000
Premium Member
Posts: 129
Joined: Thu Mar 02, 2006 8:28 am

Re: Is there a problem with just using the default APT_CONFIG file?

Post by whenry6000 »

That makes sense. So, a secondary question: what information do I need from the UNIX system administrator to configure this properly? I am not familiar with setting up a configuration file and need a clearer idea of what the various pieces mean.

Is there any documentation, or other posts, outside of the DataStage documentation that comes with the install?

Thanks!
kwwilliams wrote:Understand that 'Auto' is a choice; do you understand what Auto will do? For many stages it is going to pick round robin, which is the scenario I played out for you in my example. If you ever move to more than one node, your jobs are going to have to be re-examined to ensure the quality of the data. Without being able to predict the future needs of the job or the company, it is a better idea to start with two nodes rather than one. In doing code reviews and mentoring other developers, the number one issue I see is poor data quality due to bad data partitioning.

Understand that Auto is a choice and make sure you understand the implications of that choice. You are likely locking your jobs into always running on a single node. I would not be comfortable with that, but it is ultimately up to you and your company to make that choice.
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: Is there a problem with just using the default APT_CONFIG file?

Post by kwwilliams »

There is some documentation floating around; I can't seem to find it at the moment. I prefer the old Orchestrate manuals for delving into the details of how this thing works. I have them stored off somewhere, and I can't remember the original source. You can also read the help file from Manager for some explanations.

I can continue to answer questions, but I feel like I should ask one first. Has your company purchased consulting to help with setup? There are a lot of mistakes you can make in setting up your environment right out of the gate, and there are decisions about how our environment is set up that I made several years ago and wish I could take back.

You need to understand the following:

Where are you going to store your datasets (you need at least two locations, and they should be on different disks)?
What is your scratch location (again, you need at least two, and they should be on different disks)?
If you have input files, make sure that the dataset (resource disk) and scratch locations are not on the same disks as those files. Otherwise you will get disk contention, and slower jobs because of it.

{
    node "node0" {
        fastname "Insert result of unix uname -n"
        pools ""
        resource disk "first dataset location" { pools "" }
        resource disk "second dataset location" { pools "" }
        resource scratchdisk "first scratch location" { pools "" }
        resource scratchdisk "second scratch location" { pools "" }
    }
    node "node1" {
        fastname "Insert result of unix uname -n"
        pools ""
        resource disk "first dataset location" { pools "" }
        resource disk "second dataset location" { pools "" }
        resource scratchdisk "first scratch location" { pools "" }
        resource scratchdisk "second scratch location" { pools "" }
    }
}

They can be much more elaborate and carefully planned than that, which is why I would strongly suggest some help with your initial start-up. That is a pretty generic config and may not be appropriate for your situation.
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai

Re: Is there a problem with just using the default APT_CONFIG file?

Post by Minhajuddin »

whenry6000 wrote:All,
I am using DataStage parallel edition. It was recently installed and we are still using the default configuration file. I was under the impression that the default configuration file is sufficient, though it may not process data efficiently, whereas others are saying it will lead to problems if we don't update it.
If we are in a multi-processor environment, is there any problem with using the default single-node configuration file?
Not changing the configuration file won't create problems!

But, again, you wouldn't be using all your resources efficiently.

A typical configuration file has information like the fastname, node pools, resource disks (where datasets land) and scratch disks. If your server is not a cluster of computers, you can just copy the node1 stanza in the config file and paste it again and again, changing the name to node2, node3 and so on, then save the file under a new name.

To use it, you have to point the $APT_CONFIG_FILE environment variable at the path of this config file.
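For example, from the shell before running a job (the path here is just a placeholder):

    export APT_CONFIG_FILE=/appdata/dstage/configs/two_node.apt

You can also set $APT_CONFIG_FILE as a project-level environment variable in the Administrator, or expose it as a job parameter so individual jobs can pick their own config file.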
Minhajuddin

asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

FYI - the configuration topic takes more than half a day to cover in the Server-to-PX sessions the DSXChange has been hosting. This can be a moderately involved topic depending on your setup. I highly recommend you read the documentation in the Admin guide, as it has quite a bit of info.

With that said, a quick guide to what you can start with:

1) Configure one node for every two CPUs on your system, with a minimum of two nodes total. For example, if you have 8 CPUs on your system, configure it with four nodes.

2) Configure at least one large scratch area using the "resource scratchdisk" directive. All nodes can share it if they have to, though it is even better if each node has its own scratch area.

3) Configure the resource disks for your PX datasets using the "resource disk" directive. Again, each node can share the same resource disk (and they usually do in development), but in production each node should have a separate area. This is very important to ensure your PX jobs don't bottleneck on I/O as they write to datasets.

There's lots more that can be done, especially if you are using a partitioned database for storage/retrieval. However, doing the first three items, sketched below, should get you started.
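As a rough sketch of those three items on an 8-CPU SMP machine (the hostname and all paths are placeholders; the nodes share one scratch area here, which is fine to start with):

{
    node "node0" {
        fastname "myhost"
        pools ""
        resource disk "/data1/dstage/datasets" { pools "" }
        resource scratchdisk "/scratch/dstage" { pools "" }
    }
    node "node1" {
        fastname "myhost"
        pools ""
        resource disk "/data2/dstage/datasets" { pools "" }
        resource scratchdisk "/scratch/dstage" { pools "" }
    }
    node "node2" {
        fastname "myhost"
        pools ""
        resource disk "/data3/dstage/datasets" { pools "" }
        resource scratchdisk "/scratch/dstage" { pools "" }
    }
    node "node3" {
        fastname "myhost"
        pools ""
        resource disk "/data4/dstage/datasets" { pools "" }
        resource scratchdisk "/scratch/dstage" { pools "" }
    }
}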

Two words of advice:

1) Don't experiment with the default config file! Mess that up and all your jobs stop running! Save your config changes under another name and test them by setting the $APT_CONFIG_FILE environment variable in a test job.

2) Remember to "check" your changes once you've saved them using the configuration editor in the Manager.

Good luck!
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
whenry6000
Premium Member
Posts: 129
Joined: Thu Mar 02, 2006 8:28 am

Re: Is there a problem with just using the default APT_CONFIG file?

Post by whenry6000 »

Unfortunately, no, we haven't.
kwwilliams wrote: I can continue to answer questions, but I feel like I should ask one first. Has your company purchased consulting to help with setup?
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: Is there a problem with just using the default APT_CONFIG file?

Post by kwwilliams »

Sounds like that decision is not up to you. Sorry to hear that. Are they springing for training, or telling you it is on-the-job training? Not to be a commercial, but we had Ray come onsite here to provide training and it was very beneficial (better than the official IBM training).
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

One thing I always stress when it comes to PX configuration files is that you should always develop and test jobs with at least a 2-node configuration file. Any job that runs correctly on a 2-node configuration will scale to n nodes or even down to a 1-node configuration. If you use a 1-node configuration file in development, or as the default configuration file, you have a fairly good chance of seeing jobs function differently once they run on a multi-node configuration.

Many jobs will perform best with a 1-node configuration, but they should still be designed correctly so that they are scalable.