DataStage DataSet usage in a grid environment

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

DataStage DataSet usage in a grid environment

Post by pneumalin »

My apologies if I have double-posted this topic by accident; I cannot find my first post from a while ago.
Dear Friend,
I wonder if anyone can advise on the best practice for using DataSets in a grid environment. My initial question is: how do I create and maintain a persistent DataSet in a grid environment?
For instance, a DS job created a dataset across 10 nodes in yesterday's run, but for today's run only 8 nodes are available according to the Resource Manager's response, so a configuration file with 8 nodes is dynamically generated for today's run. How can today's run READ the DataSet that yesterday's run split across 10 nodes? I would appreciate comments from anyone who has encountered this scenario and found a good way to deal with it. Thanks in advance!
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

For IS 8.1 and above, set the environment variable LOADL_PROCESSOR_LIST=1.

For IS/DS <8.1, your config file needs to include read-only partitions (nodes that are not used as compute partitions, usually by removing the default node pool "" from the read-only nodes while leaving it in for the compute nodes), and the node names should match those in the config file that created the dataset.
This will also work for 8.1+ in lieu of using LOADL_PROCESSOR_LIST.
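To illustrate, a config file along these lines keeps an old dataset node readable without scheduling compute work on it. This is only a sketch: the node names, fastnames, and disk paths are made up, and the pool name "io" is an arbitrary label; the key point is that "node9" is absent from the default node pool "" but keeps the same node name used when the dataset was created.

```
{
  node "node1" {
    fastname "grid-compute-01"
    pools ""
    resource disk "/dsdata" {pools ""}
    resource scratchdisk "/scratch" {pools ""}
  }
  node "node9" {
    fastname "grid-old-09"
    pools "io"
    resource disk "/dsdata" {pools ""}
    resource scratchdisk "/scratch" {pools ""}
  }
}
```

Operators only run on nodes in the default pool "", so "node9" serves purely as a location the engine can read existing dataset segments from.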

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

First, let me clarify how the Resource Manager works in a grid environment. If your job's parameter (GRID_NODES) requests 10 nodes, the Resource Manager will not release your job unless 10 nodes are available. So it is not correct to say that today's job ran on 8 nodes just because only 8 nodes were available. If you need to read the datasets created by yesterday's run (10 nodes), today's job parameters must have exactly the same values as yesterday's run.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Some resource managers have the capability, and may be configured, to return fewer resources than requested if the requested resources are not available at submission time. In that situation, you will need to use one of the methods mentioned previously in order to process the full dataset.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

Post by pneumalin »

Many thanks for your comments, they are very informative! Let me rephrase my question as follows to clarify some of my doubts:
Suppose I set GRID_NODES=10 yesterday and created a dataset on nodes 1 through 10. For today's run I still set GRID_NODES=10, but the Resource Manager returns nodes 1 to 8, 11, and 12, since nodes 9 and 10 are not available. What would happen in this scenario? Would LOADL_PROCESSOR_LIST=1 be the best practice for DataSet management in a grid?

Since I have only just gone through the Deploying Grid Solutions redbook and have not yet dived into the DataStage grid implementation details, I would appreciate it if you could advise which documentation covers environment variable settings such as GRID_NODES and LOADL_PROCESSOR_LIST. Thanks again!
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Lin,
I was talking about using the resource manager software PBS Pro. The LOADL_PROCESSOR_LIST environment variable is for IBM's LoadLeveler. With PBS Pro, it simply wouldn't happen that a job whose APT_GRID_NODES parameter specifies 10 nodes runs with fewer nodes. PBS Pro is very strict about the job's parameters, such as nodes, partitions, queue, etc., when scheduling execution.

Assume your job's queue is authorized to run on all 12 nodes, the job ran successfully yesterday and created datasets on node1 through node10, but today node9 and node10 are down. When you submit the same job requesting 10 nodes, you will immediately see "waiting for resource manager" in the job's status if the job's input is trying to READ the datasets created yesterday on node1 through node10. However, if the job's input source is different, the job with a 10-node parameter will still run, because the resource manager knows that 10 of the 12 nodes are still available.

PBS Pro was recommended by my previous employer, NASA. It's very easy to use and powerful.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Setting LOADL_PROCESSOR_LIST=1 is recommended for situations where your Resource Manager is configured to return fewer nodes than requested when the resources are not available. In your corrected example, you have still been given 10 nodes (not 8); they are apparently just physically different nodes than those used in the previous day's run. As long as all of your nodes have access to the storage where the dataset data files are located, your job should run no matter which nodes are used.
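One quick way to see which nodes a dataset depends on is to look at its descriptor file, which lives on shared storage and references the per-node data files. This is only a sketch: the path, file contents, and node-name pattern below are stand-ins, not the real descriptor format, which is internal and varies by release.

```shell
# Stand-in descriptor: a real .ds descriptor records (among other things)
# the data-file locations per node; the format here is invented for the demo.
DS=/tmp/example.ds
printf 'node1:/dsdata/seg0.part\nnode2:/dsdata/seg1.part\n' > "$DS"

# List the node names referenced by the descriptor:
grep -o 'node[0-9]*' "$DS" | sort -u
```

If every name that turns up still has its storage reachable from the current run's nodes, the read should succeed.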

In most grid installations, the resource managers are configured to hold the submitted job until the requested resources are available. This is probably the default action for most if not all of the resource managers used with IS and fits with lstsaur's description of how a resource manager works with a grid.

The current redbook was written before IS 8.1 and therefore doesn't reference the LOADL_PROCESSOR_LIST variable. Support for it was added in IS 8.1.

If your grid was built using the IBM Services Grid Toolkit (usually done through a services Grid Workshop), information on grid-specific environment variables should have been provided in the documentation that came with it, and may be available from your system admins. The APT_GRID_* variables are specific to the Grid Toolkit and are not part of the IS product itself. If the grid was built without the Grid Toolkit, I can't say, as anything would be specific to the tools used to build that grid. The IS Developer Guides, DSXchange and other sites, coworkers, and experience are the best sources.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
nareshketepalli
Participant
Posts: 36
Joined: Mon Jun 28, 2010 11:24 pm
Location: seepz

Re: DataStage DataSet usage in a grid environment

Post by nareshketepalli »

Hi, can you tell me what a grid environment is?

Regards,
Naresh
NARESHKUMAR
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

LOADL_PROCESSOR_LIST is indeed a LoadLeveler environment variable, but as of version 8.1, IS recognizes its use specifically for this type of situation. You shouldn't have to run LoadLeveler to utilize the variable in this way, which is why I never mentioned LoadLeveler in my comments. Recent releases of the Grid Toolkit set it at job runtime.
- james wiles


All generalizations are false, including this one - Mark Twain.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: DataStage DataSet usage in a grid environment

Post by chulett »

nareshketepalli wrote:Hi, can you tell me what a grid environment is?
http://en.wikipedia.org/wiki/Grid_computing
-craig

"You can never have too many knives" -- Logan Nine Fingers
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

From a DataStage perspective, a Grid is a cluster in which processing nodes are dynamically allocated at submission, rather than using static configuration files.
- james wiles


All generalizations are false, including this one - Mark Twain.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

If you want a more consistent number of "nodes", you should cut down on APT_GRID_NODES and go with more partitions per node (APT_GRID_PARTITIONS). Your grid resource manager only sees your APT_GRID_NODES value as the number of hostnames to return from the grid; DataStage then takes that value and builds a dynamic APT configuration file for you. If you want your DataStage job parallelized 8 ways, request 4 nodes with 2 partitions per node, or 2 nodes with 4 partitions per node. It sounds like you are running 8 nodes with 1 partition per node.

Your GRM is returning a variable number of nodes, so your DataStage jobs are failing because of your dataset parallelism.

Try the 2x4 or 4x2 settings.
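The arithmetic behind that suggestion: the degree of parallelism is nodes times partitions per node, so several node/partition combinations produce the same dataset layout. The variable names come from the Grid Toolkit; the values are illustrative.

```shell
# Degree of parallelism = APT_GRID_NODES * APT_GRID_PARTITIONS,
# so 4x2 and 2x4 both give the same 8-way layout.
APT_GRID_NODES=4
APT_GRID_PARTITIONS=2
echo $(( APT_GRID_NODES * APT_GRID_PARTITIONS ))    # prints 8
```

Because the dataset only records logical partitions, any combination that multiplies out to the original count can read it back in full.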
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

It's the number of logical nodes (the degree of parallelism), rather than the number of physical nodes, that affects this. If you created a 4-partition dataset with a 1x4 job and then tried to read it with a 1x2 job, your job would read only two of the four partitions unless you use one of the methods I mentioned earlier.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

Post by pneumalin »

You guys are amazing! Thanks for all the tips, and I will post my findings as well once I get something going.
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

Post by pneumalin »

James,
Thanks for the clarification on LoadLeveler; we will use it as our Resource Manager in the grid. Even with all the great comments on this topic, I still want to test this myself and will post what I find later on. I am still not sure whether the DataSet's raw data files are created on the compute nodes, and how the DS engine can address them if those compute nodes become unavailable in the next run, since each raw data file name contains the node name itself. Maybe I should create DataSets on the front-end and back-end nodes only. I will let you know after I test it.