DataStage DataSet usage in a grid environment

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

DataStage DataSet usage in a grid environment

Post by pneumalin »

My apologies if I have double-posted this topic by accident; I cannot find my first post from a while ago.
Dear Friend,
I wonder if anyone can advise on the best practice for using DataSets in a grid environment. My initial question is: how do I create and maintain a persistent DataSet in a grid environment?
For instance, a DS job created a dataset across 10 nodes in yesterday's run, but for today's run only 8 nodes are available according to the Resource Manager's response, so a configuration file with 8 nodes is dynamically generated for today's run. How can today's run READ the DataSet that yesterday's run split across 10 nodes? I would appreciate comments from anyone who has encountered this scenario and found a good way to deal with it. Thanks in advance!
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

For IS 8.1 and above, set the environment variable LOADL_PROCESSOR_LIST=1.

For IS/DS <8.1, your config file needs to include read-only partitions (nodes that are not used as compute partitions, usually by removing the default node pool "" from the read-only nodes while leaving it in for the compute nodes), and the node names should match those in the config file that created the dataset.
This will also work for 8.1+ in lieu of using LOADL_PROCESSOR_LIST.
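To illustrate, a config file along these lines keeps an old dataset node readable without scheduling compute work on it. This is only a sketch: the node names, fastnames, and disk paths are made up, and the pool name "io" is an arbitrary label; the key point is that "node9" is absent from the default node pool "" but keeps the same node name used when the dataset was created.

```
{
  node "node1" {
    fastname "grid-compute-01"
    pools ""
    resource disk "/dsdata" {pools ""}
    resource scratchdisk "/scratch" {pools ""}
  }
  node "node9" {
    fastname "grid-old-09"
    pools "io"
    resource disk "/dsdata" {pools ""}
    resource scratchdisk "/scratch" {pools ""}
  }
}
```

Operators only run on nodes in the default pool "", so "node9" serves purely as a location the engine can read existing dataset segments from.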

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

First, let me clarify how the Resource Manager works in a grid environment. If your job's parameter (GRID_NODES) requests 10 nodes, the Resource Manager will not release your job unless 10 nodes are available. So it is not correct to say that today's job ran on 8 nodes just because only 8 nodes were available. If you need to read the datasets created by yesterday's run (10 nodes), today's job parameters must have exactly the same values as yesterday's run.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Some resource managers have the capability, and may be configured, to return fewer resources than requested if the requested resources are not available at submission time. In that situation, you will need to use one of the methods mentioned previously in order to process the full dataset.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

Post by pneumalin »

Many thanks for your comments, they are very informative! Let me rephrase my question as follows to clarify some of my doubts:
Suppose I set GRID_NODES=10 yesterday and created a dataset on nodes 1 through 10. For today's run I still set GRID_NODES=10, but the Resource Manager returns nodes 1 to 8, 11, and 12, since nodes 9 and 10 are not available. What would happen in this scenario? Would LOADL_PROCESSOR_LIST=1 be the best practice for DataSet management in a grid?

Since I have only just gone through the Deploying Grid Solutions redbook and have not yet dived into the DataStage grid implementation details, I would appreciate it if you could advise which documentation covers environment variable settings such as GRID_NODES and LOADL_PROCESSOR_LIST. Thanks again!
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Lin,
I was talking about using the resource manager software PBS Pro. The LOADL_PROCESSOR_LIST environment variable is for IBM's LoadLeveler. With PBS Pro, it simply wouldn't happen that a job whose APT_GRID_NODES parameter specifies 10 nodes runs with fewer nodes. PBS Pro is very strict about the job's parameters, such as nodes, partitions, queue, etc., when scheduling execution.

Assume your job's queue is authorized to run on all 12 nodes, the job ran successfully yesterday and created datasets on node1 through node10, but today node9 and node10 are down. When you submit the same job requesting 10 nodes, you will immediately see "waiting for resource manager" in the job's status if the job's input is trying to READ the datasets created yesterday on node1 through node10. However, if the job's input source is different, the job with a 10-node parameter will still run, because the resource manager knows that 10 of the 12 nodes are still available.

PBS Pro was recommended by my previous employer, NASA. It's very easy to use and powerful.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Setting LOADL_PROCESSOR_LIST=1 is recommended for situations where your Resource Manager is configured to return fewer nodes than requested when the resources are not available. In your corrected example, you have still been given 10 nodes (not 8); they are apparently just physically different nodes than those used in the previous day's run. As long as all of your nodes have access to the storage where the dataset data files are located, your job should run no matter which nodes are used.
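One quick way to see which nodes a dataset depends on is to look at its descriptor file, which lives on shared storage and references the per-node data files. This is only a sketch: the path, file contents, and node-name pattern below are stand-ins, not the real descriptor format, which is internal and varies by release.

```shell
# Stand-in descriptor: a real .ds descriptor records (among other things)
# the data-file locations per node; the format here is invented for the demo.
DS=/tmp/example.ds
printf 'node1:/dsdata/seg0.part\nnode2:/dsdata/seg1.part\n' > "$DS"

# List the node names referenced by the descriptor:
grep -o 'node[0-9]*' "$DS" | sort -u
```

If every name that turns up still has its storage reachable from the current run's nodes, the read should succeed.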

In most grid installations, the resource managers are configured to hold the submitted job until the requested resources are available. This is probably the default action for most if not all of the resource managers used with IS and fits with lstsaur's description of how a resource manager works with a grid.

The current redbook was written before IS 8.1 and therefore doesn't reference the LOADL_PROCESSOR_LIST variable. Support for it was added in IS 8.1.

If your grid was built using the IBM Services Grid Toolkit (usually done through a services Grid Workshop), information on grid-specific environment variables should have been provided in the documentation that came with it, and may be available from your system admins. The APT_GRID_* variables are specific to the Grid Toolkit and are not part of the IS product itself. If the grid was built without the Grid Toolkit, I can't say, as anything would be specific to the tools used to build that grid. The IS Developer Guides, DSXchange and other sites, coworkers, and experience are the best sources.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
nareshketepalli
Participant
Posts: 36
Joined: Mon Jun 28, 2010 11:24 pm
Location: seepz

Re: DataStage DataSet usage in a grid environment

Post by nareshketepalli »

Hi, can you tell me what a grid environment is?

Regards,
Naresh
NARESHKUMAR
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

LOADL_PROCESSOR_LIST is indeed a LoadLeveler environment variable, but as of version 8.1, IS recognizes its use specifically for this type of situation. You shouldn't have to run LoadLeveler to utilize the variable in this way, which is why I never mentioned LoadLeveler in my comments. Recent releases of the Grid Toolkit set it at job runtime.
- james wiles


All generalizations are false, including this one - Mark Twain.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: DataStage DataSet usage in a grid environment

Post by chulett »

nareshketepalli wrote:Hi, can you tell me what a grid environment is?
http://en.wikipedia.org/wiki/Grid_computing
-craig

"You can never have too many knives" -- Logan Nine Fingers
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

From a DataStage perspective, a Grid is a cluster in which processing nodes are dynamically allocated at submission, rather than using static configuration files.
- james wiles


All generalizations are false, including this one - Mark Twain.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

If you want a more consistent number of "nodes", you should cut down on APT_GRID_NODES and go with more partitions per node (APT_GRID_PARTITIONS). Your grid resource manager only sees your APT_GRID_NODES value as the number of hostnames to return from the grid; DataStage then takes that value and builds a dynamic APT configuration file for you. If you want your DataStage job parallelized 8 ways, request 4 nodes with 2 partitions per node, or 2 nodes with 4 partitions per node. It sounds like you are running 8 nodes with 1 partition per node.

Your GRM is returning a variable number of nodes, so your DataStage jobs are failing because of your dataset parallelism.

Try the 2x4 or 4x2 settings.
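The arithmetic behind that suggestion: the degree of parallelism is nodes times partitions per node, so several node/partition combinations produce the same dataset layout. The variable names come from the Grid Toolkit; the values are illustrative.

```shell
# Degree of parallelism = APT_GRID_NODES * APT_GRID_PARTITIONS,
# so 4x2 and 2x4 both give the same 8-way layout.
APT_GRID_NODES=4
APT_GRID_PARTITIONS=2
echo $(( APT_GRID_NODES * APT_GRID_PARTITIONS ))    # prints 8
```

Because the dataset only records logical partitions, any combination that multiplies out to the original count can read it back in full.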
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

It's the number of logical nodes (the degree of parallelism), rather than the number of physical nodes, that affects this. If you created a 4-partition dataset with a 1x4 job and then tried to read it with a 1x2 job, your job would read only two of the four partitions unless you use one of the methods I mentioned earlier.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

Post by pneumalin »

You guys are amazing! Thanks for all the tips, and I will post my findings as well once I get something going.
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

Post by pneumalin »

James,
Thanks for the clarification on LoadLeveler; we will use it as our Resource Manager in the grid. Even with all the great comments on this topic, I still want to test this myself and will post what I find later on. I am still not sure whether the DataSet's raw data files are created on the compute nodes, and how the DS engine can address them if those compute nodes become unavailable in the next run, since each raw data file name contains the node name itself. Maybe I should create DataSets on the front-end and back-end nodes only. I will let you know after I test it.