Migration of DS code to Grid

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sunitha_cts
Participant
Posts: 98
Joined: Thu Feb 05, 2009 1:14 am
Location: visakhapatnam

Migration of DS code to Grid

Post by sunitha_cts »

Hi All,

We are planning to migrate our DS code to a grid environment.
Please let me know what changes need to be made to our DS jobs when we migrate to the grid.

Thanks
Sunitha
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Not much.

Ensure that all of the DBMS client tools are installed on each compute node: Oracle client, Teradata client, Informix, etc.

Have an NFS mount for your data and for the tool install, since the binaries need to be accessible from each compute node.
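For example, the shared install and data directories might be exported from an NFS server and mounted identically on every compute node; the server name and paths below are only placeholders:

# /etc/fstab on each compute node (server name and paths are illustrative)
nfshost:/export/IBM/InformationServer  /opt/IBM/InformationServer  nfs  rw,hard,intr  0 0
nfshost:/export/dsdata                 /dsdata                     nfs  rw,hard,intr  0 0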

Don't forget SSH keys for each compute node.
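A minimal sketch of pushing a key out so the head node can reach every compute node without a password, assuming a dsadm account and a nodes.txt file listing compute node hostnames (both hypothetical):

#!/bin/bash
# Generate a key pair if one doesn't exist, then copy the public key
# to each compute node listed in nodes.txt (one hostname per line).
[ -f ~/.ssh/id_rsa.pub ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
while read -r node; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "dsadm@${node}"
done < nodes.txt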

Disallow server jobs, since those can only execute on the Head Node and can't be farmed out to the compute nodes. Sequencers are technically server jobs but should be exempt from that rule.

Get the GRID Enablement Toolkit.

I'd go with Platform LSF (now an IBM product) as your grid resource manager, since IBM just bought Platform Computing and will be phasing out LoadLeveler.

Many smaller servers are better than a few big ones for GRID.

Try to minimize the activity on the Head Node (conductor) as much as you can and farm it off onto the GRID.

Execute Command activities in Sequencers are notorious for putting work on the head node. Try to have a rule up front that tells folks to submit that work to the grid via a command-line call (a sketch follows below), or execute the desired scripts via an External Source stage, which will technically farm the work off to the grid. Force that stage to execute on only one node and you will achieve the desired effect.
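If LSF is the resource manager, the command-line call could look something like this; the queue name, script path, and log location are made up:

# Submit the script to an LSF queue instead of running it on the head node.
# -K makes bsub wait for completion so the sequencer still gets an exit status.
bsub -K -q ds_batch -o /dsdata/logs/cleanup.%J.out /dsdata/scripts/cleanup.sh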

WISD jobs... not so grid friendly if you ask me. Think twice about farming those off, since they are persistent jobs.

Write some quick tools up front to validate:

tnsnames.ora
sqlhosts
/etc/services
ssh keys
etc.

to ensure that they are consistent across all compute nodes (a rough sketch of such a checker follows).
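Here is one way to do it, assuming passwordless ssh and the same hypothetical nodes.txt list of compute nodes; the file locations are just examples:

#!/bin/bash
# Compare checksums of key client config files on each compute node
# against the copies on the head node and report any mismatches.
FILES="/etc/services $ORACLE_HOME/network/admin/tnsnames.ora $INFORMIXDIR/etc/sqlhosts"
while read -r node; do
    for f in $FILES; do
        head_sum=$(cksum "$f" | awk '{print $1}')
        node_sum=$(ssh "$node" "cksum $f" 2>/dev/null | awk '{print $1}')
        [ "$head_sum" = "$node_sum" ] || echo "MISMATCH on $node: $f"
    done
done < nodes.txt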

Write a cron job that will purge the ever-growing list of grid_job_dir APT files that pile up. It should be added to your &PH& file cleanup process too.
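A minimal example of such a cleanup; the directory, retention period, and schedule are all illustrative:

# crontab entry: at 02:00 every day, remove dynamically generated grid/APT
# files older than 7 days (directory and retention are examples only).
0 2 * * * find /dsdata/grid_jobdir -type f -mtime +7 -exec rm -f {} \; >> /dsdata/logs/grid_purge.log 2>&1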

Train your admins to go out to the compute nodes from time to time and purge orphaned osh processes.
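A quick reporting sketch for that, again assuming passwordless ssh and a hypothetical nodes.txt list:

#!/bin/bash
# List osh processes still running on each compute node so an admin can
# decide which ones are orphans left over from failed runs.
while read -r node; do
    echo "== $node =="
    ssh "$node" "ps -ef | grep '[o]sh' || echo 'none found'"
done < nodes.txt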
sunitha_cts
Participant
Posts: 98
Joined: Thu Feb 05, 2009 1:14 am
Location: visakhapatnam

Post by sunitha_cts »

Thanks Paul,

What are the impacts on DataStage jobs if we move to Grid?

Thanks
Sunitha
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Just a different server to execute the same job on.

The choice of which server to run on within the GRID now allows you to load balance across a bunch of servers. You can submit resource constraints (choices) to specify which servers to use.

For instance, I have 8.1 and 8.5 servers sitting in the same GRID, but when I execute an 8.1 job I ask for an 8.1 server (because the binaries are exposed to that grid compute node). I can also ask that the server have X amount of RAM, Y amount of disk space, Z amount of CPU, etc...
Any type of resource you need can be crafted into your grid submission request.
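With LSF underneath, those constraints end up as a resource requirement string on the submission. A hand-written equivalent might look like this; the queue, resource names, and values are examples only, and the Grid Toolkit normally builds the request for you:

# Ask LSF for hosts with at least 8 GB of free memory and 20 GB of /tmp,
# reserve 4 slots, and restrict the job to hosts tagged with a site-defined
# "is85" boolean resource (the tag name is made up for this example).
bsub -q ds_batch -n 4 \
     -R "select[mem>8192 && tmp>20480 && is85] rusage[mem=8192]" \
     ./run_job.sh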

Because you are load balancing the job, the difference in execution should be that you have LESS of a constraint in terms of coordinating all of your jobs during your peak time. On an SMP system you only have 1 server with ... 12 cores. In a GRID, you have 1 Head Node that dispatches your job to a server, and that server can have any number of cores. Should that server be too busy, you will be load balanced (before the job starts executing) to another server that does have enough horsepower to handle your request.

Your DataStage admin will craft an APT_GRID_CONFIG file for you to use. So at execution time, your APT configuration file gets dynamically created for you based upon the degree of parallelism you choose (if you are using the Grid Enablement Toolkit supplied by IBM, which I would highly recommend).
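For reference, the dynamically generated file is an ordinary parallel configuration file; a two-node example with made-up hostnames and paths might look like this:

{
    node "node1"
    {
        fastname "compute01"
        pools ""
        resource disk "/dsdata/datasets" {pools ""}
        resource scratchdisk "/dsdata/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "compute02"
        pools ""
        resource disk "/dsdata/datasets" {pools ""}
        resource scratchdisk "/dsdata/scratch" {pools ""}
    }
}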
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

At a minimum, parallel jobs will require the addition of three grid-related environment variables: APT_GRID_ENABLED, APT_GRID_COMPUTENODES, APT_GRID_PARTITIONS (this is assuming you are using the Grid Toolkit mentioned by Paul). A fourth variable to add would be APT_GRID_QUEUE.
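For example, the defaults on a job might look something like this; the values and queue name are purely illustrative:

APT_GRID_ENABLED=YES        # hand this job off to the grid toolkit
APT_GRID_COMPUTENODES=2     # number of compute nodes to request
APT_GRID_PARTITIONS=4       # partitions (logical nodes) per compute node
APT_GRID_QUEUE=ds_batch     # resource manager queue; the name is an example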

There is a Grid Redbook available on IBM's website, written mainly for IS 8.1, but some of the concepts are still applicable to 8.5 and 8.7. If you're working with IBM on this, the team you work with will have updated info and guidance that has not made it into the Redbook yet.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.