How to run jobs in parallel from DataStage

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

desais_2001
Participant
Posts: 9
Joined: Fri Feb 28, 2003 3:52 am
Location: India

How to run jobs in parallel from DataStage

Post by desais_2001 »

Hi,

Can somebody guide me on how to achieve job parallelism from the DataStage engine? Our data volume is in terabytes and takes a very long time to process through the DataStage engine.

Thanks



Sanjay Desai
MAT
Participant
Posts: 65
Joined: Wed Mar 05, 2003 8:44 am
Location: Montréal, Canada

Post by MAT »

Hi Sanjay,

The parallelism you can achieve differs depending on which engine you are using.

With DS Server, you can start multiple jobs at the same time. If you want your jobs to run in parallel, you only have to start more than one job before you wait for any job to finish. More specifically, if you are designing job sequences, you can put your jobs in parallel in the Designer GUI and they will start simultaneously. Look at the generated code and you will see that many DSRunJob calls are made before any DSWaitForJob. In the same way, if you start your jobs from a BASIC script that you write yourself, you can call as many jobs as you want and they will run at the same time, as in the sketch below. Of course, this parallelism is very limited (you can't make a single job run on multiple processors... I think).
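For instance, a minimal BASIC sketch of that pattern (the job names JobA and JobB are placeholders for your own jobs):

* Attach and start both jobs before waiting on either one
hJobA = DSAttachJob("JobA", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJobA, DSJ.RUNNORMAL)
hJobB = DSAttachJob("JobB", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJobB, DSJ.RUNNORMAL)
* Both jobs are now running concurrently; only now do we wait
ErrCode = DSWaitForJob(hJobA)
ErrCode = DSWaitForJob(hJobB)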

With DS Parallel Extender, you have the benefits of DS Server, plus you can distribute your tasks across multiple processors. This means that you will benefit from pipelining as well as data partitioning over multiple processors. The increase in speed using DS Parallel Extender is worth it if you have very large volumes. For example, one of our jobs ran 10 times faster on PX than on Server, and we have only 4 processors on our development server. Some tasks are even faster. It is still a product in development if you ask me (be prepared to have lots of nice little chats with the guys at Ascential support and engineering if you acquire it), but the results are very impressive. I would recommend it if you handle large data volumes on which you perform relatively simple transformations, speed is critical, and you have a dedicated server (PX will grab all of the server's resources when running... the other guys here are starting to hate us).

Hope this helps

MAT
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Provided the legal niceties are followed with Ascential, a white paper on exactly this topic will appear on www.datastagexchange.com as soon as their imprimatur is received.

The particular answer in your case will depend on what release of DataStage you are running. I will assume that you have release 5.2 or later, or release 5.1 with Axcel Pack.

Partition parallelism can be accomplished by starting multiple instances of your job, with each instance processing a partition of the data. Typically you will have parameterized selection criteria in the stage that extracts the data. For example:
WHERE column BETWEEN #lowvalue# AND #highvalue#

To run multiple instances of a job, the job must have multi-instance capability enabled (a check box on the Job Properties window). Then, when you invoke it, you append a period and an "invocation ID", which can be any string that provides a unique identity (and contains only alphanumeric characters). For example:

* Attach two instances of the same multi-instance job, each with its own invocation ID
hJob1 = DSAttachJob("MyJob.1", DSJ.ERRNONE)
* Instance 1 processes the first partition of the key range
spCode = DSSetParam(hJob1, "lowvalue", 1)
spCode = DSSetParam(hJob1, "highvalue", 5000000)
hJob2 = DSAttachJob("MyJob.2", DSJ.ERRNONE)
* Instance 2 processes the second partition
spCode = DSSetParam(hJob2, "lowvalue", 5000001)
spCode = DSSetParam(hJob2, "highvalue", 10000000)

In this example, error checking/handling has been omitted for clarity.
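For completeness, a sketch of starting and waiting on both instances (DSRunJob, DSWaitForJob, DSGetJobInfo and DSDetachJob are the standard BASIC job control calls; the status check shown is just one minimal way to supply the error handling omitted above):

* Start both instances; they run concurrently
ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)
ErrCode = DSRunJob(hJob2, DSJ.RUNNORMAL)
* Wait for both to finish, then check that each finished OK
ErrCode = DSWaitForJob(hJob1)
ErrCode = DSWaitForJob(hJob2)
If DSGetJobInfo(hJob1, DSJ.JOBSTATUS) <> DSJS.RUNOK Then
   Call DSLogWarn("Instance 1 did not finish OK", "MyControlRoutine")
End
ErrCode = DSDetachJob(hJob1)
ErrCode = DSDetachJob(hJob2)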

It is also possible to create multiple streams within a single server job; if these streams are independent, they will execute in separate processes.

Should you have Parallel Extender installed and licensed (DataStage 6.0 and later), you can encapsulate your server job in a shared container, and allow the parallelism to be handled automatically in a parallel job that includes that shared container.
The Parallel Extender environment allows you to take optimal, yet controlled, advantage of all the processing nodes in a symmetric multi-processor (SMP) or massively parallel processing (MPP) system or cluster.


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518