SMPs and process distribution
Our DataStage server runs on an SMP with 4 processors. Say I have a simple job: a flat file as source and a Transformer with output links to an ORAOCI9 stage (inserts only) and a reject file. With the in-process and inter-process buffers left at their defaults, will the server split the insert load among the processors? I am using Server Edition 7.0 on Windows. For simply-designed jobs that handle huge volumes of data, is there any way to maximize throughput through optimal use of these processors? Something along the lines of partitioning the data and assigning it to nodes as in parallel systems, but using only Server Edition resources.
Thanks.
gateleys
Absolutely. Use job instances with a constraint that divides the rows evenly among the instances, so each instance is responsible for a portion of the data. You can run 4 copies of the job, each inserting 1/4 of the rows. It is easy and elegant, and we were doing this long before PX.
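The divide-among-instances idea can be illustrated outside DataStage. This is a minimal Python sketch (not DataStage syntax; the `rows_for_instance` name and row layout are invented for illustration) showing how N copies of the same filter, each given a different instance number, split one source evenly:

```python
def rows_for_instance(rows, instance_number, instance_count):
    """Keep only the rows this job instance is responsible for.

    Mirrors the job-instance constraint: each of the N running
    copies filters the same source down to its own 1/N share.
    """
    return [row for row in rows if row["key"] % instance_count == instance_number]

all_rows = [{"key": k, "data": f"row-{k}"} for k in range(12)]

# Simulate 4 parallel instances of the same job.
shares = [rows_for_instance(all_rows, i, 4) for i in range(4)]

assert sum(len(s) for s in shares) == len(all_rows)  # no row is lost
assert all(len(s) == 3 for s in shares)              # even 1/4 split
```

Each instance applies the same constraint with a different parameter value, so no coordination between the copies is needed at run time.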
I've got a document on my website called Performance Analysis which covers this effort.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
DSguru2B wrote: Did you try using Link Partitioner? As far as I have read from Ray's posts, in Server the only way to use parallelism is the Link Partitioner stage.

I have used Link Partitioner and Link Collector in situations where the input rows can be segregated by conditions applied in a Transformer within each partition (between the partitioner and the collector). However, if I just need to read from a sequential file and load a table via a Transformer, I do NOT have any constraint that would be the basis for partitioning my data. If I used something like row numbers (or some IDs) to split the rows, it would not be seamless. Can you suggest a design in which the partitioner/collector pair could be used in my case?
Thanks,
gateleys
kcbland wrote: Absolutely. Use job instances with a constraint that divides the rows evenly among the instances. I've got a document on my website called Performance Analysis which covers this effort.

Hi Kenneth, sorry about yanking part of your response. Can you provide some more guidance on creating multiple instances of the kind of job I described? What would the design look like? And can you provide a link to your website also, please!!
Thanks,
gateleys
gateleys wrote: And can you provide a link to your website also, please!!

At the bottom of every one of his posts.
Last edited by chulett on Mon Mar 20, 2006 9:11 pm, edited 1 time in total.
-craig
"You can never have too many knives" -- Logan Nine Fingers
DSguru2B wrote: As far as I have read from Ray's posts, in Server the only way to use parallelism is the Link Partitioner stage.

You have misconstrued whatever it was you read. There are at least five different ways to effect partition parallelism in server jobs, including multiple independent streams in one job, multiple jobs, and multiple instances of multi-instance jobs, as well as Link Partitioner, and using a Transformer stage to split the input stream into many, each with active processing downstream.
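One of the approaches listed above, splitting a single stream into several sub-streams and merging them again, can be sketched in plain Python. This is only an illustration of the partition/collect pattern, not DataStage code; the function names are invented:

```python
from itertools import cycle

def partition_round_robin(rows, n):
    """Split one input stream into n sub-streams, in the spirit of
    the Link Partitioner stage's round-robin mode."""
    streams = [[] for _ in range(n)]
    turn = cycle(range(n))
    for row in rows:
        streams[next(turn)].append(row)
    return streams

def collect(streams):
    """Merge the sub-streams back into a single stream, in the spirit
    of the Link Collector stage."""
    return [row for stream in streams for row in stream]

rows = list(range(10))
streams = partition_round_robin(rows, 4)
merged = collect(streams)
assert sorted(merged) == rows  # nothing lost or duplicated
```

Between the partition and the collect, each sub-stream can carry its own active processing, which is where the parallelism comes from.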
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I really appreciate your comments. However, could someone take me a step beyond what Kenneth has suggested with job instances? How do I create them, and how do I specify constraints to split my input rows into portions that can feed each processor? All I can see is the 'Allow multiple instances' checkbox in the job properties.
Thanks,
gateleys
Allowing multiple instances means you can run any number of copies of the same job at the same time.
Let's assume you are reading from a database table into a hashed file and that your key is a numeric field in both files. If you have a single instance of a job you would have your source stage do a SELECT from the table and write to the hashed file.
In a simple multi-instance version of the same job you could add a parameter called INSTANCENUMBER and call the job from a sequencer 3 times in parallel, each instance getting its own unique name and passing 0, 1, or 2 as the INSTANCENUMBER.
Your user-defined SQL SELECT clause contains a clause such as

Code: Select all
WHERE MOD(KEY, 3) = #INSTANCENUMBER#

which ensures that each instance gets 1/3 of the records selected, assuming your KEY has an even distribution.
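The MOD-based WHERE clause partitions the key space cleanly: every key satisfies exactly one instance's predicate, so there are no gaps and no overlaps. A quick Python check of that property, using a hypothetical range of numeric key values and 3 instances:

```python
INSTANCE_COUNT = 3
keys = range(1, 1001)  # hypothetical numeric key values

# The rows each instance's WHERE MOD(KEY, 3) = #INSTANCENUMBER# selects.
shares = [[k for k in keys if k % INSTANCE_COUNT == i]
          for i in range(INSTANCE_COUNT)]

# Every key is selected by exactly one instance: no gaps, no overlaps.
assert sorted(k for share in shares for k in share) == list(keys)

# The split is near-even when the keys themselves are evenly distributed.
sizes = [len(share) for share in shares]
assert max(sizes) - min(sizes) <= 1
```

As Kenneth notes, the evenness of the split depends entirely on the distribution of KEY values; a skewed key (e.g. one residue far more common than the others) skews the workload the same way.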
I guess you don't want the 10-page document that SHOWS you how to do this.
Do you see underneath all of my posts there's a "Posters Website" link?
Kenneth Bland
I tried parallelizing my process with 3 separate parallel links from the input feeding the output link. I used MOD(@INROWNUM, 3) = PartitionNumber - 1 as the constraint in the Transformer, where PartitionNumber is a job parameter, as specified in Kenneth's document. The problem is that I get an error saying the different processes cannot write to the same sequential file (which is my output); there is a conflict over the target resource. How do I get past this? And yes, my job is defined with 'Allow multiple instances'.
Thanks,
gateleys
Read more closely: the output sequential file has to be uniquely named. Include in the filename either a job parameter that is instance-number aware (in my example I use PartitionNumber) or the invocation ID (available as a macro value).
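The per-instance naming scheme is simple to sketch. A minimal Python illustration, assuming an invented helper name and path layout, showing that each instance derives a distinct target file from its own parameter value:

```python
def instance_output_path(base_dir, job_name, invocation_id):
    """Return an instance-aware target filename so parallel instances
    never contend for the same sequential file."""
    return f"{base_dir}/{job_name}_{invocation_id}.out"

# Three parallel instances, three distinct output files.
paths = [instance_output_path("/data/out", "LoadJob", i) for i in range(3)]
assert len(set(paths)) == 3
```

After all instances finish, a downstream step (a Link Collector, or simply concatenating the files) merges the per-instance outputs into the single result file.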
Kenneth Bland
Okay, but how about the good stuff? Are you moving more data and using more of your server now? Are ya happy?
Kenneth Bland