
Link Partitioning/Collecting or Coded Round Robin Partitions

Posted: Thu Mar 11, 2004 9:46 am
by simong
All,

I am attaining 5,000 rows per second (I know it's bytes that really matter) from DB2 (using the DB2/UDB object) through an IPC and out to a SQL Bulk Load object in DataStage.

I am looking to introduce a parallel alternative (Do not have PX yet) in order to use more than 1 CPU (I have 8 in total).

My question is: should I use 4 SEPARATE jobs with the MOD(@INROWNUM, PartitionCount) = PartitionNumber - 1 method, or should I use Link Partitioner and Link Collector objects in a single job, complete with a separate active stage for each flow (such as a 1-to-1-mapping Transformer stage)?
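To be concrete, the constraint I mean would sit on the Transformer output link in each of the four jobs, with PartitionCount and PartitionNumber passed in as job parameters, something like:

Code:
MOD(@INROWNUM, PartitionCount) = PartitionNumber - 1

With PartitionCount = 4 and PartitionNumber = 1 through 4, each job keeps every fourth input row.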

I have tried the latter, and no apparent performance improvement occurred.

Posted: Thu Mar 11, 2004 1:06 pm
by kcbland
Optimize your spool of data out of DB2 first. Write a job that is just DB2 --> xfm --> seq and see how fast that runs. Then pick a numeric column, something like the primary key, and instantiate that job, passing in parameters that divide the data in the SQL. You'll find that the degree of instantiation for spooling data from a database will be different from the degree of instantiation for transformation, and different again for loading. You want the flexibility to run, say, the extract X 3, the transformation X 10, and the loading via a bulk loader.
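As a sketch only (the table and column names here are invented; in DataStage you substitute job parameters into user-defined SQL with the #param# notation), each instance's extract query could look like:

Code:
SELECT *
FROM   ORDERS
WHERE  MOD(ORDER_ID, #PartitionCount#) = #PartitionNumber# - 1

Each instance then spools a disjoint slice of the table, so the extracts can run concurrently without overlapping.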

Re: Link Partitioning/Collecting or Coded Round Robin Partitions

Posted: Fri Mar 12, 2004 4:36 am
by santhu
Hi,

When you tried IPC with Link Partitioner and Link Collector, did you enable the "Inter-process" buffering option under the "Performance" tab at the job level? This is a must to get the best performance from multi-process jobs.

I agree completely with Kenneth's suggestion.

Write a job that is just DB2 --> xfm --> seq, and test the performance with the multiple-instance option enabled under Job Properties. Run each instance with its own sequential file name and a different range of source key column values as parameters, for example:
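This is a sketch only; the project, job, parameter names, and key ranges below are invented:

Code:
dsjob -run -param KeyLow=1       -param KeyHigh=1000000 -param OutFile=/tmp/part1.txt dstage ExtractJob.inst1
dsjob -run -param KeyLow=1000001 -param KeyHigh=2000000 -param OutFile=/tmp/part2.txt dstage ExtractJob.inst2

The .inst1 / .inst2 suffixes are the invocation ids that distinguish the running instances of the same job.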

Regards,
Santhosh S

Posted: Fri Mar 12, 2004 10:52 am
by simong
OK, thanks for the advice. I will try this and get back to you early next week with my findings.

It is also worth mentioning that the SQL bulk loader seems fine in this process in terms of performance. I have seen a lot of posts on this forum suggesting that the bulk loaders in DataStage are very poorly designed.

Is this still the case with version 7? I guess the alternative is to create a control file (based on a single output record) and load via a sequential file instead.

Simon

Posted: Fri Mar 12, 2004 10:56 am
by simong
I have each job set with the appropriate properties in terms of multiple instancing and inter-process buffering.

Are you guys talking about a control job calling these 4-5 identical jobs (with different selection criteria), or executing them as 5 separate jobs from Director? In fact, does it make a difference?

Posted: Fri Mar 12, 2004 11:05 am
by kcbland
simong wrote: Are you guys talking about a control job calling these 4-5 identical jobs (with different selection criteria), or executing them as 5 separate jobs from Director? In fact, does it make a difference?
You can either make physical clones or use job instances. The effect is the same. An instance is just a copy of another job that is intimately linked to the master (it shares the same job log, compiling the master deletes all instances, and so on).
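If you do go the control-job route, the job control code is ordinary DataStage BASIC along these lines (a sketch; the job name, invocation ids, and parameter are assumptions):

Code:
* Attach and start two instances of the same job; DSRunJob returns
* after starting each one, so the instances run concurrently
hJob1 = DSAttachJob("ExtractJob.inst1", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob1, "PartitionNumber", "1")
ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)

hJob2 = DSAttachJob("ExtractJob.inst2", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob2, "PartitionNumber", "2")
ErrCode = DSRunJob(hJob2, DSJ.RUNNORMAL)

* Block until both instances finish
ErrCode = DSWaitForJob(hJob1)
ErrCode = DSWaitForJob(hJob2)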

Posted: Mon Mar 15, 2004 1:17 am
by santhu

I think it's better to declare the job as a multi-instance job rather than making N copies: with separate copies, any design change has to be made in all 5 jobs, which is redundant work.

With multiple instances, the same job design runs with different job parameters, and the instances can run concurrently.

Rgds,
Santhosh S

Posted: Mon Mar 15, 2004 1:19 am
by santhu
The DB2 bulk loader, or any bulk loader provided by DataStage, internally calls the respective database's bulk loading utility. So it's not slow, since it inserts without database logging.

Thanks,
Santhosh S