Performance in job containing Column Import stage

evee1
Premium Member
Posts: 96
Joined: Tue Oct 06, 2009 4:17 pm
Location: Melbourne, AU

Post by evee1 »

I'm testing the performance of our system when processing a very large file (400 million records) using some very simple jobs.
Here are two examples of my test jobs:
Job 1
Fileset ---> Copy

Job 2
Fileset ---> Column Import ---> Copy

The processing time increased from 10 mins for Job 1 to 25 mins for Job 2.
The Column Import stage splits one field, DATA (varchar 2000), into 7 fields following the given schema file. I also tried an explicit column definition, and it takes nearly the same amount of time.
I have turned on partitioning on the input link of the CI stage (Hash on the DATA column), and it helped to reduce the processing time to 15 mins.
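
To put the three timings on a common scale, here is a minimal Python sketch that converts them into records per second. The record count and elapsed minutes are the figures quoted above; the run labels and function names are made up for illustration, and the per-partition figure assumes the records are spread evenly over 16 partitions, which has not been measured.

```python
RECORDS = 400_000_000  # record count quoted in the post

# Elapsed minutes as reported in the post (labels are illustrative only)
RUNS = {
    "Job 1 (Fileset -> Copy)": 10,
    "Job 2 (Column Import, no hash partitioning)": 25,
    "Job 2 (Column Import, hash on DATA)": 15,
}

PARTITIONS = 16  # assumption: even distribution across the 16 nodes


def records_per_second(records, minutes):
    """Average throughput over the whole run."""
    return records / (minutes * 60)


def throughput_table(records, runs):
    """Map each run label to its rounded overall records/second."""
    return {name: round(records_per_second(records, m)) for name, m in runs.items()}


if __name__ == "__main__":
    for name, rate in throughput_table(RECORDS, RUNS).items():
        per_partition = rate // PARTITIONS
        print(f"{name}: {rate:,} rec/s overall, ~{per_partition:,} rec/s per partition")
```

On these numbers the partitioned Job 2 still runs at roughly two thirds of the Job 1 rate, so the parsing step remains the dominant extra cost.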
I wonder what I can do to reduce this time even further.
I am using a 16-node config file.
Am I correct in assuming that the Column Import stage works in parallel? If not, how can I make it so?

Here is the job score:

main_program: This step has 1 dataset:
ds0: {op0[16p] (parallel fs_DataIn)
 eAny=>eCollectAny
 op1[16p] (parallel APT_CombinedOperatorController:ci_Data)}
It has 2 operators:
op0[16p] {(parallel fs_DataIn)
 on nodes (
 node1[op0,p0] node2[op0,p1] node3[op0,p2] node4[op0,p3] node5[op0,p4] node6[op0,p5]
 node7[op0,p6] node8[op0,p7] node9[op0,p8] node10[op0,p9] node11[op0,p10] node12[op0,p11]
 node13[op0,p12] node14[op0,p13] node15[op0,p14] node16[op0,p15]
 )}
op1[16p] {(parallel APT_CombinedOperatorController:
 (ci_Data)
 (Copy_188)
 ) on nodes (
 node1[op1,p0] node2[op1,p1] node3[op1,p2] node4[op1,p3] node5[op1,p4] node6[op1,p5]
 node7[op1,p6] node8[op1,p7] node9[op1,p8] node10[op1,p9] node11[op1,p10] node12[op1,p11]
 node13[op1,p12] node14[op1,p13] node15[op1,p14] node16[op1,p15]
 )}
It runs 32 processes on 16 nodes.
Thanks.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You don't have to assume anything. The score tells you that the op1 operator (the Column Import stage, which is combined in the same process as the Copy operator, i.e. the Copy stage) executes in parallel: it is marked "parallel" and shown as op1[16p], meaning 16 partitions.
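
For anyone who wants to pull this out of a score mechanically, here is a rough Python sketch. It is not a DataStage tool, only a regular-expression pass over the text layout seen in the score posted above; the function name `summarize_score` and the trimmed excerpt are illustrative assumptions, and other score layouts may not match.

```python
import re

# Trimmed excerpt of the score posted above (node lists shortened for brevity).
SCORE = """\
op0[16p] {(parallel fs_DataIn)
 on nodes (
 node1[op0,p0] node2[op0,p1]
 )}
op1[16p] {(parallel APT_CombinedOperatorController:
 (ci_Data)
 (Copy_188)
 ) on nodes (
 node1[op1,p0] node2[op1,p1]
 )}
It runs 32 processes on 16 nodes.
"""

# Matches operator headers such as: op1[16p] {(parallel SomeOperatorName
OP_HEADER = re.compile(r"^op(\d+)\[(\d+)p\]\s*\{\((parallel|sequential)\s+(.*)$")
# Matches a combined-operator member line such as: (ci_Data)
MEMBER = re.compile(r"\((\w+)\)")


def summarize_score(text):
    """Return one dict per operator: number, partition count, mode, name, combined stages."""
    ops = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = OP_HEADER.match(line)
        if header:
            name = header.group(4).rstrip(")").rstrip(":")
            current = {
                "op": int(header.group(1)),
                "partitions": int(header.group(2)),
                "mode": header.group(3),
                "name": name,
                "combined": [],
            }
            ops.append(current)
            continue
        member = MEMBER.fullmatch(line)
        if member and current is not None:
            current["combined"].append(member.group(1))
    return ops


if __name__ == "__main__":
    for op in summarize_score(SCORE):
        stages = ", ".join(op["combined"]) or op["name"]
        print(f"op{op['op']}: {op['mode']}, {op['partitions']} partitions ({stages})")
```

For this score the two operators each report 16 partitions, which adds up to the 32 processes on the last line of the score.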

Check the monitor to determine CPU consumption, though I doubt that a Copy stage acting as a sink will consume much CPU.

It might be interesting to compare the performance of Column Import with that of a Transformer stage doing exactly the same parsing and nothing else.
Last edited by ray.wurlod on Thu Sep 01, 2011 11:42 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
evee1
Premium Member
Posts: 96
Joined: Tue Oct 06, 2009 4:17 pm
Location: Melbourne, AU

Post by evee1 »

My "assumption" was based on my lack of knowledge of how to interpret the score. I suspected that it runs in parallel, but needed some confirmation :wink:. Thanks.

I might test the parsing method as well, but probably not until Monday.