Performance in job containing Column Import stage

evee1 · Post by **evee1** » Thu Sep 01, 2011 10:51 pm

I'm testing the performance of our system when processing a very large file (400 million records) using some very simple jobs.
Here are two examples of my test jobs:
Job 1
Fileset ---> Copy

Job 2
Fileset ---> Column Import ---> Copy

The processing time increased from 10 minutes for Job 1 to 25 minutes for Job 2.
The Column Import stage splits one field, DATA (varchar 2000), into 7 fields following the given schema file. I also tried explicit column definitions, and it took nearly the same amount of time.
I have turned on partitioning on the input link of the CI stage (Hash on the DATA column), and that reduced the processing time to 15 minutes.
I wonder what I can do to reduce this time even further.
I am using a 16-node config file.
Am I correct in assuming that the Column Import stage works in parallel? If not, how can I make it so?
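
To put the three timings on a common scale, here is a rough sketch that converts them into records per second. The record count, node count, and run times are the figures quoted above; the per-node number assumes the records are spread evenly across the 16 nodes, which has not been verified.

```python
# Figures taken from the post above.
RECORDS = 400_000_000
NODES = 16

# Elapsed time in minutes for each test run.
RUNS = {
    "Job 1 (Fileset -> Copy)": 10,
    "Job 2 (Fileset -> Column Import -> Copy)": 25,
    "Job 2 with hash partitioning on DATA": 15,
}


def throughput(records: int, minutes: float) -> float:
    """Return the overall rate in records per second."""
    return records / (minutes * 60)


if __name__ == "__main__":
    for label, minutes in RUNS.items():
        rate = throughput(RECORDS, minutes)
        # Per-node rate assumes an even spread over the nodes (an assumption).
        print(f"{label}: {rate:,.0f} records/s total, {rate / NODES:,.0f} per node")
```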

Here is the job score:

main_program: This step has 1 dataset:
ds0: {op0[16p] (parallel fs_DataIn)
 eAny=>eCollectAny
 op1[16p] (parallel APT_CombinedOperatorController:ci_Data)}
It has 2 operators:
op0[16p] {(parallel fs_DataIn)
 on nodes (
 node1[op0,p0] node2[op0,p1] node3[op0,p2] node4[op0,p3] node5[op0,p4] node6[op0,p5]
 node7[op0,p6] node8[op0,p7] node9[op0,p8] node10[op0,p9] node11[op0,p10] node12[op0,p11]
 node13[op0,p12] node14[op0,p13] node15[op0,p14] node16[op0,p15]
 )}
op1[16p] {(parallel APT_CombinedOperatorController:
 (ci_Data)
 (Copy_188)
 ) on nodes (
 node1[op1,p0] node2[op1,p1] node3[op1,p2] node4[op1,p3] node5[op1,p4] node6[op1,p5]
 node7[op1,p6] node8[op1,p7] node9[op1,p8] node10[op1,p9] node11[op1,p10] node12[op1,p11]
 node13[op1,p12] node14[op1,p13] node15[op1,p14] node16[op1,p15]
 )}
It runs 32 processes on 16 nodes.

Thanks.

ray.wurlod · Post by **ray.wurlod** » Thu Sep 01, 2011 10:58 pm

You don't have to assume anything. The score tells you that the op1 operator executes in parallel on all 16 partitions. That operator is the Column Import stage (ci_Data), combined into the same process as the Copy operator (the Copy stage, Copy_188).
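
If reading the score by eye is awkward, a small script can pull out the same facts. The following is only a sketch: the regular expression is written against the operator header lines as they appear in the score posted above (for example `op0[16p] {(parallel fs_DataIn)`), and the `summarize_score` helper and the abbreviated `SCORE` sample are illustrative names, not part of DataStage. Other score layouts may need a different pattern.

```python
import re

# Abbreviated copy of the operator section of the score posted above.
SCORE = """\
op0[16p] {(parallel fs_DataIn)
 on nodes (
 node1[op0,p0] node2[op0,p1]
 )}
op1[16p] {(parallel APT_CombinedOperatorController:
 (ci_Data)
 (Copy_188)
 ) on nodes (
 node1[op1,p0] node2[op1,p1]
 )}
It runs 32 processes on 16 nodes.
"""

# Matches operator headers such as "op0[16p] {(parallel fs_DataIn)".
OP_HEADER = re.compile(
    r"^(op\d+)\[(\d+)p\] \{\((parallel|sequential) ([^)\s]+)",
    re.MULTILINE,
)


def summarize_score(score: str) -> list:
    """Return one dict per operator: id, partition count, mode, and name."""
    return [
        {
            "op": m.group(1),
            "partitions": int(m.group(2)),
            "mode": m.group(3),
            "name": m.group(4).rstrip(":"),
        }
        for m in OP_HEADER.finditer(score)
    ]


if __name__ == "__main__":
    for op in summarize_score(SCORE):
        print(op)
```

An operator reported with `[16p]` and `parallel` is running one instance per partition, which is the case for both operators here.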

Check the monitor to determine CPU consumption. Though I doubt that a Copy stage acting as a sink will consume much CPU.

It might be interesting to compare the performance of Column Import with that of a Transformer stage doing exactly the same parsing and nothing else.

evee1 · Post by **evee1** » Thu Sep 01, 2011 11:28 pm

My "assumption" was based on my lack of knowledge of how to interpret the score. I suspected that it runs in parallel, but needed some confirmation. Thanks.

I might test the parsing method as well, but probably not until Monday.