Copy Stage

Post questions about DataStage Enterprise/PX Edition here, covering areas such as parallel job design, parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Copy Stage

Post by jim.paradies »

I've read in a number of places that the Copy stage is compiled out when the Force option is set to False, but I've just run a simple experiment that suggests otherwise.

Code:

CopyTable1 - From source to target with copy stage
------------------------------------------------------------------
ORACLE--------->COPY------------->DATASET


CopyTable2 - Direct from source to target
-------------------------------------------------
ORACLE---------------------------------------->DATASET

And here are the scores generated

CopyTable1 - from source to target with copy stage
--------------------------------------------------
main_program: This step has 3 datasets:
ds0: {op0[2p] (parallel srcCountries)
      eAny=>eCollectAny
      op1[2p] (parallel inCntry_outCntry)}
ds1: {op2[2p] (parallel delete data files in delete D:/Projects/Devl/scratch/coyTable1.ds)
      >>eCollectAny
      op3[1p] (sequential delete descriptor file in delete D:/Projects/Devl/scratch/coyTable1.ds)}
ds2: {op1[2p] (parallel inCntry_outCntry)
      =>
      D:/Projects/Devl/scratch/coyTable1.ds}
It has 4 operators:
op0[2p] {(parallel srcCountries)
    on nodes (
      node1[op0,p0]
      node2[op0,p1]
    )}
op1[2p] {(parallel inCntry_outCntry)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[2p] {(parallel delete data files in delete D:/Projects/Devl/scratch/coyTable1.ds)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
    )}
op3[1p] {(sequential delete descriptor file in delete D:/Projects/Devl/scratch/coyTable1.ds)
    on nodes (
      node1[op3,p0]
    )}
It runs 7 processes on 2 nodes.



CopyTable2 - direct from source to target
----------------------------------------
main_program: This step has 2 datasets:
ds0: {op1[2p] (parallel delete data files in delete D:/Projects/Devl/scratchcopyTable2.ds)
      >>eCollectAny
      op2[1p] (sequential delete descriptor file in delete D:/Projects/Devl/scratchcopyTable2.ds)}
ds1: {op0[2p] (parallel srcCountries)
      =>
      D:/Projects/Devl/scratchcopyTable2.ds}
It has 3 operators:
op0[2p] {(parallel srcCountries)
    on nodes (
      node1[op0,p0]
      node2[op0,p1]
    )}
op1[2p] {(parallel delete data files in delete D:/Projects/Devl/scratchcopyTable2.ds)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[1p] {(sequential delete descriptor file in delete D:/Projects/Devl/scratchcopyTable2.ds)
    on nodes (
      node1[op2,p0]
    )}
It runs 5 processes on 2 nodes.


All stages have Combinability Mode set to Auto.

So what am I missing?
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Don't know where you read that. The Copy stage is compiled out whenever it's not needed (that is, whenever it makes an identical copy of its input), regardless of anything else. Force compile has nothing whatsoever to do with it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

OK Ray.

I stand corrected. It's compiled out if it's not needed AND the Force property is set to False.

(I'm not referring to force compile; I mean the Force property in the Copy stage.)

In this case, it's a straight copy. No changes to names, no columns dropped, no change to the order of columns.

The manual states:
"Where you are using a Copy stage with a single input and a single output, you should ensure that you set the Force property in the stage editor TRUE. This prevents WebSphere DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job."
So I'm still curious to know why it isn't optimised out of the job.
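The manual's rule can be sketched as a tiny predicate (illustrative Python only; the function and parameter names are invented, this is not DataStage engine code):

```python
# Illustrative sketch of the manual's rule, not actual engine logic:
# a single-input/single-output Copy stage should be optimized out of the
# score when Force is False and the stage makes no changes at all.

def copy_stage_optimized_out(force, renames_columns, drops_columns, reorders_columns):
    makes_changes = renames_columns or drops_columns or reorders_columns
    return (not force) and (not makes_changes)

# Jim's case: straight copy, Force=False -- by the manual's rule the
# copy operator should NOT appear in the score:
print(copy_stage_optimized_out(False, False, False, False))  # True
```

Under this reading, the Copy_1 operator appearing in the score above is exactly the puzzle.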
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Ah, the Force option in the Copy stage itself. Well, that's there so that you can require the compiler to include a copy operator even though one is not, strictly speaking, needed. Another way to keep the stage is to have it do something, even something trivial like renaming a column. You say that this does not apply.

In your score for CopyTable1, I see that the copy operator (inCntry_outCntry) does appear to be present, which I would only expect to see if the Copy stage's Force option were set to True. And you say that it isn't. Would you mind checking?

Maybe the job compiler in version 8.x does things a little differently.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

Tried it again just now, after creating a completely new job on a different server (same OS and DataStage version) but with 4 nodes, and after checking that the Force option on the Copy stage is set to False.

Code:

main_program: This step has 3 datasets:
ds0: {op0[1p] (sequential ODBC_Enterprise_0)
      eAny<>eCollectAny
      op1[4p] (parallel Copy_1)}
ds1: {op2[4p] (parallel delete data files in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)
      >>eCollectAny
      op3[1p] (sequential delete descriptor file in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)}
ds2: {op1[4p] (parallel Copy_1)
      =>
      G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds}
It has 4 operators:
op0[1p] {(sequential ODBC_Enterprise_0)
    on nodes (
      node1[op0,p0]
    )}
op1[4p] {(parallel Copy_1)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
      node3[op1,p2]
      node4[op1,p3]
    )}
op2[4p] {(parallel delete data files in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
    )}
op3[1p] {(sequential delete descriptor file in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op3,p0]
    )}
It runs 10 processes on 4 nodes.
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Don't know. Can we see the generated OSH? I'm particularly interested in the record schemas - obfuscate the column names if need be, but preserve the one-to-one relationship between real names and obfuscated names.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

Ray,

I've created another job with a much simpler schema and using a row generator.

Here's the score.

Code:

main_program: This step has 3 datasets:
ds0: {op0[1p] (sequential Row_Generator_5)
      eAny<>eCollectAny
      op1[4p] (parallel Copy_1)}
ds1: {op2[4p] (parallel delete data files in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)
      >>eCollectAny
      op3[1p] (sequential delete descriptor file in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)}
ds2: {op1[4p] (parallel Copy_1)
      =>
      G:/Projects/DEV/Workbenches/JimParadies/jim.ds}
It has 4 operators:
op0[1p] {(sequential Row_Generator_5)
    on nodes (
      node1[op0,p0]
    )}
op1[4p] {(parallel Copy_1)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
      node3[op1,p2]
      node4[op1,p3]
    )}
op2[4p] {(parallel delete data files in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
    )}
op3[1p] {(sequential delete descriptor file in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op3,p0]
    )}
It runs 10 processes on 4 nodes.
And here's the OSH

Code:

[ident('Copy_1'); jobmon_ident('Copy_1')]
## Inputs
0< [] 'Row_Generator_5:DSLink3.v'
## Outputs
0> [modify (
keep
  bth_dte,emp_id,sex;
)] 'Copy_1:DSLink4.v'
;
#################################################################
#### STAGE: Data_Set_2
## Operator
copy
## General options
[ident('Data_Set_2')]
## Inputs
0< [] 'Copy_1:DSLink4.v'
## Outputs
0>| [ds] 'G:\\Projects\\DEV\\Workbenches\\JimParadies\\jim.ds'
;
#################################################################
#### STAGE: Row_Generator_5
## Operator
generator
## Operator options
-schema record
(
  bth_dte:nullable string[max=10];
  emp_id:nullable string[max=12];
  sex:nullable string[max=2];
)
-records 10
## General options
[ident('Row_Generator_5'); jobmon_ident('Row_Generator_5')]
## Outputs
0> [] 'Row_Generator_5:DSLink3.v'
;
# End of OSH code
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The only thing I can suggest is that the Copy stage is kept because its output does not connect to a virtual data set. Indeed, the operator associated with the Data Set stage is copy, which may mean that "your" Copy stage was eliminated but a copy operator remains to transfer data to the Data Set. Test this theory by placing any stage type (except a Modify stage or another Copy stage) between the Copy stage and the Data Set stage.
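This theory can be sketched as a small rule (illustrative Python only; the names are invented and "feeds_virtual_dataset" stands for "the stage's output link goes to another operator rather than a persistent .ds file" - this is not actual DataStage engine code):

```python
# Illustrative sketch of the theory above, not actual engine logic:
# a copy operator is eliminated only when ALL of these hold; otherwise
# it survives into the score.

def copy_is_eliminated(force, modifies_records, feeds_virtual_dataset):
    return (not force) and (not modifies_records) and feeds_virtual_dataset

# Jim's job: straight copy, Force=False, but the output lands in a
# persistent dataset -- under this theory the copy operator is kept:
print(copy_is_eliminated(force=False, modifies_records=False,
                         feeds_virtual_dataset=False))  # False -> kept
```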
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.