Copy Stage

Post questions about DataStage Enterprise/PX Edition here, covering areas such as parallel job design, parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Copy Stage

Post by jim.paradies »

I've read in a number of places that the Copy stage is compiled out when the Force option is set to False, but I've just run a simple experiment that suggests otherwise.

Code:

CopyTable1 - From source to target with copy stage
------------------------------------------------------------------
ORACLE--------->COPY------------->DATASET


CopyTable2 - Direct from source to target
-------------------------------------------------
ORACLE---------------------------------------->DATASET

And here are the scores generated

CopyTable1 - from source to target with copy stage
--------------------------------------------------
main_program: This step has 3 datasets:
ds0: {op0[2p] (parallel srcCountries)
      eAny=>eCollectAny
      op1[2p] (parallel inCntry_outCntry)}
ds1: {op2[2p] (parallel delete data files in delete D:/Projects/Devl/scratch/coyTable1.ds)
      >>eCollectAny
      op3[1p] (sequential delete descriptor file in delete D:/Projects/Devl/scratch/coyTable1.ds)}
ds2: {op1[2p] (parallel inCntry_outCntry)
      =>
      D:/Projects/Devl/scratch/coyTable1.ds}
It has 4 operators:
op0[2p] {(parallel srcCountries)
    on nodes (
      node1[op0,p0]
      node2[op0,p1]
    )}
op1[2p] {(parallel inCntry_outCntry)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[2p] {(parallel delete data files in delete D:/Projects/Devl/scratch/coyTable1.ds)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
    )}
op3[1p] {(sequential delete descriptor file in delete D:/Projects/Devl/scratch/coyTable1.ds)
    on nodes (
      node1[op3,p0]
    )}
It runs 7 processes on 2 nodes.



CopyTable2 - direct from source to target
----------------------------------------
main_program: This step has 2 datasets:
ds0: {op1[2p] (parallel delete data files in delete D:/Projects/Devl/scratchcopyTable2.ds)
      >>eCollectAny
      op2[1p] (sequential delete descriptor file in delete D:/Projects/Devl/scratchcopyTable2.ds)}
ds1: {op0[2p] (parallel srcCountries)
      =>
      D:/Projects/Devl/scratchcopyTable2.ds}
It has 3 operators:
op0[2p] {(parallel srcCountries)
    on nodes (
      node1[op0,p0]
      node2[op0,p1]
    )}
op1[2p] {(parallel delete data files in delete D:/Projects/Devl/scratchcopyTable2.ds)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[1p] {(sequential delete descriptor file in delete D:/Projects/Devl/scratchcopyTable2.ds)
    on nodes (
      node1[op2,p0]
    )}
It runs 5 processes on 2 nodes.


All stages have Combinability Mode set to Auto.

So what am I missing?
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Don't know where you read that. The Copy stage is compiled out whenever it's not needed (that is, whenever it makes an identical copy of its input), regardless of anything else. Force compile has nothing whatsoever to do with it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

OK Ray.

I stand corrected. It's compiled out if it's not needed AND the Force property is set to False.

(I'm not referring to force compile; I mean the Force property in the Copy stage.)

In this case, it's a straight copy. No changes to names, no columns dropped, no change to the order of columns.

The manual states:
"Where you are using a Copy stage with a single input and a single output, you should ensure that you set the Force property in the stage editor TRUE. This prevents WebSphere DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job."
So I'm still curious to know why it isn't optimised out of the job.
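The manual's rule can be sketched as a tiny predicate (illustrative Python only; the function and parameter names are invented, this is not DataStage engine code):

```python
# Illustrative sketch of the manual's rule, not actual engine logic:
# a single-input/single-output Copy stage should be optimized out of the
# score when Force is False and the stage makes no changes at all.

def copy_stage_optimized_out(force, renames_columns, drops_columns, reorders_columns):
    makes_changes = renames_columns or drops_columns or reorders_columns
    return (not force) and (not makes_changes)

# Jim's case: straight copy, Force=False -- by the manual's rule the
# copy operator should NOT appear in the score:
print(copy_stage_optimized_out(False, False, False, False))  # True
```

Under this reading, the Copy_1 operator appearing in the score above is exactly the puzzle.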
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Ah, the Force option in the Copy stage itself. Well, that's there so that you can require the compiler to include a copy operator even though one is not, strictly speaking, needed. Another way to keep the stage is to have it do something, even something trivial like renaming a column. You say that this does not apply.

In your score for CopyTable1, I see that the copy operator (inCntry_outCntry) does appear to be present, which I would only expect to see if the Copy stage's Force option were set to True. And you say that it isn't. Would you mind checking?

Maybe the job compiler in version 8.x does things a little differently.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

Tried it again just now, after creating a completely new job on a different server (same OS and DataStage version) but with 4 nodes, and after checking that the Force option on the Copy stage is set to False.

Code:

main_program: This step has 3 datasets:
ds0: {op0[1p] (sequential ODBC_Enterprise_0)
      eAny<>eCollectAny
      op1[4p] (parallel Copy_1)}
ds1: {op2[4p] (parallel delete data files in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)
      >>eCollectAny
      op3[1p] (sequential delete descriptor file in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)}
ds2: {op1[4p] (parallel Copy_1)
      =>
      G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds}
It has 4 operators:
op0[1p] {(sequential ODBC_Enterprise_0)
    on nodes (
      node1[op0,p0]
    )}
op1[4p] {(parallel Copy_1)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
      node3[op1,p2]
      node4[op1,p3]
    )}
op2[4p] {(parallel delete data files in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
    )}
op3[1p] {(sequential delete descriptor file in delete G:/Projects/QHEST_IM_DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op3,p0]
    )}
It runs 10 processes on 4 nodes.
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Don't know. Can we see the generated OSH? I'm particularly interested in the record schemas - obfuscate the column names if need be, but preserve the one-to-one relationship between real names and obfuscated names.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

Ray,

I've created another job with a much simpler schema and using a row generator.

Here's the score.

Code:

main_program: This step has 3 datasets:
ds0: {op0[1p] (sequential Row_Generator_5)
      eAny<>eCollectAny
      op1[4p] (parallel Copy_1)}
ds1: {op2[4p] (parallel delete data files in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)
      >>eCollectAny
      op3[1p] (sequential delete descriptor file in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)}
ds2: {op1[4p] (parallel Copy_1)
      =>
      G:/Projects/DEV/Workbenches/JimParadies/jim.ds}
It has 4 operators:
op0[1p] {(sequential Row_Generator_5)
    on nodes (
      node1[op0,p0]
    )}
op1[4p] {(parallel Copy_1)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
      node3[op1,p2]
      node4[op1,p3]
    )}
op2[4p] {(parallel delete data files in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
    )}
op3[1p] {(sequential delete descriptor file in delete G:/Projects/DEV/Workbenches/JimParadies/jim.ds)
    on nodes (
      node1[op3,p0]
    )}
It runs 10 processes on 4 nodes.
And here's the OSH

Code:

[ident('Copy_1'); jobmon_ident('Copy_1')]
## Inputs
0< [] 'Row_Generator_5:DSLink3.v'
## Outputs
0> [modify (
keep
  bth_dte,emp_id,sex;
)] 'Copy_1:DSLink4.v'
;
#################################################################
#### STAGE: Data_Set_2
## Operator
copy
## General options
[ident('Data_Set_2')]
## Inputs
0< [] 'Copy_1:DSLink4.v'
## Outputs
0>| [ds] 'G:\\Projects\\DEV\\Workbenches\\JimParadies\\jim.ds'
;
#################################################################
#### STAGE: Row_Generator_5
## Operator
generator
## Operator options
-schema record
(
  bth_dte:nullable string[max=10];
  emp_id:nullable string[max=12];
  sex:nullable string[max=2];
)
-records 10
## General options
[ident('Row_Generator_5'); jobmon_ident('Row_Generator_5')]
## Outputs
0> [] 'Row_Generator_5:DSLink3.v'
;
# End of OSH code
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The only thing I can suggest is that the Copy stage is kept because its output does not connect to a virtual data set. Indeed, the operator associated with the Data Set stage is copy, which may mean that "your" Copy stage was eliminated but a copy operator remains to transfer data to the Data Set. Test this theory by placing any stage type (except a Modify stage or another Copy stage) between the Copy stage and the Data Set stage.
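This theory can be sketched as a small rule (illustrative Python only; the names are invented and "feeds_virtual_dataset" stands for "the stage's output link goes to another operator rather than a persistent .ds file" - this is not actual DataStage engine code):

```python
# Illustrative sketch of the theory above, not actual engine logic:
# a copy operator is eliminated only when ALL of these hold; otherwise
# it survives into the score.

def copy_is_eliminated(force, modifies_records, feeds_virtual_dataset):
    return (not force) and (not modifies_records) and feeds_virtual_dataset

# Jim's job: straight copy, Force=False, but the output lands in a
# persistent dataset -- under this theory the copy operator is kept:
print(copy_is_eliminated(force=False, modifies_records=False,
                         feeds_virtual_dataset=False))  # False -> kept
```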
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.