Query on Lookup job score

LD
Premium Member
Posts: 32
Joined: Thu Oct 21, 2010 9:03 am

Post by LD »

Hi,

I was testing lookup performance in two cases:
a) reference data partitioned with Auto (which uses Entire) and stream data also on Auto (probably round robin here)

b) both data sets sorted and hash partitioned.

The second option is faster, as expected, even with data skew. But this time I noticed something different in the score dump: in both cases the job score shows additional data sets created for the lookup.

1) Score when Auto partition is used:

ds0: {/dsdata/application/MHIDEV/TargetFiles/LookupTest1
eAny->eCollectAny
op0[1p] (parallel input repartition(0))}
ds1: {op0[1p] (parallel input repartition(0))
eAny<>eCollectAny
op3[4p] (parallel Data_Set_0)}
ds2: {op1[1p] (sequential Oracle_Enterprise_5)
eEntire->eCollectAny
op2[1p] (parallel APT_LUTCreateOp in Lookup_1)}
ds3: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eEntire<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds4: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eAny<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds5: {op3[4p] (parallel Data_Set_0)
eAny=>eCollectAny
op4[4p] (parallel buffer(0))}
ds6: {op4[4p] (parallel buffer(0))
eSame=>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds7: {op5[4p] (parallel APT_LUTProcessOp in Lookup_1)
=>
/dsdata/application/MHIDEV/TargetFiles/LookupTest3}

In this job the stream data source was a Data Set, but a similar score appears when I use an Oracle stage as the source, so this is not caused by the Data Set stage.

The question is why it creates the two data sets below (ds3 and ds4) after reading the reference data into ds2:
ds2: {op1[1p] (sequential Oracle_Enterprise_5)
eEntire->eCollectAny
op2[1p] (parallel APT_LUTCreateOp in Lookup_1)}

ds3: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eEntire<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds4: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eAny<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}

In other jobs only one data set of this type was created.

2) Similar behavior is seen in the score for the hash-partitioned lookup:

ds0: {/dsdata/application/MHIDEV/TargetFiles/LookupTest2
[pp] eSame=>eCollectAny
op1[4p] (parallel Data_Set_10)}

ds1: {op0[1p] (sequential Oracle_Enterprise_5)
eOther(APT_HashPartitioner { key={ value=DRG_TYPE,
subArgs={ cs }
},
key={ value=DRG_NO }
})<>eCollectAny
op2[4p] (parallel APT_LUTCreateOp in Lookup_1)}

ds2: {op1[4p] (parallel Data_Set_10)
[pp] eSame=>eCollectAny
op3[4p] (parallel buffer(0))}

ds3: {op2[4p] (parallel APT_LUTCreateOp in Lookup_1)
[pp] eEntire#>eCollectAny
op4[4p] (parallel APT_LUTProcessOp in Lookup_1)}

ds4: {op2[4p] (parallel APT_LUTCreateOp in Lookup_1)
[pp] eSame=>eCollectAny
op4[4p] (parallel APT_LUTProcessOp in Lookup_1)}

ds5: {op3[4p] (parallel buffer(0))
[pp] eSame=>eCollectAny
op4[4p] (parallel APT_LUTProcessOp in Lookup_1)}

ds6: {op5[4p] (parallel delete data files in delete /dsdata/application/MHIDEV/TargetFiles/LookupTest3)
>>eCollectAny
op6[1p] (sequential delete descriptor file in delete /dsdata/application/MHIDEV/TargetFiles/LookupTest3)}

ds7: {op4[4p] (parallel APT_LUTProcessOp in Lookup_1)
=>
/dsdata/application/MHIDEV/TargetFiles/LookupTest3}
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A Lookup stage generates what is called a "composite operator", made up of two operators, which appear in the score as APT_LUTCreateOp and APT_LUTProcessOp. These need an intermediate (virtual) data set through which they can pass data, so the extra data sets in the score are the links between those two operators.
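
To see which data sets in a score are internal to the Lookup, a small script can help. The following is only a rough sketch in Python, not anything shipped with DataStage: the function name parse_score and the sample text (copied from ds2 to ds4 of the first score above) are illustrative assumptions. It treats the first line of each "dsN: {...}" entry as the producer and the last line as the consumer, then lists the data sets that run from APT_LUTCreateOp to APT_LUTProcessOp.

```python
import re

# Sample entries copied from the first score dump in this thread.
SCORE = """\
ds2: {op1[1p] (sequential Oracle_Enterprise_5)
eEntire->eCollectAny
op2[1p] (parallel APT_LUTCreateOp in Lookup_1)}
ds3: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eEntire<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds4: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eAny<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
"""


def parse_score(text):
    """Return a list of (dataset, producer, consumer) tuples.

    Hypothetical helper: assumes each entry starts with 'dsN: {' at the
    beginning of a line and ends with '}' before the next entry or the
    end of the text.
    """
    links = []
    pattern = r"^(ds\d+): \{(.*?)\}\s*(?=^ds\d+: \{|\Z)"
    for m in re.finditer(pattern, text, re.S | re.M):
        lines = [ln.strip() for ln in m.group(2).splitlines() if ln.strip()]
        # First line is the producer, last line is the consumer;
        # anything in between describes partitioning and collection.
        links.append((m.group(1), lines[0], lines[-1]))
    return links


links = parse_score(SCORE)

# Data sets that connect the two halves of the composite Lookup operator.
internal = [ds for ds, src, dst in links
            if "APT_LUTCreateOp" in src and "APT_LUTProcessOp" in dst]
print(internal)
```

Run against the sample, this reports ds3 and ds4 as the data sets passing between the two operators, while ds2 is the link that feeds the reference data into APT_LUTCreateOp.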
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.