Query on Lookup job score

Posted: Thu Dec 30, 2010 10:10 pm
by LD
Hi,

I was testing Lookup performance in two configurations:
a) reference data partitioned using Auto (which uses Entire) and stream data also on Auto (probably round robin here);

b) both data sets sorted and hash partitioned.

The second option is faster, as expected, even with data skew. But I noticed something different in the score dump this time: in both options, additional data sets are created in the job score for the Lookup.
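
For readers comparing the two setups, here is a minimal Python sketch of how options (a) and (b) differ logically. It is an illustration only: the `zlib.crc32` hash is a stand-in and not DataStage's hash partitioner, and the key values and row counts are made up for the example.

```python
import zlib

NUM_PARTITIONS = 4  # mirrors the [4p] operators in the score dumps


def hash_partition(key, n=NUM_PARTITIONS):
    # Stand-in hash function (assumption); DataStage's own hash is not reproduced here.
    return zlib.crc32(repr(key).encode()) % n


def entire(reference, n=NUM_PARTITIONS):
    # Entire: every partition logically sees the full reference set.
    return [dict(reference) for _ in range(n)]


def hashed(reference, n=NUM_PARTITIONS):
    # Hash: each reference row goes to exactly one partition, chosen by its key.
    parts = [{} for _ in range(n)]
    for key, value in reference.items():
        parts[hash_partition(key, n)][key] = value
    return parts


# Hypothetical (DRG_TYPE, DRG_NO) keys, invented for this example.
reference = {("A", i): f"desc-{i}" for i in range(8)}
stream = [("A", i % 8) for i in range(20)]

# (a) Entire reference, stream rows dealt round robin: any partition can resolve any key.
entire_parts = entire(reference)
hits_a = [entire_parts[i % NUM_PARTITIONS][key] for i, key in enumerate(stream)]

# (b) Both inputs hashed on the same key: a stream row meets its reference row
# because both land in the same partition.
hash_parts = hashed(reference)
hits_b = [hash_parts[hash_partition(key)][key] for key in stream]

rows_seen_entire = sum(len(p) for p in entire_parts)
rows_seen_hash = sum(len(p) for p in hash_parts)
```

Both schemes return the same lookup results; the difference is that Entire exposes all reference rows to each of the four partitions, while hash partitioning splits them so each partition handles only its share.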

1) Score when Auto partitioning is used:

ds0: {/dsdata/application/MHIDEV/TargetFiles/LookupTest1
eAny->eCollectAny
op0[1p] (parallel input repartition(0))}
ds1: {op0[1p] (parallel input repartition(0))
eAny<>eCollectAny
op3[4p] (parallel Data_Set_0)}
ds2: {op1[1p] (sequential Oracle_Enterprise_5)
eEntire->eCollectAny
op2[1p] (parallel APT_LUTCreateOp in Lookup_1)}
ds3: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eEntire<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds4: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eAny<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds5: {op3[4p] (parallel Data_Set_0)
eAny=>eCollectAny
op4[4p] (parallel buffer(0))}
ds6: {op4[4p] (parallel buffer(0))
eSame=>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds7: {op5[4p] (parallel APT_LUTProcessOp in Lookup_1)
=>
/dsdata/application/MHIDEV/TargetFiles/LookupTest3}

In this job the stream data source was a Data Set, but a similar score is produced when I use an Oracle stage as the source, so this is not due to the Data Set stage.

The question is why it creates the two data sets below (ds3 and ds4) after reading the reference data in ds2:
ds2: {op1[1p] (sequential Oracle_Enterprise_5)
eEntire->eCollectAny
op2[1p] (parallel APT_LUTCreateOp in Lookup_1)}

ds3: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eEntire<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds4: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eAny<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}

In other jobs only one data set of this type was created.

2) Similar behavior is seen in the score for the hash-partitioned lookup:

ds0: {/dsdata/application/MHIDEV/TargetFiles/LookupTest2
[pp] eSame=>eCollectAny
op1[4p] (parallel Data_Set_10)}

ds1: {op0[1p] (sequential Oracle_Enterprise_5)
eOther(APT_HashPartitioner { key={ value=DRG_TYPE,
subArgs={ cs }
},
key={ value=DRG_NO }
})<>eCollectAny
op2[4p] (parallel APT_LUTCreateOp in Lookup_1)}

ds2: {op1[4p] (parallel Data_Set_10)
[pp] eSame=>eCollectAny
op3[4p] (parallel buffer(0))}

ds3: {op2[4p] (parallel APT_LUTCreateOp in Lookup_1)
[pp] eEntire#>eCollectAny
op4[4p] (parallel APT_LUTProcessOp in Lookup_1)}

ds4: {op2[4p] (parallel APT_LUTCreateOp in Lookup_1)
[pp] eSame=>eCollectAny
op4[4p] (parallel APT_LUTProcessOp in Lookup_1)}

ds5: {op3[4p] (parallel buffer(0))
[pp] eSame=>eCollectAny
op4[4p] (parallel APT_LUTProcessOp in Lookup_1)}

ds6: {op5[4p] (parallel delete data files in delete /dsdata/application/MHIDEV/TargetFiles/LookupTest3)
>>eCollectAny
op6[1p] (sequential delete descriptor file in delete /dsdata/application/MHIDEV/TargetFiles/LookupTest3)}

ds7: {op4[4p] (parallel APT_LUTProcessOp in Lookup_1)
=>
/dsdata/application/MHIDEV/TargetFiles/LookupTest3}

Posted: Fri Dec 31, 2010 3:29 am
by ray.wurlod
Because a Lookup stage generates what is called a "composite operator", made up of the two operators APT_LUTCreateOp and APT_LUTProcessOp. These need an intermediate (virtual) data set through which they can pass data.
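
To spot such operator pairs in a score dump, a small Python sketch like the one below can group data sets by their producing and consuming operators. The parsing rules are inferred only from the dumps quoted in this thread (a `dsN: {` line, then producer, partitioning, and consumer lines); this is not an official grammar for the score format, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Excerpt copied from the first score dump in this thread.
SCORE = """\
ds2: {op1[1p] (sequential Oracle_Enterprise_5)
eEntire->eCollectAny
op2[1p] (parallel APT_LUTCreateOp in Lookup_1)}
ds3: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eEntire<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
ds4: {op2[1p] (parallel APT_LUTCreateOp in Lookup_1)
eAny<>eCollectAny
op5[4p] (parallel APT_LUTProcessOp in Lookup_1)}
"""

ENTRY_START = re.compile(r"^(ds\d+):\s*\{", re.MULTILINE)
OP_ID = re.compile(r"(op\d+)\[")


def links_between_operators(score_text):
    """Map (producer op, consumer op) to the data sets that connect them."""
    starts = list(ENTRY_START.finditer(score_text))
    links = defaultdict(list)
    for i, m in enumerate(starts):
        # An entry runs from its "dsN: {" header up to the next header (or the end).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(score_text)
        lines = [ln.strip() for ln in score_text[m.end():end].splitlines() if ln.strip()]
        if not lines:
            continue
        producer = OP_ID.match(lines[0])   # first line names the producer
        consumer = OP_ID.match(lines[-1])  # last line names the consumer
        # Entries that start or end at a file path have no operator on that side; skip them.
        if producer and consumer:
            links[(producer.group(1), consumer.group(1))].append(m.group(1))
    return dict(links)


links = links_between_operators(SCORE)
for (src, dst), names in links.items():
    print(src, "->", dst, names)
```

On the excerpt above this reports ds2 between op1 and op2, and both ds3 and ds4 between op2 (APT_LUTCreateOp) and op5 (APT_LUTProcessOp), which are the two halves of the Lookup stage.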