SCD Type 2 producing incorrect inserts

Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Problem:
- When rerunning the job with no changes to the source data, the SCD stage identifies all rows as new inserts.
- This problem does not occur when running on a single node.

SCD configuration (a minimal sketch of the Type 2 logic it describes follows this list):
- Input partitioning tab: the dimension link partitioning = Auto; the input link partitioning = Hash on BAS_NK & SOURCE_SYSTEM_ID, sorted on BAS_NK, SOURCE_SYSTEM_ID, and SNAPSHOT_TIMESTAMP.
- Lookup tab purpose codes:
    Project_Baseline_Key=Surrogate Key
    BAS_NK=Business Key
    Baseline_Name=Type 2
    Description=Type 2
    Last_Update=Type 2
    Valid_From_Timestamp=Effective Date
    Valid_To_Timestamp=Expiration Date
    SOURCE_SYSTEM_ID=Business Key
- Lookup tab key expressions: only BAS_NK & SOURCE_SYSTEM_ID are used; all other columns are blank (i.e. not used for matching).
- Surrogate Key tab: uses a DB2 sequence.
- Dim Update tab:
    Project_Baseline_Key=NextSurrogateKey()
    BAS_NK=BAS_NK
    Baseline_Name=Baseline_Name
    Description=Description
    Last_Update=Last_Update
    Valid_From_Timestamp=SNAPSHOT_TIMESTAMP
    Valid_To_Timestamp='2999-12-31 23:59:59.000000'
    SOURCE_SYSTEM_ID=SOURCE_SYSTEM_ID
- Output Map: too many columns to list; the main thing is that Project_Baseline_Key is passed through, but the other dimension columns are not.
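
For reference, a minimal Python sketch of the Type 2 behaviour this configuration asks for; the dict-based dimension and the process_row/next_surrogate_key names are illustrative assumptions, not DataStage internals:

Code:

# Minimal sketch (illustrative only, not DataStage internals) of the
# Type 2 behaviour configured above: match on the business key, compare
# the Type 2 columns, expire the current row, insert a new version.
# Writing the upserts to the dimension update link is omitted for brevity.

HIGH_DATE = "2999-12-31 23:59:59.000000"   # high date used instead of null
TYPE2_COLS = ("Baseline_Name", "Description", "Last_Update")

def process_row(row, dim, next_surrogate_key):
    """dim maps (BAS_NK, SOURCE_SYSTEM_ID) -> the current dimension row."""
    key = (row["BAS_NK"], row["SOURCE_SYSTEM_ID"])
    current = dim.get(key)
    if current is None:
        pass                                   # new business key -> insert
    elif any(row[c] != current[c] for c in TYPE2_COLS):
        current["Valid_To_Timestamp"] = row["SNAPSHOT_TIMESTAMP"]  # expire
    else:
        return                                 # unchanged -> no activity
    dim[key] = {
        "Project_Baseline_Key": next_surrogate_key(),  # DB2 sequence stand-in
        "BAS_NK": row["BAS_NK"],
        "SOURCE_SYSTEM_ID": row["SOURCE_SYSTEM_ID"],
        **{c: row[c] for c in TYPE2_COLS},
        "Valid_From_Timestamp": row["SNAPSHOT_TIMESTAMP"],
        "Valid_To_Timestamp": HIGH_DATE,
    }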
Investigation:
- I had a look at the Score but do not see the problem. I've reduced the job to just the problem SCD stage, and attached the Score:

Code:

main_program: This step has 12 datasets:
ds0: {/apps/InformationServer/dsadmd/descriptors/TestInput.ds
      eAny=>eCollectAny
      op3[4p] (parallel dsTestInput)}
ds1: {op0[1p] (sequential db2DimBasOld)
      eAny<>eCollectAny
      op1[4p] (parallel copyPlaceholder4)}
ds2: {op1[4p] (parallel copyPlaceholder4)
      eAny=>eCollectAny
      op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)}
ds3: {op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)
      eOther(APT_HashPartitioner { key={ value=BAS_NK },
  key={ value=SOURCE_SYSTEM_ID }
})#>eCollectAny
      op6[4p] (parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)}
ds4: {op3[4p] (parallel dsTestInput)
      eAny=>eCollectAny
      op4[4p] (parallel copyPartitionBasNK)}
ds5: {op4[4p] (parallel copyPartitionBasNK)
      eOther(APT_HashPartitioner { key={ value=BAS_NK, 
        subArgs={ cs }
      },
  key={ value=SOURCE_SYSTEM_ID, 
        subArgs={ cs }
      }
})#>eCollectAny
      op5[4p] (parallel scdDimBaseline.BA10_RepPerd_Sort)}
ds6: {op5[4p] (parallel scdDimBaseline.BA10_RepPerd_Sort)
      [pp] eSame=>eCollectAny
      op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)}
ds7: {op6[4p] (parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)
      [pp] eSame=>eCollectAny
      op7[4p] (parallel APT_LUTCreateOp in scdDimBaseline)}
ds8: {op7[4p] (parallel APT_LUTCreateOp in scdDimBaseline)
      eEntire#>eCollectAny
      op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)}
ds9: {op7[4p] (parallel APT_LUTCreateOp in scdDimBaseline)
      eAny=>eCollectAny
      op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)}
ds10: {op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)
      eAny=>eCollectAny
      op9[4p] (parallel copyPlaceholder5)}
ds11: {op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)
      [pp] eSame=>eCollectAny
      op10[4p] (parallel copyPlaceholder6)}
It has 11 operators:
op0[1p] {(sequential db2DimBasOld)
    on nodes (
      node1[op0,p0]
    )}
op1[4p] {(parallel copyPlaceholder4)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
      node3[op1,p2]
      node4[op1,p3]
    )}
op2[4p] {(parallel copyPlaceholder4:RF02Before-FILTERED)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
    )}
op3[4p] {(parallel dsTestInput)
    on nodes (
      node1[op3,p0]
      node2[op3,p1]
      node3[op3,p2]
      node4[op3,p3]
    )}
op4[4p] {(parallel copyPartitionBasNK)
    on nodes (
      node1[op4,p0]
      node2[op4,p1]
      node3[op4,p2]
      node4[op4,p3]
    )}
op5[4p] {(parallel scdDimBaseline.BA10_RepPerd_Sort)
    on nodes (
      node1[op5,p0]
      node2[op5,p1]
      node3[op5,p2]
      node4[op5,p3]
    )}
op6[4p] {(parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)
    on nodes (
      node1[op6,p0]
      node2[op6,p1]
      node3[op6,p2]
      node4[op6,p3]
    )}
op7[4p] {(parallel APT_LUTCreateOp in scdDimBaseline)
    on nodes (
      node1[op7,p0]
      node2[op7,p1]
      node3[op7,p2]
      node4[op7,p3]
    )}
op8[4p] {(parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)
    on nodes (
      node1[op8,p0]
      node2[op8,p1]
      node3[op8,p2]
      node4[op8,p3]
    )}
op9[4p] {(parallel copyPlaceholder5)
    on nodes (
      node1[op9,p0]
      node2[op9,p1]
      node3[op9,p2]
      node4[op9,p3]
    )}
op10[4p] {(parallel copyPlaceholder6)
    on nodes (
      node1[op10,p0]
      node2[op10,p1]
      node3[op10,p2]
      node4[op10,p3]
    )}
It runs 41 processes on 4 nodes.

Since it works on 1 node, my guess is that I've messed up the partitioning somehow. Or is it that we've used '2999-12-31 23:59:59.000000' to indicate the current rows, instead of null?
I'm sure it'll be something simple, and I'll scamper off embarrassed and with my tail between my legs.
But I just can't put my finger on the problem.
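
To illustrate the partitioning guess, here is a hypothetical model (Python's zlib.crc32 standing in for the engine's hash partitioner, and round-robin standing in for whatever Auto resolves to): if the two links are partitioned differently, a key's dimension row can land on a different node, the per-node lookup misses, and the row is treated as a new insert.

Code:

# Hypothetical model of the partitioning theory: source rows are
# hash-partitioned on the business key, but the dimension rows arrive
# via a different scheme (round-robin here, standing in for "Auto").
# Any key whose dimension row lands on another node fails the lookup,
# so the SCD stage treats the source row as a brand-new insert.

import zlib

NODES = 4
keys = [(f"BAS{i:03d}", "SYS1") for i in range(100)]

def hash_node(key):
    return zlib.crc32("|".join(key).encode()) % NODES

source_node = {k: hash_node(k) for k in keys}          # hash on both key columns
dim_node = {k: i % NODES for i, k in enumerate(keys)}  # round-robin

misses = sum(source_node[k] != dim_node[k] for k in keys)
print(f"{misses}/{len(keys)} keys have their dimension row on a different node")
# With a 1-node config file everything is on the same node, so the
# lookup always finds its row -- matching the behaviour described above.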

Any ideas?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Off the top of my head, I'd try setting the dimension link partitioning to match the input link: Hash on BAS_NK & SOURCE_SYSTEM_ID.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Thanks for the response, Craig, but it didn't fix things.

The Score now shows:

Code:

.
.
.
ds2: {op1[4p] (parallel copyPlaceholder4)
      eAny=>eCollectAny
      op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)}
ds3: {op1[4p] (parallel copyPlaceholder4)
      eAny=>eCollectAny
      op7[4p] (parallel APT_TransformOperatorImplV0S208_Debug01_FactPrjRepPerdPerf_Transformer_208 in Transformer_208)}
ds4: {op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)
      eOther(APT_HashPartitioner { key={ value=BAS_NK },
  key={ value=SOURCE_SYSTEM_ID }
})#>eCollectAny
      op9[4p] (parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)}

.
.
.
I have also copied the two sets of keys to a transformer and converted them to binary (to check for any non-printable characters), but the keys look like they should match.
After all, they do match when using 1 node.
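
For anyone repeating this check, a quick way to compare the two key sets outside DataStage; the file names and one-key-per-line layout are assumptions:

Code:

# Hypothetical helper for the check described above: dump each key as hex
# so non-printable characters and encoding differences become visible.
# Assumes both files hold one key per line, in the same sorted order.

def hex_of(s, encoding="utf-8"):
    return s.encode(encoding).hex(" ")

with open("input_keys.txt") as a, open("dim_keys.txt") as b:
    for n, (ka, kb) in enumerate(zip(a, b), start=1):
        ka, kb = ka.rstrip("\n"), kb.rstrip("\n")
        if ka != kb:
            print(f"line {n}: {hex_of(ka)}  vs  {hex_of(kb)}")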
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Also check the data types of the hash keys; they should match exactly. Remember that the hash of a Unicode string is different from that of a non-Unicode one, even if the string looks the same.

If it works on a single node, then wrong partitioning is the most likely cause.
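
To illustrate the point (CRC32 standing in for the engine's partitioner, and the encodings chosen for illustration): the same visible key encoded as a non-Unicode string versus a Unicode string yields different bytes, so it hashes differently and can land on a different partition.

Code:

# Illustration (CRC32 as a stand-in for the engine's partitioner): the
# "same" key as a non-Unicode vs a Unicode string gives different bytes,
# so it hashes differently and can land on a different partition.

import zlib

NODES = 4
key = "BAS001"

for encoding in ("latin-1", "utf-16-le"):   # non-Unicode vs Unicode layout
    data = key.encode(encoding)
    print(f"{encoding:10} bytes={data.hex(' ')}  node={zlib.crc32(data) % NODES}")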
Last edited by priyadarshikunal on Mon Jun 04, 2012 5:25 am, edited 1 time in total.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Wow, how did you do that!? It's as though you're looking over my shoulder.
Yes, I had missed that the input column was Unicode and the dimension column wasn't.

But correcting this oversight still didn't fix the problem.

Stranger still, if that were the cause, it shouldn't have found a match when running on 1 node either.

I'm getting even more puzzled by this problem.

Any ideas on how to approach debugging this one?
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

I made that mistake myself once, and it took me hours to figure out. Maybe just because I was looking at other parts of the job for the mistake.

I would probably look at the sort order (including case sensitivity and null position), the column metadata, the partitioning method and its order, and maybe at RCP as well. And after verifying all of that, a force compile.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Or you might just need another pair of eyes to verify the job. :wink:
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Yes, I've had a couple of other sets of eyes look at it, but none of us could identify the problem.
I force-compiled the job, but it didn't fix things.

The sort order looks okay to me. See the above Score dump.
The job does not use RCP.

On the positive side, I have learned much more about the SCD stage!
For example, it doesn't quite handle multiple snapshots of input data:
if the primary link has a snapshot date and we have, say, 2 days of data,
the SCD stage produces a set of upserts for Day 1 and another set of upserts for Day 2,
potentially resulting in 2 upserts for the same key (and for a Type 1 column, we do not know which row to keep).
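
One pre-processing workaround (my own sketch of an approach, not built-in SCD behaviour) would be to reduce the primary link to the latest snapshot per business key before the stage, so each run produces at most one upsert per key:

Code:

# Sketch of a pre-SCD deduplication step (an assumed workaround, not
# built-in SCD behaviour): keep only the latest SNAPSHOT_TIMESTAMP per
# business key, so each run produces at most one upsert per key.

from itertools import groupby
from operator import itemgetter

def latest_snapshot_only(rows):
    bkey = itemgetter("BAS_NK", "SOURCE_SYSTEM_ID")
    rows = sorted(rows, key=lambda r: (bkey(r), r["SNAPSHOT_TIMESTAMP"]))
    for _, versions in groupby(rows, key=bkey):
        yield list(versions)[-1]    # last in sort order = latest snapshot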