SCD Type 2 producing incorrect inserts

Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Problem:
- When rerunning the job with no changes to the source data, the SCD stage identifies all rows as new inserts.
- This problem does not occur when running on a single node.

SCD configuration (a minimal sketch of the Type 2 logic it describes follows this list):
- Input partitioning tab: the dimension link partitioning = Auto; the input link partitioning = Hash on BAS_NK & SOURCE_SYSTEM_ID, sorted on BAS_NK, SOURCE_SYSTEM_ID, and SNAPSHOT_TIMESTAMP.
- Lookup tab purpose codes:
    Project_Baseline_Key=Surrogate Key
    BAS_NK=Business Key
    Baseline_Name=Type 2
    Description=Type 2
    Last_Update=Type 2
    Valid_From_Timestamp=Effective Date
    Valid_To_Timestamp=Expiration Date
    SOURCE_SYSTEM_ID=Business Key
- Lookup tab key expressions: only BAS_NK & SOURCE_SYSTEM_ID are used; all other columns are blank (i.e. not used for matching).
- Surrogate Key tab: uses a DB2 sequence.
- Dim Update tab:
    Project_Baseline_Key=NextSurrogateKey()
    BAS_NK=BAS_NK
    Baseline_Name=Baseline_Name
    Description=Description
    Last_Update=Last_Update
    Valid_From_Timestamp=SNAPSHOT_TIMESTAMP
    Valid_To_Timestamp='2999-12-31 23:59:59.000000'
    SOURCE_SYSTEM_ID=SOURCE_SYSTEM_ID
- Output Map: too many columns to list; the main thing is that Project_Baseline_Key is passed through, but the other dimension columns are not.
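
For reference, a minimal Python sketch of the Type 2 behaviour this configuration asks for; the dict-based dimension and the process_row/next_surrogate_key names are illustrative assumptions, not DataStage internals:

Code:

# Minimal sketch (illustrative only, not DataStage internals) of the
# Type 2 behaviour configured above: match on the business key, compare
# the Type 2 columns, expire the current row, insert a new version.
# Writing the upserts to the dimension update link is omitted for brevity.

HIGH_DATE = "2999-12-31 23:59:59.000000"   # high date used instead of null
TYPE2_COLS = ("Baseline_Name", "Description", "Last_Update")

def process_row(row, dim, next_surrogate_key):
    """dim maps (BAS_NK, SOURCE_SYSTEM_ID) -> the current dimension row."""
    key = (row["BAS_NK"], row["SOURCE_SYSTEM_ID"])
    current = dim.get(key)
    if current is None:
        pass                                   # new business key -> insert
    elif any(row[c] != current[c] for c in TYPE2_COLS):
        current["Valid_To_Timestamp"] = row["SNAPSHOT_TIMESTAMP"]  # expire
    else:
        return                                 # unchanged -> no activity
    dim[key] = {
        "Project_Baseline_Key": next_surrogate_key(),  # DB2 sequence stand-in
        "BAS_NK": row["BAS_NK"],
        "SOURCE_SYSTEM_ID": row["SOURCE_SYSTEM_ID"],
        **{c: row[c] for c in TYPE2_COLS},
        "Valid_From_Timestamp": row["SNAPSHOT_TIMESTAMP"],
        "Valid_To_Timestamp": HIGH_DATE,
    }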
Investigation:
- I had a look at the Score but do not see the problem. I've reduced the job to just the problem SCD stage, and attached the Score:

Code:

main_program: This step has 12 datasets:
ds0: {/apps/InformationServer/dsadmd/descriptors/TestInput.ds
      eAny=>eCollectAny
      op3[4p] (parallel dsTestInput)}
ds1: {op0[1p] (sequential db2DimBasOld)
      eAny<>eCollectAny
      op1[4p] (parallel copyPlaceholder4)}
ds2: {op1[4p] (parallel copyPlaceholder4)
      eAny=>eCollectAny
      op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)}
ds3: {op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)
      eOther(APT_HashPartitioner { key={ value=BAS_NK },
  key={ value=SOURCE_SYSTEM_ID }
})#>eCollectAny
      op6[4p] (parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)}
ds4: {op3[4p] (parallel dsTestInput)
      eAny=>eCollectAny
      op4[4p] (parallel copyPartitionBasNK)}
ds5: {op4[4p] (parallel copyPartitionBasNK)
      eOther(APT_HashPartitioner { key={ value=BAS_NK, 
        subArgs={ cs }
      },
  key={ value=SOURCE_SYSTEM_ID, 
        subArgs={ cs }
      }
})#>eCollectAny
      op5[4p] (parallel scdDimBaseline.BA10_RepPerd_Sort)}
ds6: {op5[4p] (parallel scdDimBaseline.BA10_RepPerd_Sort)
      [pp] eSame=>eCollectAny
      op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)}
ds7: {op6[4p] (parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)
      [pp] eSame=>eCollectAny
      op7[4p] (parallel APT_LUTCreateOp in scdDimBaseline)}
ds8: {op7[4p] (parallel APT_LUTCreateOp in scdDimBaseline)
      eEntire#>eCollectAny
      op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)}
ds9: {op7[4p] (parallel APT_LUTCreateOp in scdDimBaseline)
      eAny=>eCollectAny
      op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)}
ds10: {op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)
      eAny=>eCollectAny
      op9[4p] (parallel copyPlaceholder5)}
ds11: {op8[4p] (parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)
      [pp] eSame=>eCollectAny
      op10[4p] (parallel copyPlaceholder6)}
It has 11 operators:
op0[1p] {(sequential db2DimBasOld)
    on nodes (
      node1[op0,p0]
    )}
op1[4p] {(parallel copyPlaceholder4)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
      node3[op1,p2]
      node4[op1,p3]
    )}
op2[4p] {(parallel copyPlaceholder4:RF02Before-FILTERED)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
    )}
op3[4p] {(parallel dsTestInput)
    on nodes (
      node1[op3,p0]
      node2[op3,p1]
      node3[op3,p2]
      node4[op3,p3]
    )}
op4[4p] {(parallel copyPartitionBasNK)
    on nodes (
      node1[op4,p0]
      node2[op4,p1]
      node3[op4,p2]
      node4[op4,p3]
    )}
op5[4p] {(parallel scdDimBaseline.BA10_RepPerd_Sort)
    on nodes (
      node1[op5,p0]
      node2[op5,p1]
      node3[op5,p2]
      node4[op5,p3]
    )}
op6[4p] {(parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)
    on nodes (
      node1[op6,p0]
      node2[op6,p1]
      node3[op6,p2]
      node4[op6,p3]
    )}
op7[4p] {(parallel APT_LUTCreateOp in scdDimBaseline)
    on nodes (
      node1[op7,p0]
      node2[op7,p1]
      node3[op7,p2]
      node4[op7,p3]
    )}
op8[4p] {(parallel APT_TransformOperatorImplV24S13_Debug01_FactPrjRepPerdPerf_scdDimBaseline in scdDimBaseline)
    on nodes (
      node1[op8,p0]
      node2[op8,p1]
      node3[op8,p2]
      node4[op8,p3]
    )}
op9[4p] {(parallel copyPlaceholder5)
    on nodes (
      node1[op9,p0]
      node2[op9,p1]
      node3[op9,p2]
      node4[op9,p3]
    )}
op10[4p] {(parallel copyPlaceholder6)
    on nodes (
      node1[op10,p0]
      node2[op10,p1]
      node3[op10,p2]
      node4[op10,p3]
    )}
It runs 41 processes on 4 nodes.

Since it works on 1 node, my guess is that I've messed up the partitioning somehow. Or is it that we've used '2999-12-31 23:59:59.000000' to indicate the current rows, instead of null?
I'm sure it'll be something simple, and I'll scamper off embarrassed and with my tail between my legs.
But I just can't put my finger on the problem.
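
To illustrate the partitioning guess, here is a hypothetical model (Python's zlib.crc32 standing in for the engine's hash partitioner, and round-robin standing in for whatever Auto resolves to): if the two links are partitioned differently, a key's dimension row can land on a different node, the per-node lookup misses, and the row is treated as a new insert.

Code:

# Hypothetical model of the partitioning theory: source rows are
# hash-partitioned on the business key, but the dimension rows arrive
# via a different scheme (round-robin here, standing in for "Auto").
# Any key whose dimension row lands on another node fails the lookup,
# so the SCD stage treats the source row as a brand-new insert.

import zlib

NODES = 4
keys = [(f"BAS{i:03d}", "SYS1") for i in range(100)]

def hash_node(key):
    return zlib.crc32("|".join(key).encode()) % NODES

source_node = {k: hash_node(k) for k in keys}          # hash on both key columns
dim_node = {k: i % NODES for i, k in enumerate(keys)}  # round-robin

misses = sum(source_node[k] != dim_node[k] for k in keys)
print(f"{misses}/{len(keys)} keys have their dimension row on a different node")
# With a 1-node config file everything is on the same node, so the
# lookup always finds its row -- matching the behaviour described above.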

Any ideas?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Off the top of my head, I'd try setting the dimension link partitioning to match the input link: Hash on BAS_NK & SOURCE_SYSTEM_ID.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Thanks for the response, Craig, but it didn't fix things.

The Score now shows:

Code:

.
.
.
ds2: {op1[4p] (parallel copyPlaceholder4)
      eAny=>eCollectAny
      op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)}
ds3: {op1[4p] (parallel copyPlaceholder4)
      eAny=>eCollectAny
      op7[4p] (parallel APT_TransformOperatorImplV0S208_Debug01_FactPrjRepPerdPerf_Transformer_208 in Transformer_208)}
ds4: {op2[4p] (parallel copyPlaceholder4:RF02Before-FILTERED)
      eOther(APT_HashPartitioner { key={ value=BAS_NK },
  key={ value=SOURCE_SYSTEM_ID }
})#>eCollectAny
      op9[4p] (parallel inserted tsort operator {key={value=BAS_NK, subArgs={asc, nulls={value=first}, cs}}, key={value=SOURCE_SYSTEM_ID, subArgs={asc, nulls={value=first}, cs}}}(0) in scdDimBaseline)}

.
.
.
I have also copied the two sets of keys to a transformer and converted them to binary (to check for any non-printable characters), but the keys look like they should match.
After all, they do match when using 1 node.
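
For anyone repeating this check, a quick way to compare the two key sets outside DataStage; the file names and one-key-per-line layout are assumptions:

Code:

# Hypothetical helper for the check described above: dump each key as hex
# so non-printable characters and encoding differences become visible.
# Assumes both files hold one key per line, in the same sorted order.

def hex_of(s, encoding="utf-8"):
    return s.encode(encoding).hex(" ")

with open("input_keys.txt") as a, open("dim_keys.txt") as b:
    for n, (ka, kb) in enumerate(zip(a, b), start=1):
        ka, kb = ka.rstrip("\n"), kb.rstrip("\n")
        if ka != kb:
            print(f"line {n}: {hex_of(ka)}  vs  {hex_of(kb)}")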
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Also check the data types of the hash keys; they should match exactly. Remember that the hash of a Unicode string is different from that of a non-Unicode one, even if the string looks the same.

If it works on a single node, then wrong partitioning is the most likely cause.
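
To illustrate the point (CRC32 standing in for the engine's partitioner, and the encodings chosen for illustration): the same visible key encoded as a non-Unicode string versus a Unicode string yields different bytes, so it hashes differently and can land on a different partition.

Code:

# Illustration (CRC32 as a stand-in for the engine's partitioner): the
# "same" key as a non-Unicode vs a Unicode string gives different bytes,
# so it hashes differently and can land on a different partition.

import zlib

NODES = 4
key = "BAS001"

for encoding in ("latin-1", "utf-16-le"):   # non-Unicode vs Unicode layout
    data = key.encode(encoding)
    print(f"{encoding:10} bytes={data.hex(' ')}  node={zlib.crc32(data) % NODES}")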
Last edited by priyadarshikunal on Mon Jun 04, 2012 5:25 am, edited 1 time in total.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Wow, how did you do that!? It's as though you're looking over my shoulder.
Yes, I had missed that the input column was Unicode and the dimension column wasn't.

But correcting this oversight still didn't fix the problem.

Stranger still, if that were the cause, it shouldn't have found a match when running on 1 node either.

I'm getting even more puzzled by this problem.

Any ideas on how to approach debugging this one?
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

I made that mistake myself once, and it took me hours to figure out. Maybe just because I was looking at other parts of the job for the mistake.

I would probably look at the sort order (including case sensitivity and null position), the column metadata, the partitioning method and its order, and maybe at RCP as well. And after verifying all of that, a force compile.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Or you might just need another pair of eyes to verify the job. :wink:
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Yes, I've had a couple of other sets of eyes look at it, but none of us could identify the problem.
I force-compiled the job, but it didn't fix things.

The sort order looks okay to me. See the above Score dump.
The job does not use RCP.

On the positive side, I have learned much more about the SCD stage!
For example, it doesn't quite handle multiple snapshots of input data:
if the primary link has a snapshot date and we have, say, 2 days of data,
the SCD stage produces a set of upserts for Day 1 and another set of upserts for Day 2,
potentially resulting in 2 upserts for the same key (and for a Type 1 column, we do not know which row to keep).
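
One pre-processing workaround (my own sketch of an approach, not built-in SCD behaviour) would be to reduce the primary link to the latest snapshot per business key before the stage, so each run produces at most one upsert per key:

Code:

# Sketch of a pre-SCD deduplication step (an assumed workaround, not
# built-in SCD behaviour): keep only the latest SNAPSHOT_TIMESTAMP per
# business key, so each run produces at most one upsert per key.

from itertools import groupby
from operator import itemgetter

def latest_snapshot_only(rows):
    bkey = itemgetter("BAS_NK", "SOURCE_SYSTEM_ID")
    rows = sorted(rows, key=lambda r: (bkey(r), r["SNAPSHOT_TIMESTAMP"]))
    for _, versions in groupby(rows, key=bkey):
        yield list(versions)[-1]    # last in sort order = latest snapshot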