Change Capture and unique constraints

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ASU_ETL_DEV
Premium Member
Posts: 43
Joined: Tue Sep 09, 2008 1:56 pm

Change Capture and unique constraints

Post by ASU_ETL_DEV »

Hello,
In order to discover the source table's latest inserts, updates and deletes (the table has a two-column primary key plus an additional unique index on another column), I coded a change capture job flow composed of the steps outlined below.
Job 5) aborts after more than 50 unique constraint warnings on the secondary unique index.
All primary key and secondary unique index columns have been flagged as key columns throughout the jobs. The jobs run with a 4-node configuration file.

I am trying to figure out whether I need to sequence the updates by first applying the deletes, then the inserts and then the updates, or whether the flow and options I chose have fundamental flaws.

Any help is appreciated.

Thanks.
Marco

Code:

1) 
Sequence job that controls the four parallel jobs: it executes jobs 2) and 3) concurrently and then 4) and 5) sequentially. 

Code:

2) 
After job
   An Oracle Connector (Preserve partitioning=propagate; 
                        Execution mode=parallel) with Round Robin partitioned reads links to a 
   Transformer (Preserve partitioning=propagate; 
                Combinability mode=auto; 
                Execution mode=propagate) which links to an 
   (After) Dataset stage (Partition type=auto; 
                          Combinability mode=auto; 
                          Execution mode=parallel) 

Code:

3) 
Before job
   An Oracle Connector (Preserve partitioning=propagate; 
                        Execution mode=parallel) with Round Robin partitioned reads links to a 
   (Before) Dataset stage (Partition type=auto; 
                           Combinability mode=auto; 
                           Execution mode=parallel)

Code:

4) 
Change Capture job
   After Dataset stage (Preserve partitioning=propagate; 
                        Combinability mode=auto; 
                        Execution mode=parallel) together with the 
   Before Dataset stage (Preserve partitioning=propagate; 
                         Combinability mode=auto; 
                         Execution mode=parallel) link to a 
   Change Capture stage (Change keys=primary key and unique index columns; 
                         Change mode=Explicit keys, All values; 
                         Drop out for Copy=True; 
                         Drop out for Delete/Edit/Insert=False;
                         Execution mode=parallel; 
                         Combinability mode=auto; 
                         Preserve partitioning=Default(set);
                         INPUT TAB (both After and Before links): Partitioning type=Hash; 
                                                                  Perform sort=Yes, on the three key columns with usage=Sorting, Partitioning/Ascending) which links to a 
   (CCapture) Dataset stage (Partition type=auto; 
                             Combinability mode=auto; 
                             Execution mode=parallel) 

Code:

5) 
Update target job
   CCapture Dataset stage (Execution mode=parallel; 
                           Combinability mode=auto; 
                           Preserve partitioning=propagate) links to a 
   Transformer stage (Execution mode=parallel; 
                      Combinability mode=auto; 
                      Preserve partitioning=propagate;
                      INPUT TAB: Partition type=Auto) which links to three
   (INS, UPD, DEL) Oracle Connectors (Execution mode=parallel;
                                      Preserve partitioning=propagate)
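
For reference, the comparison that the Change Capture stage in job 4) performs once both inputs are hash partitioned and sorted on the change keys boils down to matching before and after rows by key. The Python below is only a conceptual sketch of that classification (dictionary-based rather than a sorted merge, and not DataStage code); the column names and data are made up.

Code:

# Conceptual sketch of key-based change detection, not the stage's implementation.
def change_capture(before_rows, after_rows, keys):
    """Classify rows as insert, edit, copy or delete by comparing on the key columns."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    before = {key_of(r): r for r in before_rows}
    after = {key_of(r): r for r in after_rows}

    changes = []
    for k, row in after.items():
        if k not in before:
            changes.append(("insert", row))
        elif row != before[k]:
            changes.append(("edit", row))
        else:
            changes.append(("copy", row))    # dropped when "Drop out for Copy=True"
    for k, row in before.items():
        if k not in after:
            changes.append(("delete", row))
    return changes

# Tiny example with a two-column key:
before = [{"pk1": 1, "pk2": 1, "val": "x"}, {"pk1": 2, "pk2": 1, "val": "y"}]
after  = [{"pk1": 1, "pk2": 1, "val": "x2"}, {"pk1": 3, "pk2": 1, "val": "z"}]
print(change_capture(before, after, keys=["pk1", "pk2"]))
# -> edit for (1,1), insert for (3,1), delete for (2,1)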
ASU Developer
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Did you verify that the unique constraint violation is on the columns you are specifying as keys in the Change Capture stage? Is there any other unique key, such as a surrogate key, that is generating old values?

Does it run without problems on one node?
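
To pin down which columns the violated constraint actually covers, one option is to list the unique indexes on the target table straight from the Oracle dictionary. A sketch using cx_Oracle; the connect string, schema and table names are placeholders.

Code:

# List the columns behind each unique index on the target table.
# Connect string, owner and table name are placeholders.
import cx_Oracle

conn = cx_Oracle.connect("user/password@host:1521/service")
cur = conn.cursor()
cur.execute(
    """
    SELECT i.index_name, i.uniqueness, ic.column_name, ic.column_position
      FROM all_indexes i
      JOIN all_ind_columns ic
        ON ic.index_owner = i.owner AND ic.index_name = i.index_name
     WHERE i.table_owner = :owner AND i.table_name = :tab
       AND i.uniqueness = 'UNIQUE'
     ORDER BY i.index_name, ic.column_position
    """,
    owner="MYSCHEMA", tab="MY_TARGET_TABLE",
)
for index_name, uniqueness, column_name, position in cur:
    print(index_name, uniqueness, column_name, position)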
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ASU_ETL_DEV
Premium Member
Posts: 43
Joined: Tue Sep 09, 2008 1:56 pm

Post by ASU_ETL_DEV »

Thank you for answering.

I verified and the unique constraint error is actually happening on the primary key index (they have very similar names), not the secondary unique one. It is a two-column primary key that is correctly defined in the table and throughout the jobs.

I am running the job on one node now and will report back with the results.
ASU Developer
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So if that does "fix" it and you want to go back to running on multiple nodes, we'll need to look at your partitioning.
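
The reason partitioning matters here (a rough sketch, not DataStage's actual hash function): with Hash partitioning on the change keys, the before and after versions of the same key always land on the same node, so the Change Capture stage can pair them up. With Auto or Round Robin they may land on different nodes and be reported as a spurious delete plus insert.

Code:

# Rough sketch of hash partitioning by key; not DataStage's actual hash function.
def partition_of(key_values, num_nodes=4):
    return hash(tuple(key_values)) % num_nodes

before_key = (101, 7)   # two-column key from the before dataset
after_key  = (101, 7)   # same key from the after dataset
# Hash partitioning sends both to the same node, so they can be compared:
assert partition_of(before_key) == partition_of(after_key)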
-craig

"You can never have too many knives" -- Logan Nine Fingers
ASU_ETL_DEV
Premium Member
Posts: 43
Joined: Tue Sep 09, 2008 1:56 pm

Post by ASU_ETL_DEV »

The run on one node did not fix it. Same story. I changed the secondary index to non-unique and did not flag its column as a key in any of the jobs, and the whole thing started working without further problems. I am wondering whether the Change Capture stage does not support multiple unique indexes, so that we would need to drop all but one.
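
One mechanical effect worth noting (a hypothetical illustration, not a statement about this particular data): if a column that is flagged as a change key changes value between the before and after images, the two images no longer share a key, so the stage sees a delete plus an insert rather than an edit.

Code:

# Hypothetical: the same source row before and after an update that changes
# the secondary, uniquely indexed column sec_col.
before = {"pk1": 1, "pk2": 1, "sec_col": "A", "val": "old"}
after  = {"pk1": 1, "pk2": 1, "sec_col": "B", "val": "new"}

def key_of(row, keys):
    return tuple(row[k] for k in keys)

# Keyed on the primary key only: both images share a key -> an edit (update).
print(key_of(before, ["pk1", "pk2"]) == key_of(after, ["pk1", "pk2"]))           # True

# Keyed on primary key + sec_col: the keys differ -> a delete plus an insert.
print(key_of(before, ["pk1", "pk2", "sec_col"]) ==
      key_of(after,  ["pk1", "pk2", "sec_col"]))                                 # False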
ASU Developer
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

That means there is a logic issue: the records that were unique on the primary index were not unique on the secondary index. You should look at the data to see whether you can actually have both unique indexes.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I'm confused (which doesn't happen often). What have indexes got to do with the Change Capture stage, which processes streams of data?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ASU_ETL_DEV
Premium Member
Posts: 43
Joined: Tue Sep 09, 2008 1:56 pm

Post by ASU_ETL_DEV »

In the property page of the Change Capture stage, in the Change Keys section, I specified the primary key columns. In the options section, as Change Mode, I selected "Explicit keys, all values".
In the Input tab I specified Hash partitioning with sort on the primary key columns for both Before and After datasets.
I fed the resulting dataset to three Oracle Connectors, which processed the data according to the value in ChangeCode() (insert, update, delete). The "insert" connector is the one that produced the unique constraint messages.
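
For completeness, the routing that the job-5 Transformer performs on the change code amounts to the split sketched below (assuming the stage's default codes of copy=0, insert=1, delete=2, edit=3; this is not DataStage syntax).

Code:

# Sketch of splitting rows by change code for the INS, UPD and DEL connectors.
# Assumes the default Change Capture codes: copy=0, insert=1, delete=2, edit=3.
INSERT, DELETE, EDIT = 1, 2, 3

def route(rows):
    inserts = [r for r in rows if r["change_code"] == INSERT]   # -> INS connector
    deletes = [r for r in rows if r["change_code"] == DELETE]   # -> DEL connector
    edits   = [r for r in rows if r["change_code"] == EDIT]     # -> UPD connector
    return inserts, deletes, edits

rows = [{"pk": 1, "change_code": 1}, {"pk": 2, "change_code": 2}, {"pk": 3, "change_code": 3}]
print(route(rows))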
ASU Developer
ASU_ETL_DEV
Premium Member
Posts: 43
Joined: Tue Sep 09, 2008 1:56 pm

Post by ASU_ETL_DEV »

If I completely reload the target with the same set of jobs, I do not get this unique constraint error on the insert, so I am wondering why it happens on an incremental load.
Note that I am just staging the source data to the target table, and the source table has the same primary index and secondary unique index defined, so the data should be good, with no duplicate keys.
If I update the target through Oracle Connectors that are all active at the same time, is there a specific sequence in which the updates are applied to the target table (e.g. deletes first, then inserts and updates)?
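
On the ordering question: if the three connectors all write concurrently there is no guaranteed cross-link order, so one defensive pattern is to force a fixed order, deletes before inserts before updates, so that a key value released by a delete is free before the corresponding insert runs. A sketch of that ordering only; the apply_* functions are hypothetical stand-ins for the three target writes.

Code:

# Sketch of applying the change groups in a fixed order.  The apply_* arguments
# are hypothetical stand-ins for the DEL, INS and UPD target writes.
def apply_changes(deletes, inserts, edits, apply_deletes, apply_inserts, apply_updates):
    apply_deletes(deletes)   # free key values that may be re-used
    apply_inserts(inserts)   # new rows no longer collide with just-deleted keys
    apply_updates(edits)

# Usage with no-op stand-ins:
apply_changes([], [], [], print, print, print)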
ASU Developer
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

None of that tells me anything about why you believe the stage makes any use of indexes. What it actually makes use of is the fact that the data are sorted by the defined key.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

And the change capture stage expects unique keys on both of its input links.

You won't get a warning or fatal error if there are duplicate keys, but the generated change_code may be inaccurate.

For debugging purposes, check for duplicate keys on each input link.
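
A quick way to run that duplicate check outside DataStage is to export each input link (or its source query) to a flat file and count the key combinations; a sketch, with the file path and column names as placeholders.

Code:

# Count duplicate key combinations in an exported copy of one input link.
# File path and key column names are placeholders.
import csv
from collections import Counter

KEYS = ["PK_COL1", "PK_COL2", "SEC_UNIQUE_COL"]

with open("before_link_export.csv", newline="") as f:
    counts = Counter(tuple(row[k] for k in KEYS) for row in csv.DictReader(f))

dups = {key: n for key, n in counts.items() if n > 1}
print(len(dups), "duplicate key value(s)")
for key, n in sorted(dups.items()):
    print(key, n)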

Mike
ASU_ETL_DEV
Premium Member
Posts: 43
Joined: Tue Sep 09, 2008 1:56 pm

Post by ASU_ETL_DEV »

Ray, I was not implying that the stage uses the Oracle indexes or any other index. But the stage sorts the before/after data according to key columns that, in my case, reflect the Oracle tables' primary key columns.
During the change discovery the stage assigns the insert change code to records that later cause unique constraint violations.

Mike, NLS is turned on and the source database has a different character set than the target, so maybe the after and before datasets are not created equal in the key columns (and in other columns as well).
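
A minimal illustration of that suspicion (hypothetical values, not taken from this data): when a key value goes through a lossy or mismatched character set conversion, the before and after images can hold strings that look similar but no longer compare equal, which is enough to turn an edit into a delete plus insert.

Code:

# Hypothetical illustration of a character-set mismatch on a key value.
src = "Müller"                                                        # value as read from the source
converted = src.encode("latin-1").decode("utf-8", errors="replace")   # bytes reinterpreted in another charset
print(src == converted)                                               # False
print(repr(src), repr(converted))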
ASU Developer