CDC Stage producing strange results

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
emblem_user
Premium Member
Premium Member
Posts: 8
Joined: Fri Sep 02, 2011 9:33 am

CDC Stage producing strange results

Post by emblem_user »

I am currently experiencing some strange behavior with the Change Data Capture stage in one of my jobs. Below is a diagram of the job.

Code: Select all

           Oracle (Before)
                    |                ______________>Dataset (Deletes)
                    |               /
                    V             /
Dataset (After)--->CDC--->Transformer----------->Dataset (Edits)
                                   \
                                    \
                                     -------------------->Dataset (Inserts)

The CDC stage is hash partitioned on both input links heading into the CDC stage. For both links, the hash partition is defined on the same 2 fields (MBR_DEMO_KEY and ADDR_TYPE), in the same order, and with the same sorting parameters. The CDC stage has those same 2 fields defined as the change keys.

For a given member that I was researching, there is only one record in the After set with the given MBR_DEMO_KEY and likewise there is only one record in the Before set with the given MBR_DEMO_KEY. Both records have the same ADDR_TYPE. For some reason this single record is producing 2 output records from the CDC, an edit and a delete.

I understand that it could be possible for there to be a delete and an insert generated if, for example, the ADDR_TYPE was different for both records with the same MBR_DEMO_KEY. An edit and a delete makes no sense though.

I decided to go to the QA environment and see if I could test out a few changes that might fix this problem. The first thing I did was to run the code unaltered to replicate the problem. In our QA environment, the same code/data produced the desired result (just a single edit record).

What could possibly be causing this strange behavior in our production environment?

Any help would be greatly appreciated!
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

That really does sound like a strange problem indeed. Can you make a test job, copied from you original one where you do an inner join instead of CDC stage and just output a peek stage. This would let you see if DataStage actually sees the 2 keys as being identical for this set of records - for instance if the keys are strings and there is a hidden space or unprintable character in there somewhere.
SURA
Premium Member
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Re: CDC Stage producing strange results

Post by SURA »

If found these type of issues before and that is because, we are not comparing apple to apple. In out point of view, same OS,DS,DB. But in machine point of view, different H/W, drivers, etc etc. Especially i found difference in Date fields.


What is the data type for the key columns and the change value columns?

Fine to narrow down this issue, As Arndw said, create a copy of the job in the PROD itself and write the target into a peek,

Just pull few records and find what is happening. Still it is not working as expected, then create a job to write the source and reference data into a Sequential find and see how the data is for the Key coolumn and the change values are?

In other way use a Single node file and run the same job and see how it is handling?

Let me know how you go with this?
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
emblem_user
Premium Member
Premium Member
Posts: 8
Joined: Fri Sep 02, 2011 9:33 am

Post by emblem_user »

I found the problem. The Oracle before set was in fact returning multiple (identical) rows. While the ODS table only contains one row for the member, it was being joined to a subquery which was producing multiple rows. Apparently one of the tables was updated in production after the data had been copied to QA. As a result, the subquery returned one row in QA and two in production. Thanks for the help!
Post Reply