Change Detect Capture (CDC) Stage overview

Amin · Post by **Amin** » Mon Mar 16, 2015 7:14 am

My job design looks like this:

Source1==>SortStage1==>
                        CDC Stage ==> SortStage3 ==> Target
Source2==>SortStage2==>

My questions are:
1: Since the CDC stage has a built-in option for sorting data, is it a bad approach to use a Sort stage before the CDC stage, and does it add performance overhead?
2: Is the CDC stage output data sorted or not?
3: Do we need a Sort stage after the CDC stage (SortStage3 in the design above), or is the output data already sorted?
4: Does the CDC stage compare data in parallel or in sequential mode when more than one server is available?

ray.wurlod · Post by **ray.wurlod** » Mon Mar 16, 2015 3:27 pm

1. Input link sorting is identical to the Sort stage, except that the Sort stage gives more flexibility (for example, allocation of memory to the sort and generation of a key change column).

2. Probably, since its input is sorted. There's nothing within the stage to change the sorted order of rows processed. However, if you re-partition downstream of the stage, all bets are off.

3. See 2.

4. Parallel (irrespective of the number of servers available). So you must ensure that your data are correctly partitioned.
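To illustrate point 2, here is a minimal Python sketch of why re-partitioning can destroy sort order. It is a toy model, not DataStage code: the functions `hash_partition` and `round_robin_repartition` are hypothetical names made up for this example, and the real engine's interleaving of rows during re-partitioning is not necessarily this deterministic.

```python
# Toy model only: partitions are plain Python lists, keys are integers.

def hash_partition(keys, n):
    """Assign each key to a partition using key modulo n (a stand-in for hashing)."""
    parts = [[] for _ in range(n)]
    for key in keys:
        parts[key % n].append(key)
    return parts

def round_robin_repartition(parts, n):
    """Read the partitions one after another and deal the rows out round-robin."""
    out = [[] for _ in range(n)]
    i = 0
    for part in parts:
        for row in part:
            out[i % n].append(row)
            i += 1
    return out

def is_sorted(rows):
    """True if rows are in non-decreasing order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))

keys = [1, 2, 3, 4, 5, 6, 7, 8]

# Each partition is sorted on its own, as it would be after a partitioned sort.
sorted_parts = [sorted(p) for p in hash_partition(keys, 2)]

# Re-partitioning mixes rows from different sorted partitions.
repartitioned = round_robin_repartition(sorted_parts, 2)

print(sorted_parts)   # every partition is sorted
print(repartitioned)  # partitions are no longer sorted
```

In this model each partition is sorted before re-partitioning, but afterwards neither partition is, which is the "all bets are off" case.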

Amin · Post by **Amin** » Tue Mar 17, 2015 7:58 am

#1, #2: Could you kindly provide some more detail?

#4: If the CDC stage executes in parallel, how does it compare data when the before data set's record for a given key is on one server and the after data set's record for the same key is on another server?
Kindly review this.

"ray.wurlod", thanks for the reply.

ray.wurlod · Post by **ray.wurlod** » Tue Mar 17, 2015 3:34 pm

More detail? Not really. All sorting in DataStage parallel jobs uses the tsort operator. The Sort stage gives more options than either input link sorting or inserted tsort operators.

The data have to come together (on the same server) to be processed by the stage. Correct partitioning will guarantee key adjacency, so change can reliably be detected because both (all) relevant records will be together.
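The idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not how the stage is implemented: `partition_by_key` and `detect_changes` are hypothetical helpers, keys are integers that are unique within each data set, `key % n` stands in for hash partitioning, and the change labels are plain strings chosen for readability.

```python
# Toy model: both inputs are partitioned on the same key with the same
# function, so rows with equal keys always land in the same partition.

def partition_by_key(rows, n):
    """Partition (key, value) rows on key and sort each partition by key."""
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[key % n].append((key, value))
    return [sorted(p) for p in parts]

def detect_changes(before, after):
    """Merge-compare two key-sorted lists that belong to the same partition."""
    changes = []
    i = j = 0
    while i < len(before) or j < len(after):
        if j == len(after) or (i < len(before) and before[i][0] < after[j][0]):
            changes.append((before[i][0], "delete"))  # key only in before
            i += 1
        elif i == len(before) or after[j][0] < before[i][0]:
            changes.append((after[j][0], "insert"))   # key only in after
            j += 1
        else:
            if before[i][1] != after[j][1]:
                changes.append((before[i][0], "edit"))  # same key, new value
            i += 1
            j += 1
    return changes

before = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
after = [(1, "a"), (2, "B"), (4, "d"), (5, "e")]

n = 2
before_parts = partition_by_key(before, n)
after_parts = partition_by_key(after, n)

# Each partition is compared independently; no partition needs rows from another.
all_changes = []
for b_part, a_part in zip(before_parts, after_parts):
    all_changes.extend(detect_changes(b_part, a_part))

print(sorted(all_changes))
```

If the two inputs were partitioned with different functions, or on different columns, matching keys could end up in different partitions, and the per-partition comparison would report them as a delete plus an insert instead of an edit or an unchanged row.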