I have two datasets (basically the same file). I sort them (hash partitioned on the key columns) and then pass them into a Change Data Capture (CDC) stage. In the CDC stage I mark two of the three columns as key columns, and collect the output into another dataset.
When I look at the Director logs, however, I see a warning that reads:
"Change_Capture1: When Checking Operator: Defaulting <Col Name> in transfer from beforeRec to outputRec"
where <Col Name> is the third column, which is NOT marked as a key column in the CDC stage.
Am I missing something very obvious?
Also, I am undecided on the best stage (Difference or CDC) to use for the following requirement:
"Compare today's dataset with yesterday's dataset, and propagate only those records that have been updated or added (compare by key cols)"
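To make the requirement concrete, here is a rough Python sketch of the logic I am after. This is not DataStage code and says nothing about how either stage works internally; the function name `capture_changes` and the columns `cust_id`, `region` and `amount` are made up for illustration, and deleted records are deliberately ignored.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, object]


def capture_changes(
    before: Iterable[Row], after: Iterable[Row], key_cols: Sequence[str]
) -> List[Tuple[str, Row]]:
    """Return ("insert" | "update", row) for rows in `after` that are new or changed."""

    def key_of(row: Row) -> tuple:
        # Build the comparison key from the key columns only.
        return tuple(row[c] for c in key_cols)

    # Index yesterday's rows by key (assumes keys are unique in `before`).
    before_by_key = {key_of(r): r for r in before}

    changes: List[Tuple[str, Row]] = []
    for row in after:
        old = before_by_key.get(key_of(row))
        if old is None:
            changes.append(("insert", row))   # key not seen yesterday
        elif old != row:
            changes.append(("update", row))   # same key, a non-key value differs
        # identical rows are dropped; rows only in `before` (deletes) are ignored
    return changes


# Hypothetical data: two key columns (cust_id, region) and one value column.
yesterday = [
    {"cust_id": 1, "region": "EU", "amount": 10},
    {"cust_id": 2, "region": "US", "amount": 20},
]
today = [
    {"cust_id": 1, "region": "EU", "amount": 10},
    {"cust_id": 2, "region": "US", "amount": 25},
    {"cust_id": 3, "region": "US", "amount": 5},
]
changes = capture_changes(yesterday, today, ["cust_id", "region"])
```

With this sample data, only the changed record (cust_id 2) and the new record (cust_id 3) would be passed on; the unchanged record is dropped.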
Searching the forums, I found that most people think the Difference stage can be used if the dataset is small, while for larger datasets everyone recommends CDC. Is this valid? If so, why and how?