Duplicate data in source

Suruchi
Participant
Posts: 16
Joined: Tue Dec 13, 2011 3:21 am

Duplicate data in source

Post by Suruchi »

When using the upsert option (update then insert) in the DB2 Connector stage (DataStage Parallel version 8.5), is it possible that records will be rejected if we have duplicate data in the source? Ideally everything from the source should be updated even if we have duplicates.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Generally one must always assume that every step can somehow go wrong.

That said, if you are doing update-then-insert you shouldn't be getting rejects due to primary key constraints (although you would still get a rejected record if a non-nullable column were updated with a null value).
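For illustration only, here is a rough sketch of what update-then-insert boils down to per input row (the table and column names below are made up, not taken from your job):

-- Update first, keyed on the columns marked as Key:
UPDATE target_tbl
SET    amount = ?, status = ?              -- non-key columns from the input row
WHERE  cust_id = ? AND order_id = ?;       -- key columns

-- Only if the UPDATE touched zero rows, insert instead:
INSERT INTO target_tbl (cust_id, order_id, amount, status)
VALUES (?, ?, ?, ?);

-- A duplicate source key just repeats the UPDATE against the same target row,
-- which is not an error in itself; but setting a NOT NULL column (e.g. status)
-- to NULL makes the statement fail, and that record gets rejected.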
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: Duplicate data in source

Post by PhilHibbs »

Suruchi wrote: When using the upsert option (update then insert) in the DB2 Connector stage (DataStage Parallel version 8.5), is it possible that records will be rejected if we have duplicate data in the source?
If two rows passing through your job have the same values in the columns that are ticked as "Key" columns, then each will update the same target row (or set of rows, if the table already contains two rows with those "Key" values). DataStage will not throw an error for this. You should, however, make sure that your partitioning sends both of these rows to the same node; otherwise you will get contention between the nodes, resulting in both poor performance and unpredictable behaviour. If they are on the same node, then you know that the last row processed is the one the result will reflect.

Probably it would be better to put in a Remove Duplicates stage to save your database from the additional load of performing redundant updates.
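As a rough SQL analogue of what that dedup step achieves, assuming some ordering column exists to decide which duplicate "wins" (staging_tbl, src_seq and the other names here are hypothetical):

-- Keep one row per key before the upsert, so each target row is updated only
-- once. src_seq is an assumed column that orders the duplicates; the last
-- occurrence per key is the one kept.
SELECT cust_id, order_id, amount, status
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY cust_id, order_id
                              ORDER BY src_seq DESC) AS rn
    FROM   staging_tbl s
) AS t
WHERE rn = 1;

In the job itself the equivalent would be a Remove Duplicates stage, hash-partitioned and sorted on the same key columns, placed before the connector.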
Phil Hibbs | Capgemini
Technical Consultant
Suruchi
Participant
Posts: 16
Joined: Tue Dec 13, 2011 3:21 am

Post by Suruchi »

Thanks Phil for your inputs.
I have specified the partitioning as "DB2 connector" in the DB2 Connector stage, where I have set the update-then-insert mode. Also, the table I am loading is a partitioned table. As far as nulls are concerned, I reject them through a Transformer stage just before upserting the records to the table.
In the lower environments, even when I get duplicates in the source, all records are updated; but in production, records are rejected. One reason, I believe, is that the lower environments run on a single node, so we do not face any issue there, whereas production runs on two nodes, so some records are rejected while performing the updates.
But I just want to make sure whether it is the environment, the partitioning, or something else that is causing this erroneous behaviour. Am I missing something?