Duplicate data in source

Suruchi
Participant
Posts: 16
Joined: Tue Dec 13, 2011 3:21 am

Duplicate data in source

Post by Suruchi »

When using the upsert option (update then insert) in the DB2 Connector stage (DataStage Parallel version 8.5), is it possible that records will be rejected if we have duplicate data in the source? Ideally everything from the source should be updated even if we have duplicates.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Generally one must always assume that every step can somehow go wrong.

That said, if you are doing update-then-insert you shouldn't be getting rejects due to primary key constraints (although you would still get a rejected record if a non-nullable column were updated with a null value).
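For illustration only, here is a rough sketch of what update-then-insert boils down to per input row (the table and column names below are made up, not taken from your job):

-- Update first, keyed on the columns marked as Key:
UPDATE target_tbl
SET    amount = ?, status = ?              -- non-key columns from the input row
WHERE  cust_id = ? AND order_id = ?;       -- key columns

-- Only if the UPDATE touched zero rows, insert instead:
INSERT INTO target_tbl (cust_id, order_id, amount, status)
VALUES (?, ?, ?, ?);

-- A duplicate source key just repeats the UPDATE against the same target row,
-- which is not an error in itself; but setting a NOT NULL column (e.g. status)
-- to NULL makes the statement fail, and that record gets rejected.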
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: Duplicate data in source

Post by PhilHibbs »

Suruchi wrote: When using the upsert option (update then insert) in the DB2 Connector stage (DataStage Parallel version 8.5), is it possible that records will be rejected if we have duplicate data in the source?
If two rows passing through your job have the same values in the columns that are ticked as "Key" columns, then each will update the same target row (or set of rows, if the table already contains two rows with those "Key" values). DataStage will not throw an error for this. You should, however, make sure that your partitioning sends both of these rows to the same node; otherwise you will get contention between the nodes, resulting in both poor performance and unpredictable behaviour. If they are on the same node, then you know that the last row processed is the one the result will reflect.

Probably it would be better to put in a Remove Duplicates stage to save your database from the additional load of performing redundant updates.
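As a rough SQL analogue of what that dedup step achieves, assuming some ordering column exists to decide which duplicate "wins" (staging_tbl, src_seq and the other names here are hypothetical):

-- Keep one row per key before the upsert, so each target row is updated only
-- once. src_seq is an assumed column that orders the duplicates; the last
-- occurrence per key is the one kept.
SELECT cust_id, order_id, amount, status
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY cust_id, order_id
                              ORDER BY src_seq DESC) AS rn
    FROM   staging_tbl s
) AS t
WHERE rn = 1;

In the job itself the equivalent would be a Remove Duplicates stage, hash-partitioned and sorted on the same key columns, placed before the connector.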
Phil Hibbs | Capgemini
Technical Consultant
Suruchi
Participant
Posts: 16
Joined: Tue Dec 13, 2011 3:21 am

Post by Suruchi »

Thanks Phil for your inputs.
I have specified the partitioning as "DB2 connector" in the DB2 Connector stage, where I have set the update-then-insert mode. Also, the table I am loading is a partitioned table. As far as nulls are concerned, I reject them through a Transformer stage just before upserting the records to the table.
In the lower environments, even when I get duplicates in the source, all records are updated; but in production, records are rejected. One reason, I believe, is that the lower environments run on a single node, so we do not face any issue there, whereas production runs on two nodes, so some records are rejected while performing the updates.
But I just want to make sure whether it is the environment, the partitioning, or something else that is causing this erroneous behaviour. Am I missing something?