Changing APT_CONFIG_FILE and Duplicate data

itsvarunm · Post by **itsvarunm** » Wed Sep 26, 2012 4:56 am

Hi,

I have created my DS jobs without any partitioning and on a single-node configuration. If I change these jobs to run in a multi-node environment, is there any chance of duplicates being generated in the data? I remember facing such an issue some time back in another project. Please advise.

Thanks,
Varun

BI-RMA · Post by **BI-RMA** » Wed Sep 26, 2012 6:24 am

It is almost impossible to construct a situation where a change from a single-node configuration to a multi-node configuration will produce duplicates. A single row is sent downstream to either one partition or the other, never to more than one. Auto-partitioning will usually handle all necessary sorting and repartitioning, but in some cases it may not choose the most efficient partitioning strategy.

In multi-node configurations it is much more likely that you lose rows, for example by processing matching keys on different nodes in a join operation. This is something you have to be very careful about when switching to specific partitioning options instead of "auto".

ulab · Post by **ulab** » Wed Sep 26, 2012 6:29 am

Hi Varun,

What partitioning method did you use in your job?

ArndW · Post by **ArndW** » Wed Sep 26, 2012 6:50 am

I disagree with BI-RMA. When going from a 1-node configuration (and design) to a multi-node configuration there are several conditions where the results in the two environments may differ. The outcome depends upon factors such as the hashing algorithms chosen and the settings of "APT_NO_PART_INSERTION", "APT_NO_SORT_INSERTION" and "APT_NO_PARTSORT_OPTIMIZATION".

Stages with 2 or more input links are the ones most often affected. The most common among these are "join" and "lookup", where problems show up as dropped records or additional records, depending upon the situation.

Take an inner join developed on a 1-node machine with keys 1, 2 and 3 coming from both the left and right input links. This will always find matches in a 1-node configuration. If one changes to a 2-node configuration and hashes on the key in one link but uses round-robin in the other, then it is likely that the keys are not distributed across the partitions identically, and some of the joins will therefore not match up.
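To make the inner-join example concrete, here is a minimal Python sketch of it. This is not DataStage code: the helper names are hypothetical, and the "hash" is just key modulo node count, which is an assumption for illustration and not the hashing algorithm the parallel engine actually uses. Each partition joins only the rows it received, which is the point of the example.

```python
def hash_partition(keys, nodes):
    """Send each key to a partition chosen by key % nodes (stand-in for hashing)."""
    parts = [[] for _ in range(nodes)]
    for key in keys:
        parts[key % nodes].append(key)
    return parts


def round_robin_partition(keys, nodes):
    """Send rows to partitions in arrival order, ignoring the key value."""
    parts = [[] for _ in range(nodes)]
    for position, key in enumerate(keys):
        parts[position % nodes].append(key)
    return parts


def inner_join(left_parts, right_parts):
    """Join partition by partition; a node only sees its own rows."""
    matched = []
    for left, right in zip(left_parts, right_parts):
        right_keys = set(right)
        matched.extend(key for key in left if key in right_keys)
    return sorted(matched)


keys = [1, 2, 3]

# 1 node: everything lands in the same partition, so all keys match.
one_node = inner_join(hash_partition(keys, 1), round_robin_partition(keys, 1))

# 2 nodes, hash on one link and round-robin on the other: keys end up apart.
two_node_mixed = inner_join(hash_partition(keys, 2), round_robin_partition(keys, 2))

# 2 nodes, same key-based partitioning on both links: matches are found again.
two_node_hashed = inner_join(hash_partition(keys, 2), hash_partition(keys, 2))

print(one_node, two_node_mixed, two_node_hashed)
```

With these particular keys the mixed case happens to lose every match; with other data only some rows would go missing, which is what makes the problem easy to overlook.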

This is a very common error and is the reason why I always advocate using a 2-node configuration in development, regardless of the machine size. A job that partitions correctly with 2 nodes will always work in a 1-node environment and in other multi-node configurations (at least with respect to partitioning).

BI-RMA · Post by **BI-RMA** » Wed Sep 26, 2012 9:18 am

Hi Arnd,

Mind you that I did not say the results will be the same. I said I can't think of a situation where going from one to two nodes will duplicate data.

Losing records is an entirely different matter, and I addressed this explicitly.

jwiles · Post by **jwiles** » Wed Sep 26, 2012 10:07 am

Selecting "Entire" partitioning somewhere for the main data stream will duplicate data when running in a multi-node configuration. An unusual--and normally invalid--choice for main data, but this is one situation that will produce duplicates.
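A rough Python sketch of why this happens, assuming only that "Entire" hands every partition a full copy of the input and that the partitions are later collected back into one stream. The function names are hypothetical and this is not how the engine is implemented.

```python
def entire_partition(rows, nodes):
    """Give every partition its own complete copy of the input rows."""
    return [list(rows) for _ in range(nodes)]


def collect(parts):
    """Gather the rows from all partitions back into a single stream."""
    return [row for part in parts for row in part]


rows = ["a", "b", "c"]

# 1 node: one copy in, one copy out, so the problem stays hidden.
out_one_node = collect(entire_partition(rows, 1))

# 2 nodes: each node processes the full set, so every row comes out twice.
out_two_nodes = collect(entire_partition(rows, 2))

print(len(out_one_node), len(out_two_nodes))
```

The row count scales with the number of nodes, which is why the same job looks fine on a single-node configuration.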

Regards,

ArndW · Post by **ArndW** » Wed Sep 26, 2012 10:39 am

BI-RMA wrote:...I said I can't think of a situation where going from one to two nodes will duplicate data...

The only one I can think of was already mentioned, an ill-placed "entire" partitioning will do that.