Changing OPT_CONFIG_FILE and Duplicate data

itsvarunm
Participant
Posts: 10
Joined: Fri Aug 31, 2012 6:33 am

Changing OPT_CONFIG_FILE and Duplicate data

Post by itsvarunm »

Hi,

I created my DS jobs without any explicit partitioning, on a single-node configuration. If I change these jobs to run in a multi-node environment, is there any chance of duplicates being generated in the data? I remember facing such an issue some time back in another project. Please advise.

Thanks,
Varun
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

It is almost impossible to construct a situation where a change from a single-node configuration to a multi-node configuration will produce duplicates. A single row is sent downstream to either one partition or the other, but never to more than one. Auto-partitioning will usually handle all sorting and repartitioning requirements, but in some cases it may not choose the most efficient partitioning strategy.

In multi-node configurations it is much more likely that you lose rows, for example by processing matching keys on different nodes in a join operation. This is something you have to be very careful about when switching from "auto" to specific partitioning options.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
ulab
Participant
Posts: 56
Joined: Mon Mar 16, 2009 4:58 am
Location: bangalore

Post by ulab »

Hi Varun,

Which partitioning method did you use in your job?
Ulab----------------------------------------------------
help, it helps you today or Tomorrow
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I disagree with BI-RMA. When going from a 1-node configuration (and design) to a multi-node configuration there are several conditions where the results between the two environments may be different. The outcome depends upon factors such as the hashing algorithms chosen and the settings of "APT_NO_PART_INSERTION", "APT_NO_SORT_INSERTION" and "APT_NOPARTSORT_OPTIMIZATION".

Stages with 2 or more input links are the ones most often affected. The most common among these are the "join" and "lookup" stages, where problems show up as dropped records or additional records, depending upon the situation.

Take an inner join developed on a 1-node machine with keys 1, 2 and 3 coming from both the left and right input links. This will always find matches in a 1-node configuration. If one changes to a 2-node configuration and hashes on the key in one link while using round-robin in the other, then it is likely that the keys are not distributed across the partitions identically, and therefore some of the joins will not match up.
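The join example above can be imitated outside DataStage with a small Python sketch. This is only a simulation under stated assumptions: `key % nodes` stands in for the engine's real hash function, and the function names (`hash_partition`, `round_robin_partition`, `partitioned_inner_join`) are made up for illustration, not DataStage APIs.

```python
from collections import defaultdict


def hash_partition(keys, nodes):
    """Assign each key to a node by key value (stand-in for a real hash)."""
    parts = defaultdict(list)
    for key in keys:
        parts[key % nodes].append(key)
    return parts


def round_robin_partition(keys, nodes):
    """Assign each key to a node by arrival order, ignoring its value."""
    parts = defaultdict(list)
    for position, key in enumerate(keys):
        parts[position % nodes].append(key)
    return parts


def partitioned_inner_join(left_parts, right_parts, nodes):
    """Join per node: a left key only sees right keys on the same node."""
    matched = []
    for node in range(nodes):
        right_keys = set(right_parts[node])
        matched.extend(k for k in left_parts[node] if k in right_keys)
    return sorted(matched)


keys = [1, 2, 3]

# 1 node: everything lands in the same partition, so every key matches.
one_node = partitioned_inner_join(
    hash_partition(keys, 1), round_robin_partition(keys, 1), 1
)

# 2 nodes: hash on the left link, round-robin on the right link.
two_node = partitioned_inner_join(
    hash_partition(keys, 2), round_robin_partition(keys, 2), 2
)

print(one_node)  # [1, 2, 3]
print(two_node)  # []
```

With this toy hash, every match is lost on 2 nodes; with other key sets or orderings only some matches would be lost, which makes the problem harder to spot.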

This is a very common error, and it is the reason why I always advocate using a 2-node configuration in development regardless of the machine size. A job that partitions correctly with 2 nodes will always work in a 1-node environment and in other multi-node configurations (at least with respect to partitioning).
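As a rough illustration of that point, the sketch below partitions both input links with the same key-based function and runs the same inner join on 1, 2 and 4 nodes. It is a simulation, not DataStage code: `key % nodes` is an assumed stand-in for the real hash partitioner, and `hash_partition` and `join_per_node` are hypothetical names.

```python
def hash_partition(keys, nodes):
    """Place each key on a node chosen from the key value alone."""
    parts = [[] for _ in range(nodes)]
    for key in keys:
        parts[key % nodes].append(key)
    return parts


def join_per_node(left, right, nodes):
    """Hash-partition both links identically, then inner-join node by node."""
    left_parts = hash_partition(left, nodes)
    right_parts = hash_partition(right, nodes)
    matched = []
    for left_rows, right_rows in zip(left_parts, right_parts):
        right_keys = set(right_rows)
        matched.extend(k for k in left_rows if k in right_keys)
    return sorted(matched)


left = [1, 2, 3, 4, 5, 6]
right = [2, 3, 5, 7]

# Same result for every node count, because equal keys always meet
# on the same node when both links use the same key-based partitioner.
results = {nodes: join_per_node(left, right, nodes) for nodes in (1, 2, 4)}
print(results)  # {1: [2, 3, 5], 2: [2, 3, 5], 4: [2, 3, 5]}
```

The node count changes where rows are processed, but not which rows match, as long as both links are partitioned on the join key in the same way.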
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi Arnd,

Mind you, I did not say the results will be the same. I said I can't think of a situation where going from one to two nodes will duplicate data.

Losing records is an entirely different matter, and I addressed this explicitly.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Selecting "Entire" partitioning somewhere for the main data stream will duplicate data when running in a multi-node configuration. An unusual--and normally invalid--choice for main data, but this is one situation that will produce duplicates.
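A small Python simulation shows why this goes unnoticed on one node. It assumes that "Entire" hands every node a full copy of the input and that the partitions are simply collected back together downstream; `entire_partition`, `hash_partition` and `collect` are illustrative names, and `row % nodes` is a stand-in for a real hash.

```python
def entire_partition(rows, nodes):
    """Every node receives a complete copy of the input rows."""
    return [list(rows) for _ in range(nodes)]


def hash_partition(rows, nodes):
    """Every row goes to exactly one node (toy hash: row % nodes)."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[row % nodes].append(row)
    return parts


def collect(parts):
    """Gather all partitions back into a single sorted stream."""
    return sorted(row for part in parts for row in part)


rows = [1, 2, 3]

entire_on_1 = collect(entire_partition(rows, 1))
entire_on_2 = collect(entire_partition(rows, 2))
hash_on_2 = collect(hash_partition(rows, 2))

print(entire_on_1)  # [1, 2, 3]
print(entire_on_2)  # [1, 1, 2, 2, 3, 3]
print(hash_on_2)    # [1, 2, 3]
```

On a single node "Entire" is indistinguishable from any other method, so the duplicates only appear once the job runs on more than one node.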

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

BI-RMA wrote:...I said I can't think of a situation where going from one to two nodes will duplicate data...
The only one I can think of was already mentioned, an ill-placed "entire" partitioning will do that.