Changing from Single node to Multi node Configuration
Hello,
We are migrating from 7.x to 8.x, and our jobs run on a single node in the 7.5 environment. Now we need to move them to multiple nodes in 8.x. Can anyone please outline the steps to take when making the change in the configuration file?
1) We have "Auto" defined for all the partitioning criteria. Does that mean DataStage will take care of everything even if we change to multiple nodes? (Which I don't think is the case.)
2) Will Lookup, Join and Merge stages where "Auto" is the default partitioning method on the input links yield proper results if we run on multiple nodes without changing the partitioning method?
3) Is there any checklist that should be followed?
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
If you are changing the default configuration file from single node to multiple nodes, then you do not need to add APT_CONFIG_FILE to all your jobs; otherwise you need to add the configuration file environment variable to each job.
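For reference, a two-node configuration file is usually shaped like the sketch below. The hostname and resource paths here are placeholders for illustration; use the values from your own installation. A single-node file simply has one node block instead of two.

```
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
}
```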
For Lookup you need not change the Auto partitioning method unless you want Entire partitioning on the reference link.
For Join and Merge stages, if an input link is Auto partitioned, DataStage automatically inserts tsort operators and hash partitioning on the input links. You can verify this by enabling APT_DUMP_SCORE.
But to avoid repartitioning and other partitioning issues, it is recommended that you explicitly specify the correct partitioning according to your job design.
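To see why hash partitioning is what Auto inserts in front of key-based stages, here is a toy model in plain Python (an illustration only, not DataStage's actual hash function): rows with equal key values always land in the same partition, so a partition-by-partition join or merge still finds every match.

```python
# Toy model of hash partitioning across N nodes: every row with the
# same key value is routed to the same partition, so key-based
# operators (Join, Merge, Remove Duplicates) can work per partition.

def hash_partition(rows, key, nodes):
    """Distribute rows into `nodes` partitions by hashing the key column."""
    partitions = [[] for _ in range(nodes)]
    for row in rows:
        partitions[hash(row[key]) % nodes].append(row)
    return partitions

left = [{"id": i, "side": "L"} for i in range(6)]
right = [{"id": i, "side": "R"} for i in range(6)]

# Partition both inputs the same way, then join partition by partition.
matches = 0
for lp, rp in zip(hash_partition(left, "id", 2), hash_partition(right, "id", 2)):
    right_ids = {r["id"] for r in rp}
    matches += sum(1 for l in lp if l["id"] in right_ids)

print(matches)  # all 6 ids match, because equal keys share a partition
```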
Thanks
Prasoon
ETL Consultant
LinkedIn :- http://www.linkedin.com/profile/view?id ... ab_pro_top
Blog:- http://dsshar.blogspot.com/
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Hello Ray,
Thanks for the response... The jobs running on a single node were never configured or designed to run on multiple nodes (e.g. the partitioning option was just left at the default, "Auto"). So do you think they will still yield good test results if we convert to a multi-node configuration?
Hey Chulett,
Thanks for the response, but as my account is yet to be activated I couldn't see your full answer.
-
- Premium Member
- Posts: 3
- Joined: Fri Apr 22, 2011 3:12 pm
- Location: NW Arkansas
- Contact:
chulett, can you please provide more details on what specific surprises will result from this? I am in a similar situation where all jobs are single node and we are planning to change them all to a minimum of 2 nodes and adjust some higher-volume jobs accordingly. I understand the problems are usually around partitioning, but do you have any specific examples or details of what the issue would look like?
Thanks.
Jonathan Beckford
Teradata DI Center Of Excellence
In most situations auto-partitioning will handle partitioning correctly when changing from a one-node to a multi-node configuration.
There are definitely situations where the default partitioning strategy used by DataStage will not yield the best result in terms of performance.
But there is also the risk that you change the content of a column within a job flow. If this column was used for key-based partitioning before that point, DataStage may not recognise that repartitioning is necessary before the column is used as the key in another operator that needs key-based partitioning.
There are a number of other scenarios where DataStage may be mistaken about the correct partitioning strategy. The results may be correct for 99% of the jobs, but the last percent can be very hard to identify when you simply switch from one to two nodes, because you will hardly see any warnings in the logs when errors occur.
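The silent failure described above is easy to reproduce in miniature. In the toy Python sketch below (an illustration only, not DataStage internals), both inputs arrive round-robin partitioned and a key-based join runs with Same partitioning, i.e. without repartitioning. On one node everything matches; on two nodes every match is lost, and nothing raises an error.

```python
def round_robin(rows, nodes):
    """Deal rows out to partitions in turn, ignoring any key."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def join_per_partition(left_parts, right_parts, key):
    """'Same' partitioning: join each partition against its counterpart only."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        rkeys = {r[key] for r in rp}
        out.extend(l for l in lp if l[key] in rkeys)
    return out

left = [{"id": i} for i in range(6)]
right = [{"id": i} for i in reversed(range(6))]  # same keys, different order

# One node: everything sits in a single partition, so the join is complete.
print(len(join_per_partition(round_robin(left, 1), round_robin(right, 1), "id")))  # 6

# Two nodes: matching keys end up on different partitions; rows silently vanish.
print(len(join_per_partition(round_robin(left, 2), round_robin(right, 2), "id")))  # 0
```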
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
-
- Premium Member
- Posts: 3
- Joined: Fri Apr 22, 2011 3:12 pm
- Location: NW Arkansas
- Contact:
Thanks for the response Roland.
So in summary it sounds like single-node jobs that are set to Auto partition prior to conversion only present the issue of performance degradation due to repartitioning after converting to multi-node.
If a specific partitioning method is being used (e.g. hash partitioning) in the single-node job, then on conversion to multi-node you run the risk that DataStage will make an incorrect decision (won't repartition) in situations where it should, such as when using the Merge stage, resulting in indeterminate results.
Overall, the best approach is to determine the parallelism and partitioning method of each job on a case-by-case basis. But where a mass change is required, you should mass-change only the jobs set to Auto partition, and then manually change and test all other jobs that use a specific partitioning method. Of course, monitoring would be key after either change. Does this sound like a reasonable approach?
Jonathan Beckford
Teradata DI Center Of Excellence
Sorry to confuse you, but no.
Hash partitioning is also the default partitioning strategy used by DataStage in preparation for Join, Merge, Change Capture, Remove Duplicates and similar operations.
So it does not matter whether you specified the partitioning method manually or DataStage chose it automatically; that is the crux. If the content of a key column is transformed without changing the column name, the hash values of the rows change. But DataStage may not realise this and choose Same partitioning instead of repartitioning the data.
And, as I said, this is just one situation where DataStage may be in error. This is why it is best practice to develop all parallel jobs in a two-node configuration and test them thoroughly. You can't monitor a thousand parallel jobs closely after a mass change, and you may not even know what to look for in the logs.
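The key-transformation case can also be shown in miniature. In this toy Python sketch (an illustration only; the +1 remap stands in for any hypothetical transformation of the key), the source is hash partitioned on its key, the key is then rewritten in flight while the rows stay on their old partitions, and the merge against a reference partitioned on the new values runs with Same partitioning.

```python
def hash_partition(rows, key, nodes):
    """Distribute rows into `nodes` partitions by hashing the key column."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[key]) % nodes].append(row)
    return parts

def transform_key_in_place(parts, key):
    """Rewrite the key column after partitioning, e.g. a surrogate-key remap.
    The rows stay on their old partitions ('Same' partitioning)."""
    for p in parts:
        for row in p:
            row[key] = row[key] + 1  # hypothetical transformation
    return parts

def merged_count(nodes):
    src = hash_partition([{"k": i} for i in range(8)], "k", nodes)
    src = transform_key_in_place(src, "k")              # keys are now 1..8
    ref = hash_partition([{"k": i + 1} for i in range(8)], "k", nodes)
    found = 0
    for sp, rp in zip(src, ref):
        rkeys = {r["k"] for r in rp}
        found += sum(1 for s in sp if s["k"] in rkeys)
    return found

print(merged_count(1))  # 8: a single partition hides the problem
print(merged_count(2))  # 0: every transformed key now sits on the wrong partition
```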
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
-
- Premium Member
- Posts: 3
- Joined: Fri Apr 22, 2011 3:12 pm
- Location: NW Arkansas
- Contact:
So now I am confused, but let me attempt to restate what you said using a specific example, and maybe that will clear things up.
Today I create a job that uses the Merge stage to join three different sources of data (datasets) and then loads into a target table. Because of the requirements of the Merge stage, I explicitly hash partition my source data, pre-sort, and remove duplicates. I run this using a single-node config file and proceed to roll the job out to production.
A month later I decide the job is running too slowly, and to speed it up I simply change the config file parameter to the 2-node config file rather than the 1-node one.
If I did this without testing the results first, I think you are saying it is possible that one or more of my data sources will not be partitioned correctly, because DataStage will assume that Same partitioning is appropriate (even though I explicitly selected hash partitioning). To elaborate, if the data in the source dataset happened to be partitioned using the round-robin method (by a prior process), I could drop records unintentionally because the data is no longer repartitioned prior to the Merge. Correct?
The same would be true if the original job had been set to Auto partition: DataStage could incorrectly choose Same partitioning. Correct?
The fact that DataStage makes optimization changes even when you explicitly select a partitioning method seems like a big problem. Do you know whether later versions of DataStage (i.e. 8.7 or 9.1) still have this bug/feature?
As you stated, this is one example of several, so in summary it sounds like it's never a good idea to mass-change the parallelism of jobs from single node to multi-node...
Jonathan Beckford
Teradata DI Center Of Excellence
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Yes, this is what I - and Craig before me - stated: it is not a good idea to mass-change from single-node to multi-node processing.
Problems that arise from the change are rare, but they are very difficult to identify and debug.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon