Changing from Single node to Multi node Configuration
Hello,
We are migrating from 7.x to 8.x, and our jobs run on a single node in the 7.5 environment. Now we need to move them to multiple nodes in 8.x. Can anyone please outline the steps to take when making the change in the configuration file?
1) We have "Auto" defined for all the partitioning criteria. Does that mean DataStage will take care of everything even if we change to multiple nodes? (Which I don't think is the case.)
2) Will Lookup, Join and Merge stages where "Auto" is the default partitioning method on the input links yield proper results if we run on multiple nodes without changing the partitioning method?
3) Is there any checklist that should be followed?
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
If you are changing the default configuration file from single node to multiple nodes, then you do not need to add APT_CONFIG_FILE to all your jobs; otherwise you need to add the configuration file environment variable to each job.
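For reference, a two-node configuration file is usually shaped like the sketch below. The hostname and resource paths here are placeholders for illustration; use the values from your own installation. A single-node file simply has one node block instead of two.

```
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
}
```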
For Lookup you need not change the Auto partitioning method unless you want Entire partitioning on the reference link.
For Join and Merge stages, if an input link is Auto partitioned, DataStage automatically inserts tsort operators and hash partitioning on the input links. You can verify this by enabling APT_DUMP_SCORE.
But to avoid repartitioning and other partitioning issues, it is recommended that you explicitly specify the correct partitioning according to your job design.
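To see why hash partitioning is what Auto inserts in front of key-based stages, here is a toy model in plain Python (an illustration only, not DataStage's actual hash function): rows with equal key values always land in the same partition, so a partition-by-partition join or merge still finds every match.

```python
# Toy model of hash partitioning across N nodes: every row with the
# same key value is routed to the same partition, so key-based
# operators (Join, Merge, Remove Duplicates) can work per partition.

def hash_partition(rows, key, nodes):
    """Distribute rows into `nodes` partitions by hashing the key column."""
    partitions = [[] for _ in range(nodes)]
    for row in rows:
        partitions[hash(row[key]) % nodes].append(row)
    return partitions

left = [{"id": i, "side": "L"} for i in range(6)]
right = [{"id": i, "side": "R"} for i in range(6)]

# Partition both inputs the same way, then join partition by partition.
matches = 0
for lp, rp in zip(hash_partition(left, "id", 2), hash_partition(right, "id", 2)):
    right_ids = {r["id"] for r in rp}
    matches += sum(1 for l in lp if l["id"] in right_ids)

print(matches)  # all 6 ids match, because equal keys share a partition
```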
Thanks
Prasoon
ETL Consultant
LinkedIn :- http://www.linkedin.com/profile/view?id ... ab_pro_top
Blog:- http://dsshar.blogspot.com/
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Hello Ray,
Thanks for the response... The jobs running on a single node were never configured or designed to run on multiple nodes (e.g. the partitioning option was just left at the default, "Auto"). So do you think they will still yield good test results if we convert to a multi-node configuration?
Hey Chulett,
Thanks for the response, but as my account is yet to be activated I couldn't see your full answer.
-
- Premium Member
- Posts: 3
- Joined: Fri Apr 22, 2011 3:12 pm
- Location: NW Arkansas
- Contact:
chulett, can you please provide more details on what specific surprises will result from this? I am in a similar situation where all jobs are single node and we are planning to change them all to a minimum of 2 nodes and adjust some higher-volume jobs accordingly. I understand the problems are usually around partitioning, but do you have any specific examples or details of what the issue would look like?
Thanks.
Jonathan Beckford
Teradata DI Center Of Excellence
In most situations auto-partitioning will handle partitioning correctly when changing from a one-node to a multi-node configuration.
There are definitely situations where the default partitioning strategy used by DataStage will not yield the best result in terms of performance.
But there is also the risk that you change the content of a column within a job flow. If this column was used for key-based partitioning before that point, DataStage may not recognise that repartitioning is necessary before the column is used as the key in another operator that needs key-based partitioning.
There are a number of other scenarios where DataStage may be mistaken about the correct partitioning strategy. The results may be correct for 99% of the jobs, but the last percent can be very hard to identify when you simply switch from one to two nodes, because you will hardly see any warnings in the logs when errors occur.
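The silent failure described above is easy to reproduce in miniature. In the toy Python sketch below (an illustration only, not DataStage internals), both inputs arrive round-robin partitioned and a key-based join runs with Same partitioning, i.e. without repartitioning. On one node everything matches; on two nodes every match is lost, and nothing raises an error.

```python
def round_robin(rows, nodes):
    """Deal rows out to partitions in turn, ignoring any key."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def join_per_partition(left_parts, right_parts, key):
    """'Same' partitioning: join each partition against its counterpart only."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        rkeys = {r[key] for r in rp}
        out.extend(l for l in lp if l[key] in rkeys)
    return out

left = [{"id": i} for i in range(6)]
right = [{"id": i} for i in reversed(range(6))]  # same keys, different order

# One node: everything sits in a single partition, so the join is complete.
print(len(join_per_partition(round_robin(left, 1), round_robin(right, 1), "id")))  # 6

# Two nodes: matching keys end up on different partitions; rows silently vanish.
print(len(join_per_partition(round_robin(left, 2), round_robin(right, 2), "id")))  # 0
```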
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
-
- Premium Member
- Posts: 3
- Joined: Fri Apr 22, 2011 3:12 pm
- Location: NW Arkansas
- Contact:
Thanks for the response Roland.
So in summary it sounds like single-node jobs that are set to Auto partition prior to conversion only present the issue of performance degradation due to repartitioning after converting to multi-node.
If a specific partitioning method is being used (e.g. hash partitioning) in the single-node job, then on conversion to multi-node you run the risk that DataStage will make an incorrect decision (won't repartition) in situations where it should, such as when using the Merge stage, resulting in indeterminate results.
Overall, the best approach is to determine the parallelism and partitioning method of each job on a case-by-case basis. But where a mass change is required, you should mass-change only the jobs set to Auto partition, and then manually change and test all other jobs that use a specific partitioning method. Of course, monitoring would be key after either change. Does this sound like a reasonable approach?
Jonathan Beckford
Teradata DI Center Of Excellence
Sorry to confuse you, but no.
Hash partitioning is also the default partitioning strategy used by DataStage in preparation for Join, Merge, Change Capture, Remove Duplicates and similar operations.
So it does not matter whether you specified the partitioning method manually or DataStage chose it automatically; that is the crux. If the content of a key column is transformed without changing the column name, the hash values of the rows change. But DataStage may not realise this and choose Same partitioning instead of repartitioning the data.
And, as I said, this is just one situation where DataStage may be in error. This is why it is best practice to develop all parallel jobs in a two-node configuration and test them thoroughly. You can't monitor a thousand parallel jobs closely after a mass change, and you may not even know what to look for in the logs.
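The key-transformation case can also be shown in miniature. In this toy Python sketch (an illustration only; the +1 remap stands in for any hypothetical transformation of the key), the source is hash partitioned on its key, the key is then rewritten in flight while the rows stay on their old partitions, and the merge against a reference partitioned on the new values runs with Same partitioning.

```python
def hash_partition(rows, key, nodes):
    """Distribute rows into `nodes` partitions by hashing the key column."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[key]) % nodes].append(row)
    return parts

def transform_key_in_place(parts, key):
    """Rewrite the key column after partitioning, e.g. a surrogate-key remap.
    The rows stay on their old partitions ('Same' partitioning)."""
    for p in parts:
        for row in p:
            row[key] = row[key] + 1  # hypothetical transformation
    return parts

def merged_count(nodes):
    src = hash_partition([{"k": i} for i in range(8)], "k", nodes)
    src = transform_key_in_place(src, "k")              # keys are now 1..8
    ref = hash_partition([{"k": i + 1} for i in range(8)], "k", nodes)
    found = 0
    for sp, rp in zip(src, ref):
        rkeys = {r["k"] for r in rp}
        found += sum(1 for s in sp if s["k"] in rkeys)
    return found

print(merged_count(1))  # 8: a single partition hides the problem
print(merged_count(2))  # 0: every transformed key now sits on the wrong partition
```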
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
-
- Premium Member
- Posts: 3
- Joined: Fri Apr 22, 2011 3:12 pm
- Location: NW Arkansas
- Contact:
So now I am confused, but let me attempt to restate what you said using a specific example, and maybe that will clear things up.
Today I create a job that uses the Merge stage to join three different sources of data (datasets) and then loads into a target table. Because of the requirements of the Merge stage, I explicitly hash partition my source data, pre-sort, and remove duplicates. I run this using a single-node config file and proceed to roll the job out to production.
A month later I decide the job is running too slowly, and to speed it up I simply change the config file parameter to the 2-node config file rather than the 1-node one.
If I did this without testing the results first, I think you are saying it is possible that one or more of my data sources will not be partitioned correctly, because DataStage will assume that Same partitioning is appropriate (even though I explicitly selected hash partitioning). To elaborate, if the data in the source dataset happened to be partitioned using the round-robin method (by a prior process), I could drop records unintentionally because the data is no longer repartitioned prior to the Merge. Correct?
The same would be true if the original job had been set to Auto partition: DataStage could incorrectly choose Same partitioning. Correct?
The fact that DataStage makes optimization changes even when you explicitly select a partitioning method seems like a big problem. Do you know whether later versions of DataStage (i.e. 8.7 or 9.1) still have this bug/feature?
As you stated, this is one example of several, so in summary it sounds like it's never a good idea to mass-change the parallelism of jobs from single node to multi-node...
Jonathan Beckford
Teradata DI Center Of Excellence
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Yes, this is what I - and Craig before me - stated: it is not a good idea to mass-change from single-node to multi-node processing.
Problems that arise from the change are rare, but they are very difficult to identify and debug.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon