Performance of repartitioning
Posted: Thu Feb 09, 2012 7:46 pm
Problem:
If we run the batchflow with a different number of nodes, some jobs fail with the message:
There are irreconcilable constraints on the number of
partitions of an operator: parallel copyPlaceholder05.
The number of partitions is already constrained to 2,
but an eSame partitioned input virtual dataset produced by
parallel filterObsoleteMetrics has 8.
Resolution:
We fixed the problem by setting "Preserve partitioning" to Clear on stages that did not use the Auto "Partition type".
For example, the Join stages use the Hash partitioning method and were using the default Propagate setting, which passes the partitioning on to the next stage. The subsequent stages (filterObsoleteMetrics, then copyPlaceholder05) were the ones that reported the error message. After we changed the "Preserve partitioning" setting on the Join stage from "Propagate" to "Clear", the error message no longer appeared.
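To picture why the partition count matters here, this is a rough Python sketch of hash partitioning in general. It is not DataStage code and makes no claim about the engine's internals: the function names `hash_partition` and `rows_that_move` are made up for illustration, and `zlib.crc32` is only a stand-in for whatever hash function the engine uses. It shows that a row's partition depends on the partition count, so data laid out for 8 partitions cannot simply be handed as-is to a stage constrained to 2.

```python
import zlib
from collections import defaultdict


def _key_hash(value):
    # Stand-in hash (assumption): crc32 of the key's string form.
    return zlib.crc32(str(value).encode("utf-8"))


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its key column."""
    parts = defaultdict(list)
    for row in rows:
        parts[_key_hash(row[key]) % num_partitions].append(row)
    return parts


def rows_that_move(rows, key, old_n, new_n):
    """Count rows whose partition index differs between old_n and new_n partitions."""
    return sum(
        1
        for row in rows
        if _key_hash(row[key]) % old_n != _key_hash(row[key]) % new_n
    )


if __name__ == "__main__":
    rows = [{"id": i, "metric": i * 10} for i in range(100)]
    eight = hash_partition(rows, "id", 8)
    two = hash_partition(rows, "id", 2)
    print("partitions used with 8:", sorted(eight))
    print("partitions used with 2:", sorted(two))
    print("rows landing on a different partition index:", rows_that_move(rows, "id", 8, 2))
```

In this sketch, going from 8 partitions to 2 means some rows end up under a different partition index, which is the kind of redistribution a downstream stage is free to do once the upstream stage no longer asks for its partitioning to be preserved.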
Question:
Is there a significant overhead in using "Clear" instead of "Propagate"?
My initial tests show no difference in elapsed time (joining 20K rows to 25K rows on 2 nodes), but I find that difficult to believe.
Does anyone have a better solution that lets us change the number of nodes?
I notice that in another thread, Ray suggests "better management", but I'm not sure what that means in practice.