Performance of repartitioning

Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Performance of repartitioning

Post by Gazelle »

Problem:
If we run the batch flow with a different number of nodes, some jobs fail with the message:
There are irreconcilable constraints on the number of
partitions of an operator: parallel copyPlaceholder05.
The number of partitions is already constrained to 2,
but an eSame partitioned input virtual dataset produced by
parallel filterObsoleteMetrics has 8.


Resolution:
We fixed the problem by setting "Preserve partitioning" to Clear on stages that did not use the Auto partition type.

e.g. The Join stages use the Hash partitioning method and were using the default "Propagate" setting to pass their partitioning to the next stage. It was the subsequent stages (filterObsoleteMetrics, then copyPlaceholder05) that reported the error message. After we changed the "Preserve partitioning" setting on the Join stage from "Propagate" to "Clear", the error message no longer appeared.

Question:
Is there a significant overhead in using "Clear" instead of "Propagate"?
My initial tests show no difference in elapsed time (joining 20K rows to 25K rows on 2 nodes), but I find that difficult to believe.

Do people have a better solution that enables us to change the number of nodes?
I notice that in another thread, Ray suggests "better management", but I'm not sure what that means in practice.
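
For context, we change the node count per run by pointing jobs at a different parallel configuration file. A minimal sketch, assuming $APT_CONFIG_FILE has been added as a job parameter (the paths, project, and job names below are placeholders, not our actual layout):

    # Run the batch flow on 2 nodes (placeholder paths/names)
    dsjob -run -param '$APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/2node.apt' myproject myjob

    # Re-run on 8 nodes by swapping the configuration file
    dsjob -run -param '$APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/8node.apt' myproject myjob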
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Re: Performance of repartitioning

Post by SURA »

Yes, it may matter when the record volume is in the millions or billions.

For some reason I tried it in a job, and found a good difference compared with the same job running with Auto partitioning.

DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Repartitioning in an SMP environment is effectively cost-free, since it's all done through shared memory. It's where there's more than one fastname in the configuration file that the costs kick in: TCP connections have to be established between producer and consumer player processes, and data is transferred at network speeds rather than at memory speeds.
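
For illustration, a minimal two-node configuration file in which both logical nodes declare the same fastname, so repartitioning between them stays in shared memory (the hostname and resource paths are placeholders):

    {
        node "node1"
        {
            fastname "serverA"
            pools ""
            resource disk "/data/ds/node1" {pools ""}
            resource scratchdisk "/scratch/node1" {pools ""}
        }
        node "node2"
        {
            fastname "serverA"
            pools ""
            resource disk "/data/ds/node2" {pools ""}
            resource scratchdisk "/scratch/node2" {pools ""}
        }
    }

If node2 instead declared a different fastname (a second physical host), the same repartitioning would travel over TCP at network speeds.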
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Thanks Ray.
Yes, we have an SMP environment.

Further investigation showed that the job that failed was appending to a dataset that had "Preserve Partitioning" set to true.
i.e. running
    orchadmin lp MetricsObsolete.ds 2>/dev/null | grep "Preserve"
returns
    Preserve Partitioning: true

So a better solution is to change the "Preserve Partitioning" setting from Propagate to Clear on the output link that writes the dataset.
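
To audit this across many datasets, the same check can be scripted. A sketch using the orchadmin lp | grep check from above (the directory path is a placeholder):

    # Report the preserve-partitioning flag for every dataset in a directory
    for ds in /data/datasets/*.ds; do
        echo "$ds:"
        orchadmin lp "$ds" 2>/dev/null | grep "Preserve"
    done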
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

25K records is not large enough to notice any difference, even if there were going to be one.
Choose a job you love, and you will never have to work a day in your life. - Confucius
Gazelle
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

If it were a Cartesian join then we'd have 500,000,000 rows. :)

Yes, thanks qt_ky. SURA made that point too.
The 20K was just some test data I had lying around. It is also indicative of the numbers we can expect in production, so you're right that in this case, Clearing the partitioning will have negligible effect.

However, I really wanted a general answer on the performance hit of repartitioning (so we can decide what to do with other jobs that have higher volumes).
Ray's answered this.

For now, since repartitioning will have negligible effect in environments that share memory, we'll just clear partitioning before writing datasets. If we start hitting performance problems we can tweak things then.

I'll mark this one as "resolved".