Repartitioning

vamsi.4a6
Participant
Posts: 334
Joined: Sun Jan 22, 2012 7:06 am

Post by vamsi.4a6 »

I searched Google and the Redbooks for information about repartitioning in DataStage but found nothing. I have heard about partitioning and collecting algorithms. What is repartitioning in DataStage, and when is it required? I came across the term in a discussion with a teammate.
Thanks and Regards
Vamsi krishna.v
http://datastage-vamsi.blogspot.in/
ssnegi
Participant
Posts: 138
Joined: Thu Nov 15, 2007 4:17 am
Location: Sydney, Australia

Post by ssnegi »

Repartitioning data.

Repartitioning refers to changing how the data is partitioned from one stage to the next. This adversely affects performance because of the time taken to redistribute the rows, and should generally be avoided.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Woah - hold your horses and slow down... Repartitioning is not necessarily a bad thing to be avoided. While it does cause overhead, it is sometimes necessary in order to process the data correctly.

When you are processing in parallel, for example with 3 parallel streams each handling 1/3 of the data (let's say using a round-robin partitioning algorithm), you normally wouldn't need to repartition the data. But if you need to include the minimum value of column "B" in your output, then those 3 streams of data would need to be repartitioned on column "B" in order to determine the minimum; otherwise each stream would get the minimum of only those records which it processes.

Often this repartitioning can be minimized in an optimized job design, but rarely can one avoid repartitioning in a complex parallel job.
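
To make the minimum example concrete, here is a rough sketch in plain Python rather than anything DataStage-specific. Lists stand in for partitions, and I have assumed a hypothetical grouping column "A" so that the minimum of "B" is wanted per value of "A"; the rows are then repartitioned on that grouping key. All function names and sample rows are made up for illustration.

```python
import zlib

def round_robin(rows, n):
    """Deal rows out to n partitions in turn."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(parts, key, n):
    """Repartition: rows with the same key value land in the same partition."""
    out = [[] for _ in range(n)]
    for part in parts:
        for row in part:
            # crc32 is used only to get a hash that is stable between runs
            target = zlib.crc32(str(row[key]).encode("utf-8")) % n
            out[target].append(row)
    return out

def min_per_group(part, key, value):
    """Minimum of `value` for each `key` value seen in one partition."""
    mins = {}
    for row in part:
        k = row[key]
        mins[k] = min(mins.get(k, row[value]), row[value])
    return mins

# Hypothetical sample data: "A" is the assumed grouping column.
rows = [
    {"A": "x", "B": 7}, {"A": "y", "B": 4}, {"A": "x", "B": 2},
    {"A": "y", "B": 9}, {"A": "x", "B": 5}, {"A": "y", "B": 1},
]

rr_parts = round_robin(rows, 3)
# Each round-robin stream only sees its own rows, so its minima are partial.
before = [min_per_group(p, "A", "B") for p in rr_parts]

hashed_parts = hash_partition(rr_parts, "A", 3)
# After repartitioning, each group lives in exactly one partition,
# so the per-partition minima can simply be combined.
after = {}
for p in hashed_parts:
    after.update(min_per_group(p, "A", "B"))

print(before)
print(after)
```

Before the repartition, each stream reports a different minimum for the same group; afterwards every group sits in one partition and the per-partition answers are the correct ones.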
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Repartitioning theoretically requires moving rows to other nodes (partitions). On MPP/cluster/grid environments this may mean moving rows over the network (TCP/IP), which degrades performance somewhat. However, in an SMP ("share everything") environment, one copy of the data is kept in shared memory, so there is no actual movement of rows over the network.

Repartitioning, like sorting, comes at a cost but is sometimes necessary.
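
As a rough way to picture that cost, the sketch below (plain Python, not DataStage code) counts how many rows would change partition when going from round robin to hash partitioning on a key. The function names and sample keys are hypothetical. On an MPP/cluster/grid layout each counted row is a candidate for a trip over the network; on SMP the same rows stay in shared memory.

```python
import zlib

def stable_hash(value):
    """Hash that is stable between runs (Python's built-in hash() of str is not)."""
    return zlib.crc32(str(value).encode("utf-8"))

def rows_moved(keys, n):
    """Count rows whose partition number changes when switching
    from round robin (row index mod n) to hash on the key."""
    moved = 0
    for i, key in enumerate(keys):
        if i % n != stable_hash(key) % n:
            moved += 1
    return moved

# Hypothetical key values for ten rows.
keys = ["c1", "c2", "c3", "c1", "c2", "c4", "c5", "c1", "c3", "c2"]

for n in (1, 2, 4):
    print(n, "partitions:", rows_moved(keys, n), "of", len(keys), "rows change partition")
```

With a single partition nothing can move, which is the trivial case; with more partitions the count grows, and that is the work a repartition has to pay for.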
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.