Repartitioning

vamsi.4a6 · Post by **vamsi.4a6** » Mon Jun 23, 2014 6:38 am

I searched Google and the Redbooks for information about repartitioning in DataStage but found nothing. I have heard about partitioning and collecting algorithms. What is repartitioning in DataStage,
and when is it required? I came across this term in a discussion with a teammate.

ssnegi · Post by **ssnegi** » Mon Jun 23, 2014 7:09 am

Repartitioning data.

Repartitioning refers to changing how the data is partitioned from one stage to the next. This adversely affects performance because of the time taken to redistribute the rows, and should generally be avoided.

ArndW · Post by **ArndW** » Mon Jun 23, 2014 4:16 pm

Whoa - hold your horses and slow down... Repartitioning is not necessarily a bad thing to be avoided. While it does cause overhead, it is sometimes necessary in order to process data correctly.

When you are processing in parallel, for example with 3 parallel streams each handling 1/3 of the data (let's say using a round-robin partitioning algorithm), you normally wouldn't need to repartition the data. But if you need to include the minimum value of column "B" in your output, then those 3 streams of data would need to be repartitioned on column "B" in order to determine the minimum; otherwise each stream would compute a minimum over only the records that it processes.
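To make the point concrete, here is a rough sketch in plain Python (not DataStage code). It assumes the minimum of "B" is wanted per value of a hypothetical grouping column "A", and it uses a simple modulus on that integer key to pick the target partition. The function names and the sample rows are made up for illustration.

```python
# Illustrative only: plain Python lists stand in for parallel partitions.
# Column "A" (the grouping key) and all function names are hypothetical.

def round_robin(rows, n):
    """Deal rows out to n partitions in turn, ignoring their content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def repartition_on_key(parts, key, n):
    """Send every row to the partition chosen by its key (modulus on an integer key)."""
    out = [[] for _ in range(n)]
    for part in parts:
        for row in part:
            out[row[key] % n].append(row)
    return out

def min_b_by_a(part):
    """Minimum of column B for each value of A, as seen by one partition only."""
    result = {}
    for row in part:
        a, b = row["A"], row["B"]
        result[a] = min(result.get(a, b), b)
    return result

rows = [
    {"A": 1, "B": 50}, {"A": 2, "B": 40}, {"A": 1, "B": 10},
    {"A": 2, "B": 70}, {"A": 1, "B": 30}, {"A": 2, "B": 20},
]

# Round-robin: each partition only sees its own share of every key.
rr_parts = round_robin(rows, 3)
rr_minima = [min_b_by_a(p) for p in rr_parts]

# Repartitioned on "A": all rows for a given key sit in the same partition.
keyed_parts = repartition_on_key(rr_parts, "A", 3)
keyed_minima = [min_b_by_a(p) for p in keyed_parts]

print("round-robin :", rr_minima)
print("keyed       :", keyed_minima)
```

With round-robin partitioning the three partitions report different minima for the same value of "A", and only one of them happens to be right. After repartitioning on the key, each value of "A" lives in exactly one partition, so the minimum computed there is the true one.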

Often this repartitioning can be minimized in an optimized job design, but rarely can one avoid repartitioning in a complex parallel job.

ray.wurlod · Post by **ray.wurlod** » Mon Jun 23, 2014 4:43 pm

Repartitioning theoretically requires moving rows to other nodes (partitions). On MPP/cluster/grid environments this may mean moving rows over the network (TCP/IP), which degrades performance somewhat. However, in an SMP ("share everything") environment, one copy of the data is kept in shared memory, so there is no actual movement of rows over the network.
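As a back-of-the-envelope illustration of that cost, the sketch below (plain Python, not DataStage code) counts how many rows would land in a different partition when round-robin data is repartitioned on a key. The column name "A", the modulus rule and the sample rows are assumptions made for the example. On an MPP/cluster/grid configuration, the rows counted here are the ones that might have to cross the network.

```python
# Illustrative only: count rows whose partition changes on repartitioning.
# The key column "A" and the modulus rule are assumptions for this sketch.

def rows_moved(rows, n, key):
    """Count rows whose round-robin partition differs from their keyed partition."""
    moved = 0
    for i, row in enumerate(rows):
        source = i % n          # partition assigned by round-robin
        target = row[key] % n   # partition assigned by modulus on the key
        if source != target:
            moved += 1
    return moved

rows = [
    {"A": 1, "B": 50}, {"A": 2, "B": 40}, {"A": 1, "B": 10},
    {"A": 2, "B": 70}, {"A": 1, "B": 30}, {"A": 2, "B": 20},
]

moved = rows_moved(rows, 3, "A")
print(f"{moved} of {len(rows)} rows change partition")
```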

Repartitioning, like sorting, comes at a cost but is sometimes necessary.