about usage of Link partitioners

pavanns · Post by **pavanns** » Thu Sep 29, 2005 3:39 pm

Hi,
Can link partitioning be done in server jobs, or only in parallel jobs? I'm just trying to practice with Link Partitioners and Link Collectors. When I try this, all the output from the Link Partitioner goes into one of the three Transformers that I have linked to it; the other two Transformers receive no input rows. Why is this happening? Am I correct in joining three Transformers in parallel to a Link Partitioner? Please throw some light on this.

pavanns · Post by **pavanns** » Thu Sep 29, 2005 3:49 pm

Let me add to the above query: what kind of partitioning is best, from a performance point of view, in this stage?
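
To make the question concrete, here is a minimal sketch in plain Python (not DataStage) of how two common partitioning schemes distribute rows across three links. The row layout, the `id` column, and the function names are hypothetical, and nothing here claims to be the cause of the behaviour in the job described above. It only shows that a content-independent scheme such as round-robin spreads rows evenly, while a key-based scheme such as modulus can send every row to one link when the key values are skewed.

```python
from collections import Counter
from itertools import cycle

def round_robin(rows, n_links):
    """Assign rows to links 0..n_links-1 in turn, ignoring row content."""
    links = cycle(range(n_links))
    return [(next(links), row) for row in rows]

def modulus(rows, n_links, key):
    """Assign each row to link (integer key % n_links)."""
    return [(key(row) % n_links, row) for row in rows]

def rows_per_link(assignments, n_links):
    """Count how many rows ended up on each link."""
    counts = Counter(link for link, _ in assignments)
    return [counts.get(i, 0) for i in range(n_links)]

# Hypothetical data: every key happens to be a multiple of 3.
rows = [{"id": i * 3} for i in range(9)]
rr_counts = rows_per_link(round_robin(rows, 3), 3)
mod_counts = rows_per_link(modulus(rows, 3, key=lambda r: r["id"]), 3)
print(rr_counts)   # [3, 3, 3]
print(mod_counts)  # [9, 0, 0]
```

With evenly distributed keys the modulus scheme would balance as well; the skew comes from the data, not from the scheme itself.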

pavanns · Post by **pavanns** » Thu Sep 29, 2005 4:33 pm

Please help me.

trokosz · Post by **trokosz** » Fri Sep 30, 2005 10:55 am

Yes, the Link Partitioner and Link Collector stages can be used in server jobs. But check out the IPC stage as well.

ray.wurlod · Post by **ray.wurlod** » Sat Oct 01, 2005 1:02 am

I do not like to use Link Partitioners to achieve partition parallelism (see Chapter 2 of Parallel Job Developer's Guide) in server jobs. A job that uses a link partitioner presumably splits one stream of data into several for parallel processing, and a link collector then has to gather them all back together into a single stream for writing.

Instead I would prefer to use a multi-instance job. In this way I am not bottlenecked on readers or writers. If necessary (for summarising across the entire set, for example), I might direct the various jobs' outputs into another job, perhaps using named pipes or some other IPC mechanism, in which that summarising could occur. But I'd probably use intermediate text files, cat them all together, and use that as the single input for the final job.
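
The intermediate-file approach can be sketched roughly as follows. This is plain Python standing in for the job instances, not DataStage; `run_instance`, the file names, and the upper-casing "transformation" are all hypothetical placeholders. The instances are run one after another here for simplicity, whereas real job instances would run concurrently.

```python
import os
import tempfile

def run_instance(instance, n_instances, rows, out_dir):
    """Hypothetical stand-in for one job instance: handle only this instance's rows."""
    path = os.path.join(out_dir, f"part_{instance}.txt")
    with open(path, "w") as out:
        for i, row in enumerate(rows):
            if i % n_instances == instance:
                out.write(row.upper() + "\n")  # placeholder transformation
    return path

def concatenate(paths, target):
    """Rough equivalent of `cat part_* > target`."""
    with open(target, "w") as out:
        for path in paths:
            with open(path) as part:
                out.write(part.read())

rows = ["a", "b", "c", "d", "e", "f"]
with tempfile.TemporaryDirectory() as tmp:
    parts = [run_instance(i, 3, rows, tmp) for i in range(3)]
    combined = os.path.join(tmp, "combined.txt")
    concatenate(parts, combined)
    with open(combined) as f:
        result = f.read().split()
print(result)  # ['A', 'D', 'B', 'E', 'C', 'F']
```

The combined file holds each instance's rows in turn, so the original row order is not preserved; a final job that summarises across the whole set would not normally depend on it.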

ArndW · Post by **ArndW** » Sun Oct 02, 2005 4:05 am

Ray,

Just last week I came across a scenario where using the link partitioner made more sense than doing a multi-instance job.

The data flow came from a database table and some complex transformations were done to the data before writing it to a staging hashed file. The throughput was (I'll use rows/second to give scaling) about 400 rows/s on a large SMP machine.

It turns out that the database access was a full table select, and as the database was on a remote machine with a fixed bandwidth it made no sense to split that into separate queries. But after some tuning of the complex transformations and using a link partitioner to split the data into 4 parallel links, the performance went up to 5000 rows/s.

In this case the process was 100% CPU bound, so splitting the computations across several processes balanced the load so that the job was using all of its I/O potential without having to wait for a single process's CPU. I think this is one of the few types of cases where I see link partitioning to be advantageous.
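
The shape of that job (one reader, a partitioner feeding 4 links, and a collector) can be sketched in Python as below. This is an illustration only: `transform` is a hypothetical placeholder for the complex transformations, and threads are used just to show the structure. In CPython, threads do not give real CPU parallelism, so for genuinely CPU-bound work a `ProcessPoolExecutor` would be the closer analogue.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    """Placeholder for the CPU-heavy transformation (hypothetical)."""
    return row * row

def process_link(rows):
    """Work done on one link: transform every row it receives."""
    return [transform(r) for r in rows]

def partition_process_collect(rows, n_links=4):
    # Partition: round-robin the single input stream across n_links lists.
    links = [rows[i::n_links] for i in range(n_links)]
    # Process: one worker per link.
    with ThreadPoolExecutor(max_workers=n_links) as pool:
        results = list(pool.map(process_link, links))
    # Collect: merge link by link, so the original row order is not preserved.
    return [row for link in results for row in link]

collected = partition_process_collect(list(range(8)))
print(sorted(collected))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The single read of the source stays untouched; only the transformation step is fanned out.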

ray.wurlod · Post by **ray.wurlod** » Sun Oct 02, 2005 4:49 pm

I'll always keep an open mind (which is different from a hole in the head). Sure, there will be exceptions to every "rule", and Arnd has highlighted a good one. I probably would have gone the same way there, or I may have split the task to load into a single local source (a text file would do, and fixed-width is best) and run multi-instance jobs using that as source, each instance processing a different subset of rows.
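
A sketch of how each instance might pick its own subset from a fixed-width file is below. One possible reason fixed-width is convenient (an assumption, not stated in the post) is that a record's byte offset can be computed directly, so an instance can seek to its block without scanning the rows before it. The record length, file name, and `read_subset` function are hypothetical.

```python
import os
import tempfile

RECORD_LEN = 8  # hypothetical: 7 data bytes plus a newline

def read_subset(path, instance, n_instances, record_len=RECORD_LEN):
    """Return the contiguous block of records belonging to one instance."""
    total = os.path.getsize(path) // record_len
    per_instance = -(-total // n_instances)  # ceiling division
    start = instance * per_instance
    count = max(0, min(per_instance, total - start))
    with open(path, "rb") as f:
        f.seek(start * record_len)  # jump straight to this instance's first record
        data = f.read(count * record_len)
    return [data[i:i + record_len].decode().rstrip()
            for i in range(0, len(data), record_len)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "source.dat")
    with open(path, "wb") as f:
        for i in range(10):
            f.write(f"row{i:04d}\n".encode())  # exactly RECORD_LEN bytes each
    subsets = [read_subset(path, i, 3) for i in range(3)]
print([len(s) for s in subsets])  # [4, 4, 2]
```

Each instance would be started with its own `instance` number (for example as a job parameter) and the same `n_instances`.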