Setting Reader per node on sequential file

seanc217 · Post by **seanc217** » Tue Oct 24, 2006 10:31 am

Hi there,

What is an optimal setting for this?
If the answer is "it depends", then what is the process for finding the optimal setting?

Do I continually increment it until I see no performance gain?

Thanks for the help!!

Kirtikumar · Post by **Kirtikumar** » Wed Oct 25, 2006 11:46 pm

Have you checked why this is used and when it can be used?

This option should be used only if your sequential file is fixed width.

Kirtikumar · Post by **Kirtikumar** » Thu Oct 26, 2006 12:25 am

Again, this setting is like the number of nodes in the config file.

You have to find the optimum value for it by repeated execution with different values.
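
The trial-and-error approach could be sketched as a small loop like the one below. This is only an illustration: `run_job` is a hypothetical callable that runs the job with a given readers-per-node value and returns the elapsed seconds (it is not a DataStage API), and the 5% threshold is an arbitrary assumption.

```python
from typing import Callable, Iterable, Tuple


def find_best_readers(
    run_job: Callable[[int], float],
    candidates: Iterable[int],
    min_gain: float = 0.05,
) -> Tuple[int, float]:
    """Try increasing reader counts; stop once the gain drops below min_gain.

    run_job is a hypothetical stand-in: it should run the job with the given
    number of readers per node and return the elapsed time in seconds.
    """
    best_n, best_time = None, float("inf")
    for n in candidates:
        elapsed = run_job(n)
        if best_n is not None and elapsed > best_time * (1.0 - min_gain):
            # Improvement is smaller than min_gain (or the run got slower): stop.
            break
        best_n, best_time = n, elapsed
    if best_n is None:
        raise ValueError("candidates must not be empty")
    return best_n, best_time
```

Timings on a shared server are noisy, so in practice each value would probably need several runs before comparing.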

kumar_s · Post by **kumar_s** » Thu Oct 26, 2006 12:35 am

If the source is fixed width, ideally it should be set to the maximum number of readers that you can include, which is limited by your config file. In practice, several other factors can come into the picture, such as network traffic from each node to the disk.

Kirtikumar · Post by **Kirtikumar** » Thu Oct 26, 2006 2:10 am

Kumar, is the number of readers per node limited by the number of nodes in the config file?

My observation is that it does not depend on the number of nodes. That is, I can define 2 nodes in the config file and still have 3 readers per node in the Sequential File stage. If there is only one file to be read, PX creates 3 readers on node1 only (no reader on node2), and each reader then reads part of the file.

kumar_s · Post by **kumar_s** » Thu Oct 26, 2006 2:19 am

It is the number of readers per node. Hence if you specify 3 readers, it introduces 3 Sequential File read operators for each node, so you will have 6 in the case of 2 nodes.
But effectively all 3 readers will be reading a single file on a single node, mostly using a single CPU.

Kirtikumar · Post by **Kirtikumar** » Thu Oct 26, 2006 2:24 am

So that means only 3 nodes will be generated, and if one more file is provided in the property then it will generate 6 nodes.
I just tested it on a comma-separated file.

Observations: even though the file was CSV, there was still a performance improvement, and all 3 readers were each reading some of the records in it.
Now I don't understand why IBM says that this should be used only in the case of fixed-width files. Does anyone have any idea about this?

kumar_s · Post by **kumar_s** » Thu Oct 26, 2006 3:09 am

Nodes will not be generated, but the readers (operators) are.
The reason for the fixed-width recommendation would be that it can use memory-mapped I/O. With a delimited file, the reader has to read sequentially to find the delimiter first.

ray.wurlod · Post by **ray.wurlod** » Thu Oct 26, 2006 9:18 am

If you specify N readers per node, then each process will read 1/Nth of the rows from the file. A sequential file is, by default, read only on one node (that is, the stage executes in sequential mode) and the data are partitioned subsequently.

If you have multiple File properties, or you are reading a file pattern, then more nodes may become involved.

tagnihotri · Post by **tagnihotri** » Sat Oct 28, 2006 7:01 am

Guys, so is it possible to find an approximate relationship between the number of nodes and the number of readers? I was surprised, though, by the fact that using multiple readers does start multiple operators, and that they actually read the file in parallel, that is, different sets of data at the same time.

ray.wurlod · Post by **ray.wurlod** » Sat Oct 28, 2006 2:03 pm

Yes, N times on the same node. On each node, if more than one. Each reader process reads 1/Nth of the rows from the file.

Multiple readers per node is a technique for applying more than one process to reading a sequential file (which has to be read sequentially). At the beginning, the size of the file is determined, along with the locations within the file of the 1/N offsets. Each process begins reading at one of these and stops at the next.

For fixed-width format files it's easy, as the offsets are a whole number of lines.
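
As a rough illustration of why fixed width is the easy case, the offsets can be computed with arithmetic alone. This is a sketch under stated assumptions, not the engine's actual implementation: the function name is made up for the example, and `record_len` is assumed to include any record terminator.

```python
from typing import List, Tuple


def fixed_width_ranges(
    file_size: int, record_len: int, readers: int
) -> List[Tuple[int, int]]:
    """Return one (start, end) byte range per reader, aligned to whole records.

    Illustrative only; record_len is assumed to include any terminator.
    """
    if record_len <= 0 or readers <= 0:
        raise ValueError("record_len and readers must be positive")
    if file_size % record_len != 0:
        raise ValueError("file size is not a whole number of records")
    total_records = file_size // record_len
    ranges = []
    for i in range(readers):
        # Integer arithmetic spreads any remainder records across the readers.
        first = total_records * i // readers
        last = total_records * (i + 1) // readers
        ranges.append((first * record_len, last * record_len))
    return ranges
```

For example, a 1000-byte file of 10-byte records split across 3 readers gives the ranges (0, 330), (330, 660) and (660, 1000); no reader has to look at the data to find its starting point.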

For variable (delimited) format files, it's more difficult: having located the theoretical offset, each process has to scan ahead to the next record delimiter and begin from there.

It follows that variable format files with no record delimiters are not a supported format for multiple readers per node.
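
The scan-ahead step for delimited files could look something like the sketch below. It is only meant to show the idea, not how the engine does it: the function names are invented for the example, and a single-byte record delimiter (newline by default) is assumed. Without a record delimiter there would be nothing for the scan to find, which is why that case cannot work.

```python
import os
from typing import BinaryIO, List, Tuple


def next_record_start(f: BinaryIO, offset: int, size: int, delim: bytes = b"\n") -> int:
    """Return the offset of the first record that starts at or after `offset`."""
    if len(delim) != 1:
        raise ValueError("this sketch assumes a single-byte record delimiter")
    if offset <= 0:
        return 0
    if offset >= size:
        return size
    # Start one byte early so a delimiter sitting just before `offset` counts.
    pos = offset - 1
    f.seek(pos)
    while True:
        chunk = f.read(4096)
        if not chunk:
            return size  # no further delimiter: nothing left for this reader
        idx = chunk.find(delim)
        if idx != -1:
            return pos + idx + 1
        pos += len(chunk)


def delimited_ranges(path: str, readers: int, delim: bytes = b"\n") -> List[Tuple[int, int]]:
    """Split a delimited file into one (start, end) byte range per reader."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        starts = [
            next_record_start(f, size * i // readers, size, delim)
            for i in range(readers)
        ]
    # Each reader stops where the next one begins; the last stops at end of file.
    return list(zip(starts, starts[1:] + [size]))
```

Each range begins on a record boundary, so every record is read by exactly one reader; if records are long relative to the file, some ranges may come out empty.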