Setting Reader per node on sequential file

seanc217
Premium Member
Posts: 188
Joined: Thu Sep 15, 2005 9:22 am

Setting Reader per node on sequential file

Post by seanc217 »

Hi there,

What is an optimal setting for this?
If the answer is "it depends", then what is the process for finding the optimal setting?

Do I continually increment it until I see no performance gain?

Thanks for the help!!
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

Have you checked why this is used and when it can be used?

This option should be used only if your sequential file is fixed width.
Regards,
S. Kirtikumar.
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

Again, this setting is like the number of nodes in the config file.

You have to find the optimum value for it by repeated execution with different values.
Regards,
S. Kirtikumar.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If the source is fixed width, ideally it should be set to the maximum number of readers that you can include, which is limited by your config file. In practice, several other factors may come into the picture, such as network traffic from each node to the disk.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

Kumar, is the number of readers per node limited by the number of nodes in the config file?

My observation is that it does not depend on the number of nodes. For example, I can define 2 nodes in the config file and set 3 readers per node in the Sequential File stage. If there is only one file to be read, PX creates 3 readers on node1 only (no reader on node2), and each reader then reads some part of the file.
Regards,
S. Kirtikumar.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

It is the number of readers per node. Hence if you specify 3 readers, it introduces 3 Sequential File read operators on each node, so you will have 6 in the case of 2 nodes.
But effectively all 3 readers will be reading a single file on a single node, mostly using a single CPU.
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

So that means only 3 nodes will be generated, and if one more file is provided in the property then it will generate 6 nodes.
I just tested it on a comma-separated file.

Observations: even though the file was CSV, there was still a performance improvement, and all 3 readers were each reading some of the records in it.
Now I don't understand why the IBM people say that it should be used only for fixed-width files. Does anyone have any idea about this?
Regards,
S. Kirtikumar.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Nodes will not be generated, but the readers (operators) are.
The reason for the fixed-width restriction would be that it can use memory-mapped I/O. In the case of a delimited file, the reader has to read sequentially to find the delimiter first.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you specify N readers per node, then each process will read 1/Nth of the rows from the file. A sequential file is, by default, read only on one node (that is, the stage executes in sequential mode) and the data are partitioned subsequently.

If you have multiple File properties, or you are reading a file pattern, then more nodes may become involved.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

Guys, is it possible to find an approximate relationship between the number of nodes and the number of readers? I was surprised, though, by the fact that using multiple readers does start multiple operators, and that they actually read the file in parallel, each working on a different set of data at the same time.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Yes, N times on the same node. On each node, if more than one. Each reader process reads 1/Nth of the rows from the file.

Multiple readers per node is a technique for applying more than one process to reading a sequential file (which has to be read sequentially). At the beginning the size of the file is determined and the location within the file of the 1/N offsets. Each process begins reading at one of these and stops at the next.

For fixed-width format files it's easy, as the offsets are a whole number of lines.
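
To illustrate the fixed-width case, here is a minimal Python sketch of the offset arithmetic. It is not how the PX import operator is implemented; the function name `fixed_width_ranges` and its parameters are made up for this example, and `record_len` is assumed to include any record delimiter.

```python
def fixed_width_ranges(file_size, record_len, n_readers):
    """Split a fixed-width file into n_readers byte ranges.

    Hypothetical helper for illustration only. Every boundary is a whole
    number of records, so no reader ever starts in the middle of a row.
    """
    if record_len <= 0 or n_readers <= 0:
        raise ValueError("record_len and n_readers must be positive")
    if file_size % record_len != 0:
        raise ValueError("file size is not a whole number of records")

    n_records = file_size // record_len
    # Boundary i sits after (n_records * i // n_readers) complete records.
    bounds = [(n_records * i // n_readers) * record_len
              for i in range(n_readers + 1)]
    # Reader k reads bytes [bounds[k], bounds[k + 1]).
    return list(zip(bounds[:-1], bounds[1:]))


print(fixed_width_ranges(1000, 10, 3))
# [(0, 330), (330, 660), (660, 1000)]
```

With 100 records of 10 bytes and 3 readers, the readers get 33, 33 and 34 records, and each can seek straight to its starting offset without looking at the data.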

For variable (delimited) format files, it's difficult - having located the theoretical offset each process has to scan ahead to the next record delimiter and begin from there.

It follows that variable-format files with no record delimiters are not a supported format for multiple readers per node.
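
The scan-ahead step for delimited files can be sketched in Python as well. Again this is only an illustration under stated assumptions, not the actual PX code: the names `scan_to_next_record` and `reader_ranges` are hypothetical, and the record delimiter is assumed to be a single byte (a newline by default).

```python
import os


def scan_to_next_record(f, offset, size, delim=b"\n", chunk=65536):
    """Return the offset of the first record that starts after `offset`.

    Assumes a single-byte record delimiter. Returns `size` if no
    delimiter is found before the end of the file.
    """
    f.seek(offset)
    pos = offset
    while pos < size:
        buf = f.read(chunk)
        if not buf:
            break
        idx = buf.find(delim)
        if idx != -1:
            return pos + idx + len(delim)
        pos += len(buf)
    return size


def reader_ranges(path, n_readers, delim=b"\n"):
    """Split a delimited file into n_readers record-aligned byte ranges."""
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_readers):
            theoretical = size * i // n_readers  # the 1/N offset
            starts.append(scan_to_next_record(f, theoretical, size, delim))
    bounds = starts + [size]
    # Reader k reads bytes [bounds[k], bounds[k + 1]); a range can be
    # empty if a single record spans more than one theoretical offset.
    return list(zip(bounds[:-1], bounds[1:]))
```

Each reader starts at the first record boundary after its theoretical offset and stops where the next reader starts, so every record is read exactly once. Without a record delimiter there is nothing to scan for, which is why that format cannot be split this way.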