Setting Reader per node on sequential file
Hi there,
What is an optimal setting for this?
If the answer is "it depends", then what is the process for finding the optimal setting?
Do I continually increment it until I see no performance gain?
Thanks for the help!!
-
- Participant
- Posts: 437
- Joined: Fri Oct 15, 2004 6:13 am
- Location: Pune, India
If the source is fixed width, ideally it should be set to the maximum number of readers that you can include, which is limited by your config file. In practice several other factors may come into the picture, such as the network traffic from each node to the disk.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Participant
- Posts: 437
- Joined: Fri Oct 15, 2004 6:13 am
- Location: Pune, India
Kumar, is the number of readers per node limited by the number of nodes in the config file?
My observation is that it does not depend on the number of nodes. I can define 2 nodes in the config file and still have 3 readers per node in the Sequential File stage. If there is only one file to be read, PX creates 3 readers on node1 only (no reader on node2), and each reader then reads part of the file.
Regards,
S. Kirtikumar.
It is the number of readers per node. Hence if you specify 3 readers, it introduces 3 Sequential File read operators on each node, so you would have 6 in the case of 2 nodes.
But effectively all 3 readers will be reading a single file on a single node, mostly using a single CPU.
-
- Participant
- Posts: 437
- Joined: Fri Oct 15, 2004 6:13 am
- Location: Pune, India
So that means only 3 nodes will be generated, and if one more file is provided in the property then it will generate 6 nodes.
I just tested it on a comma-separated file.
Observations - even though the file was CSV, there was still a performance improvement, and all 3 readers were reading some of the records in it.
Now I don't understand why the IBM people say it should be used only for fixed-width files. Does anyone have any idea about this?
Regards,
S. Kirtikumar.
Nodes will not be generated, but the readers (operators) are.
The reason for recommending fixed-width files would be that they can use memory-mapped I/O. With a delimited file, the reader has to read sequentially to find the delimiter first.
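To make the fixed-width point concrete, here is a rough Python sketch. It is not how DataStage implements it; the function name `read_fixed_slice` and the record layout are made up for illustration. It assumes every record, including its newline, is exactly `record_len` bytes, so reader k of N can work out its slice by arithmetic alone and address it directly in a memory-mapped file.

```python
import mmap

def read_fixed_slice(path, record_len, reader_index, num_readers):
    """Return the records owned by reader `reader_index` (0-based) of `num_readers`.

    Hypothetical helper: assumes a non-empty file in which every record,
    newline included, occupies exactly `record_len` bytes.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total_records = len(mm) // record_len
        # Integer arithmetic gives each reader a contiguous block of records.
        first = total_records * reader_index // num_readers
        last = total_records * (reader_index + 1) // num_readers
        # No scanning: record i always starts at byte i * record_len.
        return [mm[i * record_len:(i + 1) * record_len]
                for i in range(first, last)]
```

Calling it once per reader index covers every record exactly once, with no reader needing to look at another reader's bytes.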
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
If you specify N readers per node, then each process will read 1/Nth of the rows from the file. A sequential file is, by default, read only on one node (that is, the stage executes in sequential mode) and the data are partitioned subsequently.
If you have multiple File properties, or you are reading a file pattern, then more nodes may become involved.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 83
- Joined: Sat Oct 28, 2006 6:25 am
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
Yes, N times on the same node. On each node, if more than one. Each reader process reads 1/Nth of the rows from the file.
Multiple readers per node is a technique for applying more than one process to reading a sequential file (which has to be read sequentially). At the beginning the size of the file is determined and the location within the file of the 1/N offsets. Each process begins reading at one of these and stops at the next.
For fixed-width format files it's easy, as the offsets are a whole number of lines.
For variable (delimited) format files, it's difficult - having located the theoretical offset each process has to scan ahead to the next record delimiter and begin from there.
It follows that variable format files with no record delimiters are not a supported format for multiple readers per node.
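The offset-and-scan-ahead idea for delimited files can be sketched in Python as below. This only illustrates the technique described above and is not the engine's code; `reader_ranges` and `read_range` are hypothetical names, and the sketch assumes newline is the record delimiter.

```python
import os

def reader_ranges(path, num_readers):
    """Return one (start, end) byte range per reader, aligned to record starts.

    Hypothetical helper; assumes records are terminated by a newline.
    """
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for k in range(1, num_readers):
            f.seek(size * k // num_readers)  # theoretical k/N offset
            f.readline()                     # scan ahead to the next record delimiter
            bounds.append(max(bounds[-1], f.tell()))
    bounds.append(size)
    # Each reader starts at one boundary and stops at the next.
    return list(zip(bounds[:-1], bounds[1:]))

def read_range(path, start, end):
    """Read the whole records that lie in the byte range [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()
```

Because each reader stops exactly where the next one begins, no record is read twice or skipped. On a small file, or one with very long records, some readers may end up with an empty range.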