Number of readers per node --
Hi everyone,
I have one parallel job designed as below:
File1 --> RemoveDuplicate --> JOIN1 --> Transformer --> JOIN2 --> Output
File2 --> RemoveDuplicate --> JOIN1
File3 --> RemoveDuplicate ---------------------------> JOIN2
Case 1:
File1, File2 and File3 are read normally (the 'number of readers per node' property is not enabled).
The number of output records was 1000.
Case 2:
The 'number of readers per node' property is enabled for all three files (File1, File2, File3).
The number of output records was 5000.
The jobs in the two cases are replicas; the only change is the file property.
Can anyone please let me know why the number of output records differs?
Thanks & Regards
Satwika
How many parallel nodes are running when you execute this job, and what partitioning algorithm have you chosen between your files and the Remove Duplicates stage? This might be your root problem.
What column are you hashing on and what column(s) are you using for your join?
Hi ArndW,
The Remove Duplicates stage uses hash partitioning with the internal sort option, on the columns below:
Col1
Col2
Col3
Col4
An inner join is performed on the columns below:
Col1
Col2
Col3
Col5
Col6
i.e. the hash is on 4 key columns and the join is on 5 columns, of which 3 (Col1, Col2, Col3) are common, as shown above.
Try the hash on the first 1, 2 or 3 columns (must be identical for Left and Right links).
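As a toy illustration (plain Python, not DataStage code, reusing the column names from the posts above): hashing on Col1..Col4 can scatter two rows that agree on every join key into different partitions whenever their Col4 values differ, so a partition-wise inner join never sees them together.

NUM_PARTITIONS = 4

def partition_of(row, key_cols):
    # Assign a row to a partition by hashing the chosen key columns.
    return hash(tuple(row[c] for c in key_cols)) % NUM_PARTITIONS

# Two rows that match on all five join keys (Col1, Col2, Col3, Col5, Col6)
# but differ on Col4, which is part of the hash key set.
left  = {"Col1": 1, "Col2": 2, "Col3": 3, "Col4": 10, "Col5": 5, "Col6": 6}
right = {"Col1": 1, "Col2": 2, "Col3": 3, "Col4": 99, "Col5": 5, "Col6": 6}

# Hashing on Col1..Col4: the Col4 mismatch can split the matching pair.
print(partition_of(left,  ["Col1", "Col2", "Col3", "Col4"]),
      partition_of(right, ["Col1", "Col2", "Col3", "Col4"]))   # may differ

# Hashing on a subset of the join keys (here just Col1) keeps the pair
# on the same partition, so the join can match them.
print(partition_of(left,  ["Col1"]),
      partition_of(right, ["Col1"]))                           # always equal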
Both input links must have identical hash columns and ordering, and must be sorted on the join keys.
To keep things simple, hash both input links to the join on "Col1" and then sort both on Col1, Col2, Col3, Col5 and Col6.
Alternatively, run the job with a 1-node configuration and see if the problems persist. If they do, you have a problem with your sorting; if they don't, your issue is with partitioning.
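In case it is useful, a single-node configuration file looks roughly like the sketch below (the host name and resource paths are placeholders; point the $APT_CONFIG_FILE environment variable at the file when you run the job).

{
    node "node1"
    {
        fastname "your_host_name"
        pools ""
        resource disk "/path/to/datasets" {pools ""}
        resource scratchdisk "/path/to/scratch" {pools ""}
    }
}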
There is no limit. The GUI may limit the number you can enter.
However, there are stupid values (for example 4,000,000 readers for 2,000,000 rows).
Think about the architecture. The Sequential File stage uses the STREAMS I/O module under the covers so, even with only one reader per node, you are going to be reading data at a pretty fast rate.
More than one reader per node will not generally help, except for very, very large numbers of rows. Typically the consumer stages in the job will limit the speed at which rows can be processed, not the Sequential File stage.
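As a rough sketch of that idea (plain Python, not what the engine actually runs, and a simplifying assumption on my part): multiple readers simply carve the file into byte ranges that are scanned in parallel. The real stage also aligns each split to a record boundary, which this sketch ignores.

import os, tempfile

def reader_ranges(path, nodes, readers_per_node):
    # One contiguous byte range per reader process across all nodes.
    total = os.path.getsize(path)
    n = nodes * readers_per_node
    chunk = total // n
    return [(i * chunk, total if i == n - 1 else (i + 1) * chunk)
            for i in range(n)]

# Demo: 2 nodes x 2 readers per node -> 4 ranges read concurrently.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
print(reader_ranges(f.name, nodes=2, readers_per_node=2))
os.remove(f.name)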
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Run your job in a 1-node configuration and see if the error remains. If it is still there with one node, then your partitioning is not at the root of the problem.