Number of readers per node --
Hi everyone,
I have one parallel job designed as below:
File1 --> RemoveDuplicate --> JOIN1 --> Transformer --> JOIN2 --> Output
File2 --> RemoveDuplicate --> JOIN1
File3 --> RemoveDuplicate ---------------------------> JOIN2
Case 1:
File1, File2 and File3 are read normally (the 'number of readers per node' property is not enabled).
The number of output records was 1000.
Case 2:
The 'number of readers per node' property is enabled for all three files (File1, File2, File3).
The number of output records was 5000.
The jobs in the two cases are replicas; the only change is the file property.
Can anyone please let me know why the number of output records differs?
Thanks & Regards
Satwika
How many parallel nodes are running when you execute this job, and what partitioning algorithm have you chosen between your files and the Remove Duplicates stage? This might be your root problem.
What column are you hashing on and what column(s) are you using for your join?
Hi ArndW,
The Remove Duplicates stage uses hash partitioning with the internal sort option, on the columns below:
Col1
Col2
Col3
Col4
An inner join is performed on the columns below:
Col1
Col2
Col3
Col5
Col6
i.e. the hash is on 4 key columns and the join is on 5 columns, of which 3 (Col1, Col2, Col3) are common, as shown above.
Try the hash on the first 1, 2 or 3 columns (must be identical for Left and Right links).
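As a toy illustration (plain Python, not DataStage code, reusing the column names from the posts above): hashing on Col1..Col4 can scatter two rows that agree on every join key into different partitions whenever their Col4 values differ, so a partition-wise inner join never sees them together.

NUM_PARTITIONS = 4

def partition_of(row, key_cols):
    # Assign a row to a partition by hashing the chosen key columns.
    return hash(tuple(row[c] for c in key_cols)) % NUM_PARTITIONS

# Two rows that match on all five join keys (Col1, Col2, Col3, Col5, Col6)
# but differ on Col4, which is part of the hash key set.
left  = {"Col1": 1, "Col2": 2, "Col3": 3, "Col4": 10, "Col5": 5, "Col6": 6}
right = {"Col1": 1, "Col2": 2, "Col3": 3, "Col4": 99, "Col5": 5, "Col6": 6}

# Hashing on Col1..Col4: the Col4 mismatch can split the matching pair.
print(partition_of(left,  ["Col1", "Col2", "Col3", "Col4"]),
      partition_of(right, ["Col1", "Col2", "Col3", "Col4"]))   # may differ

# Hashing on a subset of the join keys (here just Col1) keeps the pair
# on the same partition, so the join can match them.
print(partition_of(left,  ["Col1"]),
      partition_of(right, ["Col1"]))                           # always equal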
Both input links must have identical hash columns and ordering, and must be sorted on the join keys.
To keep things simple, hash both input links to the join on "Col1" and then sort both on Col1, Col2, Col3, Col5 and Col6.
Alternatively, run the job with a 1-node configuration and see if the problems persist. If they do, you have a problem with your sorting; if they don't, your issue is with partitioning.
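In case it is useful, a single-node configuration file looks roughly like the sketch below (the host name and resource paths are placeholders; point the $APT_CONFIG_FILE environment variable at the file when you run the job).

{
    node "node1"
    {
        fastname "your_host_name"
        pools ""
        resource disk "/path/to/datasets" {pools ""}
        resource scratchdisk "/path/to/scratch" {pools ""}
    }
}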
There is no limit. The GUI may limit the number you can enter.
However, there are stupid values (for example 4,000,000 readers for 2,000,000 rows).
Think about the architecture. The Sequential File stage uses the STREAMS I/O module under the covers so, even with only one reader per node, you are going to be reading data at a pretty fast rate.
More than one reader per node will not generally help, except for very, very large numbers of rows. Typically the consumer stages in the job will limit the speed at which rows can be processed, not the Sequential File stage.
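As a rough sketch of that idea (plain Python, not what the engine actually runs, and a simplifying assumption on my part): multiple readers simply carve the file into byte ranges that are scanned in parallel. The real stage also aligns each split to a record boundary, which this sketch ignores.

import os, tempfile

def reader_ranges(path, nodes, readers_per_node):
    # One contiguous byte range per reader process across all nodes.
    total = os.path.getsize(path)
    n = nodes * readers_per_node
    chunk = total // n
    return [(i * chunk, total if i == n - 1 else (i + 1) * chunk)
            for i in range(n)]

# Demo: 2 nodes x 2 readers per node -> 4 ranges read concurrently.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
print(reader_ranges(f.name, nodes=2, readers_per_node=2))
os.remove(f.name)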
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Run your job in a 1-node configuration and see if the error remains. If it is still there with one node, then your partitioning is not at the root of the problem.