Improve performance of Join of Data Sets

akarsh · Post by **akarsh** » Fri Mar 14, 2014 1:44 am

Hi All,

I am facing the same issue as described in this post.

I have two data sets as input to a Join stage, and it is processing only 276 rows/sec.

Below are the details:

Data Set 1
-----------

Total Records: 19199366
Total 32k Blocks: 19141
Total Bytes: 2430032178

| Node | Records | Blocks | Bytes |
|-------|---------|--------|------------|
| Node1 | 9593377 | 9566 | 1214500680 |
| Node2 | 9605989 | 9575 | 1215531492 |

Data Set 2
-----------

Total Records: 19199355
Total 32k Blocks: 23812
Total Bytes: 3041367308

| Node | Records | Blocks | Bytes |
|-------|---------|--------|------------|
| Node1 | 9597492 | 11903 | 1520355820 |
| Node2 | 9601863 | 11909 | 1521011488 |

Please suggest what can be done to improve performance.

I also added the two environment variables suggested by Ravi, keeping their default values, but it did not help.
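For a sense of scale, the figures quoted above can be turned into an average record size, a per-node balance check, and a rough elapsed time at the reported rate. This is only a back-of-the-envelope sketch in Python: the dictionary layout and function names are made up for illustration, and the elapsed time assumes the 276 rows/sec rate stays constant for the whole run.

```python
# Figures copied from the data set details in the post; the structure and
# function names here are illustrative assumptions, not DataStage output.
DATASETS = {
    "Data Set 1": {"records": 19_199_366, "bytes": 2_430_032_178,
                   "per_node": [9_593_377, 9_605_989]},
    "Data Set 2": {"records": 19_199_355, "bytes": 3_041_367_308,
                   "per_node": [9_597_492, 9_601_863]},
}


def avg_record_bytes(ds):
    """Average stored bytes per record."""
    return ds["bytes"] / ds["records"]


def node_skew(ds):
    """Ratio of the largest to the smallest per-node record count (1.0 = even)."""
    return max(ds["per_node"]) / min(ds["per_node"])


def hours_at_rate(records, rows_per_sec):
    """Elapsed hours to process `records` rows at a constant rate."""
    return records / rows_per_sec / 3600


if __name__ == "__main__":
    for name, ds in DATASETS.items():
        print(f"{name}: {avg_record_bytes(ds):.1f} bytes/record, "
              f"skew {node_skew(ds):.4f}, "
              f"{hours_at_rate(ds['records'], 276):.1f} h at 276 rows/sec")
```

On these numbers the records are spread almost evenly across the two nodes, so partition skew by itself does not look like the explanation for the low rate.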

thompsonp · Post by **thompsonp** » Fri Mar 14, 2014 2:27 am

Akarsh

Perhaps you could follow the advice already given and post your results.
What does the rest of the job look like and are the datasets already partitioned and sorted for the join? Are they being repartitioned / sorted?

Ravi has not responded to the advice given and you have just replicated a change he made but kept default values (which is presumably the same as not adding them). There's plenty of help in that thread if you choose to follow it.
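To show why the partitioning and sorting questions matter, here is a minimal Python sketch of a sort-merge full outer join. It is not how the Join stage is implemented; the function names and sample rows are hypothetical. It only illustrates the general idea that when both inputs arrive sorted on the key, and are partitioned on the same key, each partition can be joined in a single streaming pass with no extra sort or repartition.

```python
from itertools import groupby
from operator import itemgetter


def hash_partition(rows, nodes):
    """Split (key, value) rows across `nodes` partitions by key hash.

    Order within each partition is preserved, so sorted input stays sorted.
    """
    parts = [[] for _ in range(nodes)]
    for key, value in rows:
        parts[hash(key) % nodes].append((key, value))
    return parts


def full_outer_join(left, right):
    """Sort-merge full outer join of two key-sorted lists of (key, value).

    Yields (key, left_value, right_value); a missing side is None.
    """
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    lcur = next(lgroups, None)
    rcur = next(rgroups, None)
    while lcur is not None or rcur is not None:
        if rcur is None or (lcur is not None and lcur[0] < rcur[0]):
            # Key exists only on the left side.
            for _, lv in lcur[1]:
                yield (lcur[0], lv, None)
            lcur = next(lgroups, None)
        elif lcur is None or rcur[0] < lcur[0]:
            # Key exists only on the right side.
            for _, rv in rcur[1]:
                yield (rcur[0], None, rv)
            rcur = next(rgroups, None)
        else:
            # Key exists on both sides: emit every left/right combination.
            rvals = [rv for _, rv in rcur[1]]
            for _, lv in lcur[1]:
                for rv in rvals:
                    yield (lcur[0], lv, rv)
            lcur = next(lgroups, None)
            rcur = next(rgroups, None)


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (4, "d")]
    right = [(2, "B"), (3, "C"), (4, "D"), (5, "E")]
    for row in full_outer_join(left, right):
        print(row)
```

If either input is not already sorted and partitioned on the join key, that work has to happen first, which is why it helps to know whether the job is inserting sorts or repartitioning upstream of the join.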

chulett · Post by **chulett** » Fri Mar 14, 2014 7:15 am

Similar != same, so now you have your own post to track your issue... which I don't believe is related to 'slow reading'. Thompsonp's questions need to be answered.

akarsh · Post by **akarsh** » Fri Mar 14, 2014 9:41 am

Hi thompsonp,

The job contains only the Join stage, and the Join uses Same partitioning.
The speed is 1400-1600 rows/sec.

I have also changed the buffer at the Join and set it to 6 MB.

The input metadata is around 500 bytes per record and the output is around 1000 bytes, since the job uses a full outer join.

Earlier it was a delete-then-insert job. The delete was taking a long time, approximately 19 hours, so I am changing it to a truncate and load, using a full outer join to preserve the data that should not be deleted.