Improving a large Join

jusami25 · Post by **jusami25** » Thu Oct 02, 2014 1:47 am

Hi guys!

I have 2 Netezza stages with 500 Millions (left flow, with 200 columns) and 50 Millions of records (right flow, with 10 columns) each one. Then, I have a Left Join between that two flows, ending in a Netezza writing stage.

I use the Balanced Optimization Tool to improve the performance in this job, but even with this, the process is taking so much time to do the join (~40 minutes).

I dont have any partition on the input flows of the join, and i am using ORDER BY clause in the Netezza Stages to having the records sorted by the 3 join key columns: COUNTRY (INT), MONTH (INT), ID (BIGINT).

Any ideas? What can i try now?

jusami25 · Post by **jusami25** » Mon Oct 06, 2014 3:00 am

Nothing?

priyadarshikunal · Post by **priyadarshikunal** » Mon Oct 06, 2014 4:04 am

Did you check the score to find out if any sort is inserted before join by datastage which is sorting your data again?

jusami25 · Post by **jusami25** » Mon Oct 06, 2014 4:33 am

Yes, and i removed it to improve almost 3 minutes the join process.

Is it possible that executing a simple "SELECT * FROM TABLE1 left join SELECT * FROM TABLE2" between this two large tables is faster than the same operation using Balanced Optimization?

jerome_rajan · Post by **jerome_rajan** » Mon Oct 06, 2014 6:33 am

I dont have any partition on the input flows of the join

Why?

I use the Balanced Optimization Tool to improve the performance in this job

Is it possible that executing a simple "SELECT * FROM TABLE1 left join SELECT * FROM TABLE2" between this two large tables is faster than the same operation using Balanced Optimization?

I am keen to see what the output of your Balanced Optimization was. I thought that BO typically pushed the DataStage join to a Database join akin to the left outer query you posted.

jusami25 · Post by **jusami25** » Mon Oct 06, 2014 9:40 am

jerome_rajan wrote:Why?

Cause i think the BO don't consider the partition options, only if we sort the input data in the stages, BO perform an ORDER BY clause in the optimized query.

Analizing the optimized query, and the whole netezza stage itself, i cant see any reference to the partition options. I think that the BO doesn't consider the partition options, only if we sort the input data in the stages too we can see the ORDER BY clauses that i've been talking about.

jerome_rajan wrote:I am keen to see what the output of your Balanced Optimization was. I thought that BO typically pushed the DataStage join to a Database join akin to the left outer query you posted.

The esence of the optimized query is the same than the original one, but BO adds some SELECT, CAST and ORDER BY clauses. Even if i delete all of the clauses leaving only the SELECT - LEFT JOIN - SELECT clauses inside the Netezza stage, and i launch that job, Datastage is slower than the SELECT - LEFT JOIN - SELECT original SQL.

ray.wurlod · Post by **ray.wurlod** » Mon Oct 06, 2014 3:34 pm

Perhaps, then, the original SQL's join is supported by indexes, a luxury that DataStage does not have (and the reason that DataStage requires inputs to the Join stage to be sorted on the join key).

jusami25 · Post by **jusami25** » Tue Oct 07, 2014 2:22 am

Hi Ray, thanks for your reply.

The Netezza database doesn't have indexes, only distribution keys. BO is generating a job that contains a RowGenerator that executes a Netezza Stage with all the process included into the AFTER-SQL field.

priyadarshikunal · Post by **priyadarshikunal** » Tue Oct 07, 2014 3:39 am

Did you try using join in netezza itself and taking only those fields you need. It depends which server (DataStage or Netezza) has more processing capacity and if the distribution keys are defined correctly to support the joins. It will be more efficient and will take less time if Netezza server has good capacity and correct distribution keys are defined.

Distribution keys help the joins in Netezza in the same way indexes does in other RDBMS. As Netezza has AMPP architecture distribution keys is supposed to do a better job in its case and hence the index is not used in it.

priyadarshikunal · Post by **priyadarshikunal** » Tue Oct 07, 2014 3:43 am

This is a write-up on developer works about indexes and distribution keys

https://www.ibm.com/developerworks/comm ... 12?lang=en