Improving a large Join

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
jusami25
Premium Member
Premium Member
Posts: 84
Joined: Tue Oct 26, 2004 12:49 am

Improving a large Join

Post by jusami25 »

Hi guys!

I have 2 Netezza stages with 500 Millions (left flow, with 200 columns) and 50 Millions of records (right flow, with 10 columns) each one. Then, I have a Left Join between that two flows, ending in a Netezza writing stage.

I use the Balanced Optimization Tool to improve the performance in this job, but even with this, the process is taking so much time to do the join (~40 minutes).

I dont have any partition on the input flows of the join, and i am using ORDER BY clause in the Netezza Stages to having the records sorted by the 3 join key columns: COUNTRY (INT), MONTH (INT), ID (BIGINT).

Any ideas? What can i try now?
jusami25
Premium Member
Premium Member
Posts: 84
Joined: Tue Oct 26, 2004 12:49 am

Post by jusami25 »

Nothing?
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Did you check the score to find out if any sort is inserted before join by datastage which is sorting your data again?
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
jusami25
Premium Member
Premium Member
Posts: 84
Joined: Tue Oct 26, 2004 12:49 am

Post by jusami25 »

Yes, and i removed it to improve almost 3 minutes the join process.

Is it possible that executing a simple "SELECT * FROM TABLE1 left join SELECT * FROM TABLE2" between this two large tables is faster than the same operation using Balanced Optimization?
jerome_rajan
Premium Member
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

I dont have any partition on the input flows of the join
Why?
I use the Balanced Optimization Tool to improve the performance in this job
Is it possible that executing a simple "SELECT * FROM TABLE1 left join SELECT * FROM TABLE2" between this two large tables is faster than the same operation using Balanced Optimization?
I am keen to see what the output of your Balanced Optimization was. I thought that BO typically pushed the DataStage join to a Database join akin to the left outer query you posted.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
jusami25
Premium Member
Premium Member
Posts: 84
Joined: Tue Oct 26, 2004 12:49 am

Post by jusami25 »

jerome_rajan wrote:Why?
Cause i think the BO don't consider the partition options, only if we sort the input data in the stages, BO perform an ORDER BY clause in the optimized query.

Analizing the optimized query, and the whole netezza stage itself, i cant see any reference to the partition options. I think that the BO doesn't consider the partition options, only if we sort the input data in the stages too we can see the ORDER BY clauses that i've been talking about.
jerome_rajan wrote:I am keen to see what the output of your Balanced Optimization was. I thought that BO typically pushed the DataStage join to a Database join akin to the left outer query you posted.
The esence of the optimized query is the same than the original one, but BO adds some SELECT, CAST and ORDER BY clauses. Even if i delete all of the clauses leaving only the SELECT - LEFT JOIN - SELECT clauses inside the Netezza stage, and i launch that job, Datastage is slower than the SELECT - LEFT JOIN - SELECT original SQL.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Perhaps, then, the original SQL's join is supported by indexes, a luxury that DataStage does not have (and the reason that DataStage requires inputs to the Join stage to be sorted on the join key).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jusami25
Premium Member
Premium Member
Posts: 84
Joined: Tue Oct 26, 2004 12:49 am

Post by jusami25 »

Hi Ray, thanks for your reply.

The Netezza database doesn't have indexes, only distribution keys. BO is generating a job that contains a RowGenerator that executes a Netezza Stage with all the process included into the AFTER-SQL field.
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Did you try using join in netezza itself and taking only those fields you need. It depends which server (DataStage or Netezza) has more processing capacity and if the distribution keys are defined correctly to support the joins. It will be more efficient and will take less time if Netezza server has good capacity and correct distribution keys are defined.

Distribution keys help the joins in Netezza in the same way indexes does in other RDBMS. As Netezza has AMPP architecture distribution keys is supposed to do a better job in its case and hence the index is not used in it.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

This is a write-up on developer works about indexes and distribution keys

https://www.ibm.com/developerworks/comm ... 12?lang=en
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Post Reply