Performance Issue

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Just so we're all on the same page, if you do this:

Code: Select all

select count(*) from 
(
SELECT A.*, B.*, C.* FROM TABLE_A A, TABLE_B B, TABLE_C C
WHERE A.XYZ=B.XYZ AND B.MNO=C.MNO 
AND B.EFFDT BETWEEN C.EFFDT AND C.ENDDT
)
I just verified that it doesn't change the original explain plan, the cost, or the cardinality; it just adds a SORT AGGREGATE over the VIEW it creates to get you the count. So either method should be fine, I would think.
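If anyone wants to verify it on their own system, something along these lines should do it (assuming your Oracle release has DBMS_XPLAN; the table names are just the placeholders from above):

Code: Select all

explain plan for
select count(*) from 
(
SELECT A.*, B.*, C.* FROM TABLE_A A, TABLE_B B, TABLE_C C 
WHERE A.XYZ=B.XYZ AND B.MNO=C.MNO 
AND B.EFFDT BETWEEN C.EFFDT AND C.ENDDT
);

select * from table(dbms_xplan.display);
Run the same two statements again without the count(*) wrapper and compare the output.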
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Your query was probably not complicated enough. I've done this loads of times. You also get more efficient counts if you do SELECT COUNT(1) for the exact same reason.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

The SELECT COUNT(1) thing is a myth.

Oracle is getting a bit clever nowadays with the SELECT COUNT(*) FROM (...) wrapper. It can almost always work out which columns will not actually be projected, and then not select them.

As a workaround, in SQL*Plus you can save your SELECT statement in a script, say a.sql, and then run

Code: Select all

set pages 0 lines 32767 trimspool on termout off
@a.sql
set termout on
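If you also want the elapsed time reported, one option (just a sketch, using SQL*Plus's interactive TIMING command so the report isn't swallowed by termout off) is:

Code: Select all

timing start fetch_test
set pages 0 lines 32767 trimspool on termout off
@a.sql
set termout on
timing stop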
Ross Leishman
I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

rleishman wrote:The SELECT COUNT(1)thing is a myth.
Many thanks for busting the myth. :) Now, I can confront one of my buddies who still believes in it.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Yeah, us old farts who worked through the releases of Oracle where the optimizer was dumber than a stump. :wink: I get enough grief from folks who look at my code and wonder why I do things a certain way; my ready answer is that it's old school.

Remember RBO versus CBO? The optimizer has gotten a lot smarter, as long as statistics are valid and current.


Back to the point of the posting: I may have strayed further off-topic than was helpful, but I was stressing that comparing DataStage's ability to receive data against Toad's ability to display prefetched data is not a fair comparison. Plus, I was also trying to say don't get caught doing a SELECT COUNT(*), because that's faulty as well.

The ORDER BY forces the result set to be collected, which gives you a measure of how quickly the database can get the data ready to spool. After that, you need to measure whether you are network bound sending the data to the receiving DS job, CPU bound in the receiving DS job, or slowed by back-pressure from downstream logic. By timing how long it takes to get rows after ORDERing, you can set the expectation for the time to fetch and then deal with the time to spool. Unless someone wants to argue about the ORDERing suggestion...
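To illustrate with made-up table and column names, the trick is simply to tack an ORDER BY onto the job's SELECT so the database has to assemble the whole result set before it can hand back the first row:

Code: Select all

-- hypothetical example: the ORDER BY makes the database build the full
-- result set before the first row comes back to the client
SELECT A.*, B.*
FROM TABLE_A A, TABLE_B B
WHERE A.XYZ = B.XYZ
ORDER BY B.EFFDT;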
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kcbland wrote:This poster's problem is the all-in-one job that new users always write. They never understand that their transformation pipeline runs at the speed of the slowest consuming process. In this case, it's obviously the receiving database putting back-pressure all the way up the pipeline.
I thought this was the point. Everything else was just... tangential. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

I'm sorry, did I miss the meeting where we decided we had to stay on-topic and avoid interesting conversations? ;)

I came in late and cannot see some of the posts, so I'm not sure whether we are saying that ORDER BY is a good method or a bad method of ensuring all rows are fetched.

To be clear, it is BAD for two reasons:
1. It will almost always ADD to the runtime, often by orders of magnitude for very large data sets.
2. In one case, it doesn't actually fetch all of the rows - see below. Note that the plans both with and without the ORDER BY are identical: there is no SORT step, because Oracle is using an index to sort the results.

Code: Select all

  1  explain plan set statement_id='    190427' for
  2  select *
  3  from ef_actl_expns
  4  where src_sys > 'J'
  5* order by src_sys, ldgr_id

Explained.

4      SELECT STATEMENT (Optimizer='CHOOSE')
1        TABLE ACCESS (BY GLOBAL INDEX ROWID) of 'EF_ACTL_EXPNS'[EF_ACTL_EXPNS@SEL$1] (1 rows)
1          INDEX (RANGE SCAN) of 'EF_AEXP_PK'[EF_ACTL_EXPNS@SEL$1] (1 rows)

Code: Select all

  1  explain plan set statement_id='    190427' for
  2  select *
  3  from ef_actl_expns
  4* where src_sys > 'J'

Explained.

4      SELECT STATEMENT (Optimizer='CHOOSE')
1        TABLE ACCESS (BY GLOBAL INDEX ROWID) of 'EF_ACTL_EXPNS'[EF_ACTL_EXPNS@SEL$1] (1 rows)
1          INDEX (RANGE SCAN) of 'EF_AEXP_PK'[EF_ACTL_EXPNS@SEL$1] (1 rows)
Perhaps a better way to time data retrieval without the network overhead of fetching all of the rows to the client is to wrap the SQL per Craig's suggestion, but use a MIN or MAX on every projected column:

Code: Select all

SELECT MIN(col1), MIN(col2), ....
FROM (
    SELECT col1, col2 ....
    ..... rest of your query ....
)
Note that you cannot use COUNT. The CBO is clever enough to know that COUNT(NotNullCol) is equivalent to COUNT(*).
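Applied to the three-table join from earlier in the thread (the column list here is only illustrative; you would list every column the job actually selects), it would look something like:

Code: Select all

SELECT MIN(XYZ), MIN(MNO), MIN(ENDDT)
FROM (
    SELECT A.XYZ, B.MNO, C.ENDDT
    FROM TABLE_A A, TABLE_B B, TABLE_C C
    WHERE A.XYZ = B.XYZ AND B.MNO = C.MNO
    AND B.EFFDT BETWEEN C.EFFDT AND C.ENDDT
)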
Ross Leishman
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You're missing the commentary where I'm talking about complex queries. Simple queries often don't change their plans as in your example, but when you have complex queries (including columns derived as subqueries), bypassing the columns can have profound performance differences.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

Agreed; you cannot bypass the columns, which is what COUNT(*) does (these days). But ORDER BY is wrong for different reasons.

The method described above (MIN every column) forces Oracle to read every projected column (including user-defined functions and scalar sub-queries) without the additional overhead of writing masses of data to temp segments.

Of course, if the original SQL actually has an ORDER BY, we're in big trouble. The optimizer may or may not choose to honour the nested ORDER-BY. But I digress... ;)

I will now attempt a trick famous across IT forums worldwide - the backflip!

Forget all the other methods! Go to SQL*Plus (I know you've got it, you just don't use it!) and type:

Code: Select all

SET AUTOTRACE TRACEONLY
and now run your original query in its original form. Oracle will do all the work and retrieve all of the data, but none of it will be brought across the network. Since the SQL is unchanged, the plan will be the same (providing you are using the same session parameters).
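For example, re-using the a.sql script from before (set timing on is just an extra convenience for capturing the elapsed time):

Code: Select all

set autotrace traceonly
set timing on
@a.sql
set autotrace off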

Now, does anyone have any complaints about that?
Ross Leishman
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

No complaints, just a hat tip. The point is to refute measuring first-row-returned performance as a stand-in for what the user could expect when spooling the full result set.

If it takes 1 millisecond to get the first row in Toad, how do you trick Toad into telling you how long it would take to get the last row? If we imagine it takes 0 seconds to output the total result set, the only question is how long it would take the database to create the result set.

I had suggested using the ORDER BY as a trick. Maybe rleishman's suggestion gives a better result? Does it return 0 rows? Can we just put that into the BEFORE-SQL tab in the job? We sometimes just put @FALSE into the first Transformer stage output constraint, so you actually force the database to gather up and send all results across the network. This gives you the top-level spool performance of the network and the DataStage CPU speed, but I'm looking for a faster trick to make a new user appreciate the time it takes to output data.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle