
Posted: Mon Jan 08, 2007 2:41 pm
by chulett
Just so we're all on the same page, if you do this:

Code:

select count(*) from 
(
SELECT A.*, B.*, C.* FROM TABLE_A A, TABLE_B B, TABLE_C C 
WHERE A.XYZ=B.XYZ AND B.MNO=C.MNO 
AND B.EFFDT BETWEEN C.EFFDT AND C.ENDDT
)
I just verified that it doesn't change the original explain plan, the cost, or the cardinality; it just adds a SORT AGGREGATE over the VIEW it creates to get you the count. So either method should be fine, I would think.
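
For anyone who wants to verify the same thing in their own environment, here is a sketch of the check (the statement IDs are arbitrary labels, and DBMS_XPLAN is assumed to be available, i.e. 9iR2 or later):

Code:

-- Explain the raw query and the count(*)-wrapped version under
-- different statement IDs, then display both plans for comparison.
EXPLAIN PLAN SET STATEMENT_ID = 'RAW' FOR
SELECT A.*, B.*, C.* FROM TABLE_A A, TABLE_B B, TABLE_C C
WHERE A.XYZ = B.XYZ AND B.MNO = C.MNO
AND B.EFFDT BETWEEN C.EFFDT AND C.ENDDT;

EXPLAIN PLAN SET STATEMENT_ID = 'WRAPPED' FOR
SELECT COUNT(*) FROM
(
SELECT A.*, B.*, C.* FROM TABLE_A A, TABLE_B B, TABLE_C C
WHERE A.XYZ = B.XYZ AND B.MNO = C.MNO
AND B.EFFDT BETWEEN C.EFFDT AND C.ENDDT
);

-- The only difference should be an extra SORT (AGGREGATE) step.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY('PLAN_TABLE', 'RAW'));
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY('PLAN_TABLE', 'WRAPPED'));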

Posted: Mon Jan 08, 2007 5:42 pm
by kcbland
Your query was probably not complicated enough. I've done this loads of times. You also get more efficient counts if you do SELECT COUNT(1), for the exact same reason.

Posted: Mon Jan 08, 2007 8:53 pm
by rleishman
The SELECT COUNT(1) thing is a myth.

Oracle is getting a bit clever nowadays about the SELECT COUNT(*) FROM (...) wrapper. It can almost always work out which columns will not actually be projected, and then not select them.

As a workaround, in SQL*Plus you can save your SELECT statement in a script, say a.sql, and then run

Code:

REM Fetch every row but suppress terminal display, so the full
REM result set is retrieved without the cost of rendering it.
set pages 0 lines 32767 trimspool on termout off
@a.sql
set termout on
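
Note that termout off only suppresses output from statements run out of a script file (which is why the SELECT lives in a.sql), and if you also want the elapsed time reported (an addition I'm assuming matches the intent here), add set timing on:

Code:

REM Same as above, but with the fetch time reported at the end.
set timing on
set pages 0 lines 32767 trimspool on termout off
@a.sql
set termout on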

Posted: Mon Jan 08, 2007 9:00 pm
by I_Server_Whale
rleishman wrote:The SELECT COUNT(1) thing is a myth.
Many thanks for busting the myth. :) Now, I can confront one of my buddies who still believes in it.

Posted: Mon Jan 08, 2007 11:18 pm
by kcbland
Yeah, we're the old farts who worked through the releases of Oracle where the optimizer was dumber than a stump. :wink: I get enough grief from folks who look at my code and wonder why I do things a certain way; my ready answer is that it's old school.

Remember RBO versus CBO? The optimizer has gotten a lot smarter, as long as statistics are valid and current.
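
For the "valid and current" part, the usual tool is DBMS_STATS; a minimal sketch, with the schema and table names as placeholders:

Code:

REM Gather fresh optimizer statistics so the CBO has current information.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'MY_SCHEMA',   -- placeholder schema
    tabname => 'TABLE_A',     -- placeholder table
    cascade => TRUE);         -- gather index statistics as well
END;
/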


Back to the point of the posting: I may have strayed further off-topic than was helpful, but I was stressing that comparing DataStage's ability to receive data against Toad's ability to display the prefetched data is not a fair comparison. I was also trying to say don't get caught doing a select count(*), because that's faulty as well.

The ORDER BY forces the result set to be collected, which measures the database's ability to get the data ready to spool. After that, you need to determine whether you are network bound sending the data to the receiving DS job, CPU bound in the receiving DS job, or constrained by back-pressure from downstream logic. By timing how long it takes to get rows after ORDERing, you can set the expectation for time to fetch and then deal with time to spool. Unless someone wants to argue about the ORDERing suggestion...
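
A sketch of that ORDER BY trick (the table and column names are hypothetical, and note the caveat that comes up later in this thread: an index can satisfy the ORDER BY without a sort):

Code:

REM Ordering on a non-indexed column usually forces the whole result
REM set to be materialized before the first row comes back, so the
REM time to first row approximates the time to build the full set.
SELECT *
FROM   big_source_table
ORDER  BY some_non_indexed_column;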

Posted: Tue Jan 09, 2007 8:23 am
by chulett
kcbland wrote:This poster's problem is the all-in-one job that new users always write. They never understand that their transformation pipeline runs at the speed of the slowest consuming process. In this case, it's obviously the receiving database putting back-pressure all the way up the pipeline.
I thought this was the point. Everything else was just... tangential. :wink:

Posted: Tue Jan 09, 2007 3:54 pm
by rleishman
I'm sorry, did I miss the meeting where we agreed to stay on-topic and avoid interesting conversations? ;)

I came in late and cannot see some of the posts, so I'm not sure whether we are saying that ORDER BY is a good method or a bad method of ensuring all rows are fetched.

To be clear, it is BAD for two reasons:
1. It will almost always ADD to the runtime, often by orders of magnitude for very large data sets.
2. In one case, it doesn't actually fetch all of the rows. See below: the plans both with and without the ORDER BY are identical, and there is no SORT step because Oracle is using an index to sort the results.

Code:

explain plan set statement_id='    190427' for
select *
from ef_actl_expns
where src_sys > 'J'
order by src_sys, ldgr_id

Explained.

4      SELECT STATEMENT (Optimizer='CHOOSE')
1        TABLE ACCESS (BY GLOBAL INDEX ROWID) of 'EF_ACTL_EXPNS'[EF_ACTL_EXPNS@SEL$1] (1 rows)
1          INDEX (RANGE SCAN) of 'EF_AEXP_PK'[EF_ACTL_EXPNS@SEL$1] (1 rows)

Code:

explain plan set statement_id='    190427' for
select *
from ef_actl_expns
where src_sys > 'J'

Explained.

4      SELECT STATEMENT (Optimizer='CHOOSE')
1        TABLE ACCESS (BY GLOBAL INDEX ROWID) of 'EF_ACTL_EXPNS'[EF_ACTL_EXPNS@SEL$1] (1 rows)
1          INDEX (RANGE SCAN) of 'EF_AEXP_PK'[EF_ACTL_EXPNS@SEL$1] (1 rows)
Perhaps a better way to time data retrieval without the network overhead of fetching all of the rows to the client is to wrap the SQL per Craig's suggestion, but use a MIN or MAX on every projected column:

Code:

SELECT MIN(col1), MIN(col2), ....
FROM (
    SELECT col1, col2 ....
    ..... rest of your query ....
)
Note that you cannot use COUNT. The CBO is clever enough to know that COUNT(NotNullCol) is equivalent to COUNT(*).
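
As a concrete (hypothetical) instance of that wrapper, reusing the table from the plans above and assuming an expns_amt column exists:

Code:

-- MIN on every projected column forces Oracle to evaluate each column
-- of each row, without spooling the rows themselves to the client.
SELECT MIN(src_sys), MIN(ldgr_id), MIN(expns_amt)
FROM (
    SELECT src_sys, ldgr_id, expns_amt
    FROM   ef_actl_expns
    WHERE  src_sys > 'J'
);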

Posted: Tue Jan 09, 2007 4:06 pm
by kcbland
You're missing the commentary where I'm talking about complex queries. Simple queries often don't change their plans, as in your example, but when you have complex queries (including columns derived from subqueries), bypassing the columns can make a profound performance difference.

Posted: Tue Jan 09, 2007 4:46 pm
by rleishman
Agree; you cannot bypass the columns, which is what COUNT(*) does (these days). But ORDER BY is wrong for different reasons.

The method described above (MIN every column) forces Oracle to read every projected column (including user-defined functions and scalar sub-queries) without the additional overhead of writing masses of data to temp segments.

Of course, if the original SQL actually has an ORDER BY, we're in big trouble. The optimizer may or may not choose to honour the nested ORDER-BY. But I digress... ;)

I will now attempt a trick famous across IT forums worldwide - the backflip!

Forget all the other methods! Go to SQL*Plus (I know you've got it, you just don't use it!) and type:

Code:

SET AUTOTRACE TRACEONLY
and now run your original query in its original form. Oracle will do all the work and retrieve all of the data, but none of it will be displayed. Since the SQL is unchanged, the plan will be the same (provided you are using the same session parameters).
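
Putting it together, a session might look like this (a sketch; your_query.sql stands in for your saved, unmodified SQL):

Code:

REM Execute and fetch everything, display nothing, and report the time.
set timing on
set autotrace traceonly
@your_query.sql
set autotrace off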

Now, does anyone have any complaints about that?

Posted: Tue Jan 09, 2007 5:31 pm
by kcbland
No complaints, just a hat tip. The point is to refute measuring first-row-returned performance against what the user could expect when spooling the full result set.

If it takes 1 millisecond to get the first row in Toad, how do you trick Toad into telling you how long it would take to get the last row? If we imagine it takes 0 seconds to output the total result set, the only question is how long the database would take to create it.

I had suggested the ORDER BY as a trick. Maybe rleishman's suggestion gives a better result? Does it return 0 rows? Can we just put that into the BEFORE-SQL tab in the job? We sometimes just put @FALSE into the first Transformer stage output constraint, which forces the database to gather up and send all results across the network. This gives you the top-level spool performance of the network and DataStage CPU speed, but I'm looking for a faster trick to make a new user appreciate the time it takes to output data.