
Aggregating

Posted: Mon Mar 03, 2008 1:05 am
by U
Can I please have a best-practice suggestion for this scenario? We have to perform an inner join between two Oracle 9i tables, one having 50 million rows, the other having only a few thousand, and then group by four columns and compute the sum, minimum, or count of six other columns.
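For concreteness, the query being described might look something like the following sketch. All table and column names here are invented for illustration; the real schema will differ.

```sql
-- Hypothetical illustration of the described aggregation:
-- big_fact (~50M rows) and small_dim (a few thousand rows) are invented names.
SELECT f.col_a,
       f.col_b,
       f.col_c,
       d.col_d,                       -- the one grouping column from the smaller table
       SUM(f.amount_1)    AS total_1,
       SUM(f.amount_2)    AS total_2,
       MIN(f.amount_3)    AS min_3,
       MIN(f.amount_4)    AS min_4,
       COUNT(f.amount_5)  AS count_5,
       COUNT(f.amount_6)  AS count_6
FROM   big_fact f
       INNER JOIN small_dim d ON d.dim_key = f.dim_key
GROUP  BY f.col_a, f.col_b, f.col_c, d.col_d;
```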

Some colleagues are suggesting that this would blow out the tablespace in Oracle, and that we should instead use ORDER BY in Oracle and an Aggregator stage to effect the grouping.

One of the grouping columns is from the smaller table, the others are from the larger table. Another possibility, therefore, would be to sort just on the main table columns, and to advise the Aggregator stage that just those columns are sorted.

Would you do the grouping in the Oracle database, or would you sort in the database and do the grouping in DataStage (either method)?

And could someone please give guidance on whether a GROUP BY on four columns in Oracle 9i will consume more or less memory than an ORDER BY on the same four columns, when the columns come from two tables?

Thank you for your time, wisdom and experience.

Posted: Mon Mar 03, 2008 3:21 am
by WoMaWil
If it is an inner join, you have to at least join the two tables in the database, so why not sort and aggregate them there? That is more performant than doing it outside. If the task is too big for the database (and/or DataStage), there might be a way to solve it in steps. Sometimes programming is also a kind of trial and error to find the best performance.

Re: Aggregating

Posted: Mon Mar 03, 2008 8:01 am
by chulett
U wrote: "Some colleagues are suggesting that this would blow tablespace in Oracle"
That depends entirely on your SQL and its explain plan, i.e. how Oracle decides to implement the table join. Drive from the small table and use nested loops against the larger one and you should be fine. If you need to use hints to help it along, do so.
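As a sketch, hints that nudge the optimizer toward driving from the small table with nested loops might look like this (the table names and aliases are hypothetical):

```sql
-- Hypothetical: LEADING(d) drives the join from the small table;
-- USE_NL(f) requests a nested-loops join into the large table.
SELECT /*+ LEADING(d) USE_NL(f) */
       f.col_a,
       d.col_d,
       SUM(f.amount_1) AS total_1
FROM   small_dim d
       INNER JOIN big_fact f ON f.dim_key = d.dim_key
GROUP  BY f.col_a, d.col_d;
```

For a nested-loops plan like this to be cheap, the join column on the large table (`f.dim_key` here) would normally need a suitable index.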

Posted: Mon Mar 03, 2008 8:38 am
by kcbland
You need to weigh the benefits of keeping the work within the database. If those 50 million rows will collapse to 50 rows upon summarization, you probably want to keep the work within the database.

The rationale is that it may be far more expensive to extract 50 million rows and send them across a network (even within the same server) than to send 50 rows.

Since we have no idea how complicated the SQL will be for your summarization, it may be that it's simpler and faster within DS, even given the massive extract of 50 million rows.

Now, if DataStage is Server Edition, you need to weigh the power of the DS server against that of the database server. If it's the parallel query edition of Oracle, you may find that your SQL, if properly written using temporary work tables and parallel DML, can build your results efficiently and quickly. However, to process 50 million rows more effectively in DataStage you will need to devise a partitioned-parallelism approach in your job design, using multi-instance transformation jobs as well as a multi-instance extraction job to get the data out of the database more quickly.
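The temporary-work-table-plus-parallel-DML approach mentioned above might be sketched as follows (all object names are invented; the degree of parallelism is an assumption you would tune for your hardware):

```sql
-- Hypothetical: enable parallel DML for the session, then build the
-- summary into a work table with a direct-path (APPEND) parallel insert.
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(work_summary, 4) */ INTO work_summary
SELECT /*+ PARALLEL(f, 4) */
       f.col_a, f.col_b, f.col_c, d.col_d,
       SUM(f.amount_1), MIN(f.amount_3), COUNT(f.amount_5)
FROM   big_fact f
       INNER JOIN small_dim d ON d.dim_key = f.dim_key
GROUP  BY f.col_a, f.col_b, f.col_c, d.col_d;

COMMIT;   -- parallel DML must be committed before the table is re-queried
```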

Your question seems more like a datamart build process. We have different methods for job design standards depending on the type of ETL work we are doing. Source system to staging database tends to be more "dump and pump", meaning high-performance extract and bulk load into a staging database. Staging to EDW is more complicated and involves heavy transformation, surrogate key management, SCD etc.

But EDW to mart is often easily accomplished using SQL scripts since most of this work is deploying copies of dimensions to marts, collapsing snow-flaked dimensions into super-wide single dimensions (an Aggregate Dimension), and maintaining fact based aggregates. SQL tends to be the easiest solution for this.

Posted: Mon Mar 03, 2008 3:10 pm
by ray.wurlod
We have a similar job where I am currently working. The explain plan suggests that grouping alone has a lower cost than sorting alone, provided the join between the two tables is specified identically in both cases. (Our smaller table has 20,000 rows, not "just a few thousand"; our larger table has about six million rows.)

Posted: Mon Mar 03, 2008 10:13 pm
by U
Thank you for your replies. We had our Oracle DBA look at the explain output, and it showed that grouping had a lower cost than sorting, based on our data. The only reason we would have needed to sort would have been to allow the Aggregator stage to process the volume of data in question. So we are planning to perform the grouping in the database.