Aggregating

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

U
Participant
Posts: 230
Joined: Tue Apr 17, 2007 8:23 pm
Location: Singapore

Aggregating

Post by U »

Can I please have a best-practice suggestion on this scenario? We have to perform an inner join between two Oracle 9i tables, one having 50 million rows and the other only a few thousand, and then group by four columns and form the sum, minimum, or count of six other columns.
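As a sketch of the in-database option, the grouping described above might look like the following. All table, column, and join-key names are hypothetical, since the actual schema is not given; the aggregate list is a representative SUM / MIN / COUNT mix.

```sql
-- Hypothetical schema: big_t (~50 million rows), small_t (a few thousand).
-- Three grouping columns from the large table, one from the small table,
-- plus six aggregated columns.
SELECT b.grp1,
       b.grp2,
       b.grp3,
       s.grp4,
       SUM(b.amount1)    AS total_amount1,
       SUM(b.amount2)    AS total_amount2,
       MIN(b.event_date) AS first_event,
       MIN(b.price)      AS min_price,
       COUNT(b.txn_id)   AS txn_count,
       COUNT(b.ref_no)   AS ref_count
FROM   big_t b
       INNER JOIN small_t s ON s.key_col = b.key_col
GROUP  BY b.grp1, b.grp2, b.grp3, s.grp4;
```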

Some colleagues are suggesting that this would blow tablespace in Oracle and that we should use ORDER BY instead in Oracle, and use an Aggregator stage to effect the grouping.

One of the grouping columns is from the smaller table, the others are from the larger table. Another possibility, therefore, would be to sort just on the main table columns, and to advise the Aggregator stage that just those columns are sorted.
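The partially sorted variant could be sketched like this (same hypothetical names as above). The large table's three grouping columns carry the ORDER BY, and the Aggregator stage would then be told that only those columns arrive sorted, so it only needs to hold groups open across the values of the unsorted fourth column.

```sql
-- Sort only on the large-table grouping columns; s.grp4 stays unsorted
-- and is handled by the Aggregator stage's in-memory grouping.
SELECT b.grp1, b.grp2, b.grp3, s.grp4,
       b.amount1, b.amount2, b.event_date,
       b.price, b.txn_id, b.ref_no
FROM   big_t b
       INNER JOIN small_t s ON s.key_col = b.key_col
ORDER  BY b.grp1, b.grp2, b.grp3;
```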

Would you do the grouping in the Oracle database, or would you sort in the database and do the grouping in DataStage (either method)?

And could someone please give guidance on whether grouping by four columns in Oracle 9i will consume more or less memory than ordering by the same four columns, when the columns come from two tables?

Thank you for your time, wisdom and experience.
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

If it is an inner join, you have to join the two tables in the database anyway, so why not sort and aggregate them there as well? That is more performant than doing it outside. If the task is too big for the database (and/or DataStage), there may be a way to solve it in steps. Sometimes programming is also a kind of trial and error to find the best performance.
Wolfgang Hürter
Amsterdam
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Aggregating

Post by chulett »

U wrote: Some colleagues are suggesting that this would blow tablespace in Oracle
That totally depends on your SQL / explain plan, i.e. how Oracle decides to implement the table join. Drive from the small table and use nested loops against the larger one and you should be fine. If you need to use hints to help it out, do so.
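A sketch of the kind of hinted SQL this suggests, driving from the small table with nested loops. ORDERED and USE_NL are real Oracle 9i hints, but the table, column, and index names are hypothetical:

```sql
-- ORDERED makes Oracle join in FROM-clause order, so the small table
-- drives the join; USE_NL(b) requests nested-loops probes into the
-- large table (ideally via an index on b.key_col), which avoids the
-- large sort/hash work areas that could exhaust temp tablespace.
SELECT /*+ ORDERED USE_NL(b) INDEX(b big_t_key_ix) */
       b.grp1, b.grp2, b.grp3, s.grp4,
       SUM(b.amount1) AS total_amount1
FROM   small_t s,
       big_t   b
WHERE  b.key_col = s.key_col
GROUP  BY b.grp1, b.grp2, b.grp3, s.grp4;
```

Whether nested loops actually win here depends on how selective the join is; if most of the 50 million rows qualify, the optimizer may rightly prefer a hash join.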
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You need to weigh the benefits of keeping the work within the database. If those 50 million rows will collapse to 50 rows upon summarization, you probably want to keep the work within the database.

The rationale is that it may be more expensive to extract 50 million rows and send them across a network (even if within the same server) than to send 50 rows.

Since we have no idea how complicated the SQL will be for your summarization, it may be that it's simpler and faster within DS, even given the massive extract of 50 million rows.

Now, if DataStage is Server Edition, you need to weigh the power of the DS server against that of the database server. If it's the parallel query option of Oracle, you may find that your SQL, if properly written using temporary work tables and parallel DML, may be able to build your results efficiently and quickly. However, in order to process 50 million rows more effectively you will need to devise a partitioned-parallelism approach in your job design using multi-instance transformation jobs, as well as a multi-instance extraction job to get the data out of the database more quickly.
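The temporary-work-table plus parallel-DML idea might be sketched like this (hypothetical names; the degree of parallelism 8 is purely illustrative and should match the server's capacity):

```sql
-- Allow parallel DML in this session (required before parallel inserts).
ALTER SESSION ENABLE PARALLEL DML;

-- Build the summary as a work table in one parallel, non-logged pass.
CREATE TABLE summary_work
PARALLEL 8
NOLOGGING
AS
SELECT /*+ PARALLEL(b, 8) */
       b.grp1, b.grp2, b.grp3, s.grp4,
       SUM(b.amount1)    AS total_amount1,
       MIN(b.event_date) AS first_event,
       COUNT(b.txn_id)   AS txn_count
FROM   big_t b,
       small_t s
WHERE  s.key_col = b.key_col
GROUP  BY b.grp1, b.grp2, b.grp3, s.grp4;
```

DataStage would then extract from the much smaller summary_work table rather than from the 50 million-row source.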

Your question seems more like a data mart build process. We have different methods for job design standards depending on the type of ETL work we are doing. Source system to staging database tends to be more "dump and pump", meaning high-performance extract and bulk load into a staging database. Staging to EDW is more complicated and involves heavy transformation, surrogate key management, slowly changing dimension (SCD) handling, etc.

But EDW to mart is often easily accomplished using SQL scripts, since most of this work is deploying copies of dimensions to marts, collapsing snowflaked dimensions into super-wide single dimensions (an aggregate dimension), and maintaining fact-based aggregates. SQL tends to be the easiest solution for this.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

We have a similar job where I am currently working. The explain plan suggests that grouping alone has a lower cost than sorting alone, assuming the join between the two tables is specified identically in both statements. (Our smaller table has 20,000 rows, not "just a few thousand"; our larger table has about six million rows.)
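For anyone wanting to run the same comparison, the two plans can be costed side by side with EXPLAIN PLAN (DBMS_XPLAN.DISPLAY is available from Oracle 9i Release 2; query names are hypothetical):

```sql
-- Cost the GROUP BY shape.
EXPLAIN PLAN SET STATEMENT_ID = 'grp' FOR
SELECT b.grp1, b.grp2, b.grp3, s.grp4, SUM(b.amount1)
FROM   big_t b, small_t s
WHERE  s.key_col = b.key_col
GROUP  BY b.grp1, b.grp2, b.grp3, s.grp4;

-- Cost the ORDER BY shape with an identical join specification.
EXPLAIN PLAN SET STATEMENT_ID = 'srt' FOR
SELECT b.grp1, b.grp2, b.grp3, s.grp4, b.amount1
FROM   big_t b, small_t s
WHERE  s.key_col = b.key_col
ORDER  BY b.grp1, b.grp2, b.grp3, s.grp4;

-- Show the most recent plan; repeat after each EXPLAIN PLAN,
-- or query PLAN_TABLE by STATEMENT_ID to compare both.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());
```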
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
U
Participant
Posts: 230
Joined: Tue Apr 17, 2007 8:23 pm
Location: Singapore

Post by U »

Thank you for your replies. We had our Oracle DBA look at the explain output, and it was found that grouping had a lower cost than sorting, based on our data. The only reason we would have needed to sort would have been to allow the Aggregator stage to process the volume of data in question. So we plan to perform the grouping in the database.