Aggregation performance
Hi All,
Suggestions needed for the following situation.
I have five flat files (each containing 2-5 million rows) in the same format, which I need to insert into a single table after doing an aggregation.
Database: DB2 v7 on AIX
DataStage: v5.2
The question is: what is the most efficient way of doing this? These are the options I can think of:
1. Append the 5 files into 1 file using unix cat, bulk load the result into a staging table, then do the aggregation on the staging table and insert into the target.
2. Append the 5 files and do a Unix sort on the key columns to create a sorted text file. Aggregate from the sorted file and create a text file to bulk load into the table.
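Option 2 could be sketched in the shell as below. This is only an illustration: the file names, the comma-delimited layout, and the assumption that the group-by key is in field 1 and the amount to sum is in field 2 are all hypothetical stand-ins for the real record format.

```shell
# Hypothetical sample inputs: comma-delimited, key in field 1, amount in field 2.
printf 'b,2\na,1\n' > file1.txt
printf 'a,3\n'      > file2.txt
printf 'c,4\n'      > file3.txt
printf 'b,5\n'      > file4.txt
printf 'c,6\n'      > file5.txt

# Append, sort on the key, then do a single streaming aggregation pass:
# awk prints a key's total each time the key changes, so sorted input
# means only one group has to be held in memory at a time.
cat file1.txt file2.txt file3.txt file4.txt file5.txt \
  | sort -t, -k1,1 \
  | awk -F, '
      $1 != key { if (NR > 1) print key "," sum; key = $1; sum = 0 }
      { sum += $2 }
      END { if (NR > 0) print key "," sum }' > aggregated.txt
```

The resulting aggregated.txt would then be the file handed to the bulk loader.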
I also want to know which of the following is faster:
1. Aggregating the unsorted text file, OR
2. Aggregating in the table
Kindly suggest.
Thanks
Rasi
Hi Rasi,
I was facing a similar challenge. Here is my suggestion:
1) Create a job to load your final table, using a job parameter called "TABLENAME" (STRING) for the source flat file name. Set the target to "Insert without clearing" so the five flat files are appended to your table. You can even use the DB2Load facility to get the best performance; in that case, use the "INSERT" load option.
2) Create a "control" job containing (only) the job-control code below:

For counter = 1 To 5
   Begin Case
      Case counter = 1
         TABLE = 'NAME OF FLAT FILE 1'
      Case counter = 2
         TABLE = 'NAME OF FLAT FILE 2'
      Case counter = 3
         TABLE = 'NAME OF FLAT FILE 3'
      Case counter = 4
         TABLE = 'NAME OF FLAT FILE 4'
      Case counter = 5
         TABLE = 'NAME OF FLAT FILE 5'
   End Case
   * Attach the load job, set its TABLENAME parameter, run it,
   * and wait for it to finish before starting the next file.
   hjob1 = DSAttachJob("NAME OF YOUR JOB", DSJ.ERRFATAL)
   ErrCode = DSSetParam(hjob1, "TABLENAME", TABLE)
   ErrCode = DSRunJob(hjob1, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hjob1)
   ErrCode = DSDetachJob(hjob1)
Next counter
3) Modify this code, replacing the "NAME OF YOUR JOB" and "NAME OF FLAT FILE" placeholders with the names that apply to your situation.
4) Run the control job (it will call the main job once per file, loading the five flat files sequentially).
Regarding your last question:
Use DB2 for the aggregation (in my experience it is almost always faster than DataStage's Aggregator)!
Good luck!
I agree with mharkema, but if you upgrade to version 6 with PX (which I think is a good idea given the number of rows you have), my preference is:
- Do a cat in Unix (as you propose)
- Use PX for the aggregation. I prefer this method because it scales better as your system evolves.
Moreover, I don't know your loading window, but with this method you can run the aggregation during the day and load at night.
Pey
Hi,
If you can't get the aggregated data via SQL queries, I think you might want to try (or at least benchmark) aggregating the 5 files in parallel to save time. If they contain aggregate data that should be joined together, you could combine the results with lookups or whatever you need (I don't know the exact situation).
This might prove to give reasonable performance.
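Aggregating the files in parallel and then combining the partial results could be sketched as below. Again, the file names and the comma-delimited layout (key in field 1, amount in field 2) are hypothetical; the point is one background aggregation per file, followed by a single merge pass over the five partial outputs.

```shell
# Hypothetical sample inputs: comma-delimited, key in field 1, amount in field 2.
printf 'a,1\nb,9\n' > file1.txt
printf 'a,2\n'      > file2.txt
printf 'b,3\n'      > file3.txt
printf 'c,4\n'      > file4.txt
printf 'c,5\n'      > file5.txt

# Sort and aggregate one file (same streaming sum-by-key pass as a sorted
# aggregation would do).
aggregate() {
  sort -t, -k1,1 "$1" | awk -F, '
    $1 != key { if (NR > 1) print key "," sum; key = $1; sum = 0 }
    { sum += $2 }
    END { if (NR > 0) print key "," sum }'
}

for f in file1.txt file2.txt file3.txt file4.txt file5.txt; do
  aggregate "$f" > "$f.agg" &    # one aggregation per file, in parallel
done
wait                             # block until all five have finished

# The partial results may share keys, so merge them with one more pass.
cat file?.txt.agg | sort -t, -k1,1 | awk -F, '
  $1 != key { if (NR > 1) print key "," sum; key = $1; sum = 0 }
  { sum += $2 }
  END { if (NR > 0) print key "," sum }' > aggregated.txt
```

Whether the parallel version actually wins depends on how many CPUs and how much I/O bandwidth the box has, which is why benchmarking it first is worthwhile.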
Good Luck,
Roy R.
So what's the difference between Syncsort and Cosort?
If you were just using the standard tools I'd say the bulk load into a database staging table and database aggregation will be faster than the DataStage aggregation stage. I avoid the aggregation stage for large loads due to a nasty error I was getting in an earlier version of DataStage which gave me a nervous tic. If you've got the money and you do a lot of big flat file processing then I'd certainly look at one of the flat file sort tools which should help you increase the performance of your jobs.
Vincent McBurney
Data Integration Services
www.intramatix.com
SyncSort and CoSort are *basically* the same tool from two different vendors. I can't say if one is 'better' or faster, having used neither, but CoSort does have one advantage. From what I recall, they developed a CoSort plugin so that it can be accessed within jobs as well as externally.
Either would certainly be much faster than the Aggregator for large volumes.
-craig
You're on DS 5.2, so you would need to upgrade to DS 6.0 to get the PX version with the nice GUI, otherwise you're writing shell and OSH scripts in the 5.2 version. So rule that suggestion out.
Now you've got to go out and buy SyncSort or CoSort, so rule that out. You need this solution today, right?
You've got DB2, but what kind of horsepower? If you've got 24 cpu's to play with, and only (and I mean only) 10-25 million rows to aggregate, the easiest solution is the database one. Not knowing anything about your aggregation rules, I'll make an easy one up for example.
Let's pretend you have a nice numeric key column in your group by set of columns that could give you a nice even distribution of the data set. You could create your aggregation logic in a view, or a series of layered views off your work table. Then, you could run N number of instantiated DataStage jobs, selecting from the view, using a WHERE MOD(yourMODcolumn, #NumberOfJobInstances#) = #ThisJobInstance# -1. This would run N jobs each firing off a query against your view, subsetting the result set by your nice numeric key column. You have just "instantiated" your query, along with your jobs. The result set falls along nice lines with no cross-over between aggregate groups. By writing the data to a sequential output file with the #ThisJobInstance# in the output file name, you create N number of bulk-load ready insert files. You now can take and load that data in parallel into your target table.
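The MOD partitioning idea can be illustrated outside the database too. The sketch below is a hedged stand-in for the instantiated-jobs technique: a hypothetical comma-delimited work file with a numeric key in field 1, and N=2 "instances" each keeping only its own slice, so the output files are disjoint, cover every row, and can be loaded in parallel.

```shell
# Hypothetical work table extract: comma-delimited, numeric key in field 1.
printf '10,a\n11,b\n12,c\n13,d\n' > work_table.txt

N=2   # number of "job instances"
for i in $(seq 1 "$N"); do
  # Each instance keeps only rows where MOD(key, N) = instance - 1,
  # mirroring WHERE MOD(yourMODcolumn, N) = instance - 1 in the view.
  awk -F, -v n="$N" -v inst="$i" '$1 % n == inst - 1' work_table.txt \
    > "subset_$i.txt" &
done
wait
```

Because the groups fall entirely within one subset, each subset can be aggregated and bulk loaded independently with no cross-over between aggregate groups.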
I do larger tables than yours this way. Next weekend I'll be loading up aggregate tables using this type of technique, where the base fact table is 800 million rows. We turn around aggregate recalcs in 12 hours or less. This type of technique is simple, doesn't require PX, SyncSort, CoSort, etc. Somehow we all managed without those tools. Now, if you're talking about large volumes of denormalized, flat data in sequential files, the sorting tools really shine. If you're doing anything along the lines of joins during the aggregation, they get dicey. They like denormalized files, no matter what the salesman tells you. PX is a lot of baggage, have you seen the aggregation stage? You use the right mouse button just as much as the left, because it's not point and click, it's right mouse windows pulldowns one after the other.
Give the database approach a shot; you'll have it working long before your PO gets approved for more ETL tools. Once you let SyncSort in the door, that sales rep will be trying to sell you their ETL toolset.
Kenneth Bland
Ken once made the comment that Red Brick users are spoiled with their bulk loader. He's right.
Another way that they're spoiled is that the bulk loader can be made to update the aggregate tables automatically! [8D]
(Off the same data set that is used to load the main tables.)
Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
Self-serving message from CoSORT here ... for those who forgot about the "Sort data faster with CoSORT for DataStage" ad above in this very forum, available from Tools4DataStage by the way, plus:
http://www.cosort.com/public/solutions/ ... astage.htm
Not only 8 years ahead of Syncsort on UNIX, CoSORT is faster, has the PlugIn, and has external tools that are easier to use and far less expensive than leasing rehashed mainframe syntax. Plenty of references and free trials for CoSORT's PlugIn and CoSORT's high-level transformation DDL/DML (the sortcl program) are available on all UNIX platforms as well as Windows NT/XP/2K, which has a Java GUI.
Please visit cosort.com or call me at 800-333-SORT, x238 for pricing, which is lower than you likely think, and is for perpetual use.
Thanks! David Friedland, VP, CoSORT/IRI, Inc.