Aggregation performance
Hi All,
Suggestions needed for the following situation.
I have five flat files (each containing 2-5 million rows) in the same format, which I need to insert into a single table after doing an aggregation.
Database: DB2 v7 on AIX
DataStage: v5.2
The question is: what is the most efficient way of doing this? These are the options I can think of:
1. Append the 5 files into 1 file using unix cat, bulk load the result into a staging table, then do the aggregation on the staging table and insert into the target.
2. Append the 5 files and do a Unix sort on the key columns to create a sorted text file. Aggregate from the sorted file and create a text file to bulk load into the table.
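Option 2 could be sketched in the shell as below. This is only an illustration: the file names, the comma-delimited layout, and the assumption that the group-by key is in field 1 and the amount to sum is in field 2 are all hypothetical stand-ins for the real record format.

```shell
# Hypothetical sample inputs: comma-delimited, key in field 1, amount in field 2.
printf 'b,2\na,1\n' > file1.txt
printf 'a,3\n'      > file2.txt
printf 'c,4\n'      > file3.txt
printf 'b,5\n'      > file4.txt
printf 'c,6\n'      > file5.txt

# Append, sort on the key, then do a single streaming aggregation pass:
# awk prints a key's total each time the key changes, so sorted input
# means only one group has to be held in memory at a time.
cat file1.txt file2.txt file3.txt file4.txt file5.txt \
  | sort -t, -k1,1 \
  | awk -F, '
      $1 != key { if (NR > 1) print key "," sum; key = $1; sum = 0 }
      { sum += $2 }
      END { if (NR > 0) print key "," sum }' > aggregated.txt
```

The resulting aggregated.txt would then be the file handed to the bulk loader.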
I also want to know which of the following is faster:
1. Aggregating the unsorted text file, OR
2. Aggregating in the table
Kindly suggest.
Thanks
Rasi
Hi Rasi,
I was facing a similar challenge. Here is my suggestion:
1) Create a job to load your final table, using a job parameter called "TABLENAME" (STRING) for the source flat file name. Set the target to "Insert without clearing" so the five flat files are appended to your table. You can even use the DB2Load facility to get the best performance; in that case, use the "INSERT" load option.
2) Create a "control" job containing (only) the job-control code below:

For counter = 1 To 5
   Begin Case
      Case counter = 1
         TABLE = 'NAME OF FLAT FILE 1'
      Case counter = 2
         TABLE = 'NAME OF FLAT FILE 2'
      Case counter = 3
         TABLE = 'NAME OF FLAT FILE 3'
      Case counter = 4
         TABLE = 'NAME OF FLAT FILE 4'
      Case counter = 5
         TABLE = 'NAME OF FLAT FILE 5'
   End Case
   * Attach the load job, set its TABLENAME parameter, run it,
   * and wait for it to finish before starting the next file.
   hjob1 = DSAttachJob("NAME OF YOUR JOB", DSJ.ERRFATAL)
   ErrCode = DSSetParam(hjob1, "TABLENAME", TABLE)
   ErrCode = DSRunJob(hjob1, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hjob1)
   ErrCode = DSDetachJob(hjob1)
Next counter
3) Modify this code, replacing the "NAME OF YOUR JOB" and "NAME OF FLAT FILE" placeholders with the names that apply to your situation.
4) Run the control job (it will call the main job once per file, loading the five flat files sequentially).
Regarding your last question:
Use DB2 for the aggregation (in my experience it is almost always faster than DataStage's Aggregator)!
Good luck!
I agree with mharkema, but if you upgrade to version 6 with PX (which I think is a good idea given the number of rows you have), my preference is:
- Do a cat in Unix (as you propose)
- Use PX for the aggregation. I prefer this method because it scales better as your system evolves.
Moreover, I don't know your loading window, but with this method you can run the aggregation during the day and load at night.
Pey
Hi,
If you can't get the aggregated data via SQL queries, I think you might want to try (or at least benchmark) aggregating the 5 files in parallel to save time. If they contain aggregate data that should be joined together, you could combine the results with lookups or whatever you need (I don't know the exact situation).
This might prove to give reasonable performance.
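Aggregating the files in parallel and then combining the partial results could be sketched as below. Again, the file names and the comma-delimited layout (key in field 1, amount in field 2) are hypothetical; the point is one background aggregation per file, followed by a single merge pass over the five partial outputs.

```shell
# Hypothetical sample inputs: comma-delimited, key in field 1, amount in field 2.
printf 'a,1\nb,9\n' > file1.txt
printf 'a,2\n'      > file2.txt
printf 'b,3\n'      > file3.txt
printf 'c,4\n'      > file4.txt
printf 'c,5\n'      > file5.txt

# Sort and aggregate one file (same streaming sum-by-key pass as a sorted
# aggregation would do).
aggregate() {
  sort -t, -k1,1 "$1" | awk -F, '
    $1 != key { if (NR > 1) print key "," sum; key = $1; sum = 0 }
    { sum += $2 }
    END { if (NR > 0) print key "," sum }'
}

for f in file1.txt file2.txt file3.txt file4.txt file5.txt; do
  aggregate "$f" > "$f.agg" &    # one aggregation per file, in parallel
done
wait                             # block until all five have finished

# The partial results may share keys, so merge them with one more pass.
cat file?.txt.agg | sort -t, -k1,1 | awk -F, '
  $1 != key { if (NR > 1) print key "," sum; key = $1; sum = 0 }
  { sum += $2 }
  END { if (NR > 0) print key "," sum }' > aggregated.txt
```

Whether the parallel version actually wins depends on how many CPUs and how much I/O bandwidth the box has, which is why benchmarking it first is worthwhile.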
Good Luck,
Roy R.
So what's the difference between Syncsort and Cosort?
If you were just using the standard tools I'd say the bulk load into a database staging table and database aggregation will be faster than the DataStage aggregation stage. I avoid the aggregation stage for large loads due to a nasty error I was getting in an earlier version of DataStage which gave me a nervous tic. If you've got the money and you do a lot of big flat file processing then I'd certainly look at one of the flat file sort tools which should help you increase the performance of your jobs.
Vincent McBurney
Data Integration Services
www.intramatix.com
SyncSort and CoSort are *basically* the same tool from two different vendors. I can't say if one is 'better' or faster, having used neither, but CoSort does have one advantage. From what I recall, they developed a CoSort plugin so that it can be accessed within jobs as well as externally.
Either would certainly be much faster than the Aggregator for large volumes.
-craig
You're on DS 5.2, so you would need to upgrade to DS 6.0 to get the PX version with the nice GUI, otherwise you're writing shell and OSH scripts in the 5.2 version. So rule that suggestion out.
Now you've got to go out and buy SyncSort or CoSort, so rule that out. You need this solution today, right?
You've got DB2, but what kind of horsepower? If you've got 24 cpu's to play with, and only (and I mean only) 10-25 million rows to aggregate, the easiest solution is the database one. Not knowing anything about your aggregation rules, I'll make an easy one up for example.
Let's pretend you have a nice numeric key column in your group by set of columns that could give you a nice even distribution of the data set. You could create your aggregation logic in a view, or a series of layered views off your work table. Then, you could run N number of instantiated DataStage jobs, selecting from the view, using a WHERE MOD(yourMODcolumn, #NumberOfJobInstances#) = #ThisJobInstance# -1. This would run N jobs each firing off a query against your view, subsetting the result set by your nice numeric key column. You have just "instantiated" your query, along with your jobs. The result set falls along nice lines with no cross-over between aggregate groups. By writing the data to a sequential output file with the #ThisJobInstance# in the output file name, you create N number of bulk-load ready insert files. You now can take and load that data in parallel into your target table.
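The MOD partitioning idea can be illustrated outside the database too. The sketch below is a hedged stand-in for the instantiated-jobs technique: a hypothetical comma-delimited work file with a numeric key in field 1, and N=2 "instances" each keeping only its own slice, so the output files are disjoint, cover every row, and can be loaded in parallel.

```shell
# Hypothetical work table extract: comma-delimited, numeric key in field 1.
printf '10,a\n11,b\n12,c\n13,d\n' > work_table.txt

N=2   # number of "job instances"
for i in $(seq 1 "$N"); do
  # Each instance keeps only rows where MOD(key, N) = instance - 1,
  # mirroring WHERE MOD(yourMODcolumn, N) = instance - 1 in the view.
  awk -F, -v n="$N" -v inst="$i" '$1 % n == inst - 1' work_table.txt \
    > "subset_$i.txt" &
done
wait
```

Because the groups fall entirely within one subset, each subset can be aggregated and bulk loaded independently with no cross-over between aggregate groups.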
I do larger tables than yours this way. Next weekend I'll be loading up aggregate tables using this type of technique, where the base fact table is 800 million rows. We turn around aggregate recalcs in 12 hours or less. This type of technique is simple, doesn't require PX, SyncSort, CoSort, etc. Somehow we all managed without those tools. Now, if you're talking about large volumes of denormalized, flat data in sequential files, the sorting tools really shine. If you're doing anything along the lines of joins during the aggregation, they get dicey. They like denormalized files, no matter what the salesman tells you. PX is a lot of baggage, have you seen the aggregation stage? You use the right mouse button just as much as the left, because it's not point and click, it's right mouse windows pulldowns one after the other.
Give the database approach a shot; you'll have it working long before your PO gets approved for more ETL tools. Once you let SyncSort in the door, that sales rep will be trying to sell you their ETL toolset.
Kenneth Bland
Ken once made the comment that Red Brick users are spoiled with their bulk loader. He's right.
Another way that they're spoiled is that the bulk loader can be made to update the aggregate tables automatically! [8D]
(Off the same data set that is used to load the main tables.)
Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
Self-serving message from CoSORT here ... for those who forgot about the "Sort data faster with CoSORT for DataStage" ad above in this very forum, available from Tools4DataStage by the way, plus:
http://www.cosort.com/public/solutions/ ... astage.htm
Not only 8 years ahead of Syncsort on UNIX, CoSORT is faster, has the PlugIn, and has external tools that are easier to use and far less expensive than leasing rehashed mainframe syntax. Plenty of references and free trials for CoSORT's PlugIn and CoSORT's high-level transformation DDL/DML (the sortcl program) are available on all UNIX platforms as well as Windows NT/XP/2K, which has a Java GUI.
Please visit cosort.com or call me at 800-333-SORT, x238 for pricing, which is lower than you likely think, and is for perpetual use.
Thanks! David Friedland, VP, CoSORT/IRI, Inc.