Big file processing with Aggregator stage - new big problem!

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

brunix
Premium Member
Posts: 25
Joined: Fri Aug 26, 2005 3:29 am

Big file processing with Aggregator stage - new big problem!

Post by brunix »

Hi,
We need some suggestions about this problem:

We have to process 30,000 sequential files daily, containing about 300,000,000 records in total.

We have to load all the records into one Oracle table.
Into another table we have to load the same records grouped by two fields.
The aggregated records will number about 50,000,000.

What is the best approach to implement this process?

Can the Aggregator stage manage 300,000,000 records in a few hours?

The DataStage server is installed on an AIX multiprocessor server.
For example, loading a hashed file from a sequential file runs at about 20,000 rows/sec.

Any suggestion would be welcome,
Brunix
Last edited by brunix on Wed Jul 02, 2008 4:44 am, edited 1 time in total.
Cr.Cezon
Participant
Posts: 101
Joined: Mon Mar 05, 2007 4:59 am
Location: Madrid

Post by Cr.Cezon »

A solution could be to group the 30,000 files into n files of a reasonable size, and then load them into Oracle.

1. To group the files you can use the UNIX command cat file* > file_grouped

2. I think it could be better to group in the database than in DataStage, so first load into the detail table and afterwards do:
insert into table2 select columnkey1, columnkey2 from table
group by columnkey1, columnkey2
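
A slightly fuller sketch of that statement, purely as an illustration: the table names (DETAIL_TABLE, AGG_TABLE) and the measure column AMOUNT are hypothetical placeholders, since an aggregate table would normally carry summed or counted measures alongside the two grouping keys:

insert into AGG_TABLE (columnkey1, columnkey2, total_amount, row_count)
select columnkey1,
       columnkey2,
       sum(amount),   -- hypothetical additive measure from the detail rows
       count(*)       -- number of detail rows per group
from   DETAIL_TABLE
group  by columnkey1, columnkey2;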

regards,
Cristina.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Define 'load'. Are there any transformations involved? All inserts? Insert / update / delete?

As noted, I'd probably opt for letting the database perform the aggregation for the second load.
-craig

"You can never have too many knives" -- Logan Nine Fingers
brunix
Premium Member
Posts: 25
Joined: Fri Aug 26, 2005 3:29 am

Post by brunix »

chulett wrote:Define 'load'. ....
Hi Chulett,
By 'load' I mean inserting 300 million records every day into a detail Oracle table, TABLE_1 (this table grows by 300 million rows daily).
In this load process we only have a lookup against a hashed file of 200,000 records to decode one field on each input record.
We built a prototype and the job runs at about 6,000 rows/sec, so loading all 300 million rows takes roughly 13-14 hours (300,000,000 / 6,000 ≈ 50,000 seconds).

Up to this step there is no problem; instead of one file of 300 million records we can process 6 files of 50 million or 12 files of 25 million...

The big issue is maintaining the aggregate TABLE_2 derived from the detail TABLE_1.

This aggregate table has around 50 million rows, and every day, when new records are inserted into TABLE_1, TABLE_2 must be updated with the new grouped values.

Here is our prototype:

SOURCE FILE ====> Transformer with lookup ==first link==> aggregate records into a hashed or sequential file; ==second link==> insert into TABLE_1

After this job we have a hashed/sequential file of 50 million records, and we can use it as the source for a job that inserts or updates rows in TABLE_2.
I have also tried the 'update existing or insert new rows' option, but the performance was very bad.

Sorry for my English,
Brunix
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If there are only fully-additive facts (sum, count, min or max) in the aggregate table you can maintain it while inserting rows into the detail table (that is, from the same DataStage job or using some mechanism - such as a trigger or a stored procedure - within Oracle).
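
One way this could look in Oracle, as a minimal sketch with hypothetical names throughout (TABLE_2 keyed on KEY1/KEY2 with additive measures TOTAL_AMOUNT and ROW_COUNT, a detail measure AMOUNT, and a LOAD_DATE column to isolate the day's rows); a single set-based MERGE folds the day's detail into the aggregate:

merge into TABLE_2 t
using (
    select key1, key2,
           sum(amount) as day_amount,   -- hypothetical additive measure
           count(*)    as day_count     -- rows loaded today per group
    from   TABLE_1
    where  load_date = trunc(sysdate)   -- hypothetical filter for today's load
    group  by key1, key2
) d
on (t.key1 = d.key1 and t.key2 = d.key2)
when matched then update set
    t.total_amount = t.total_amount + d.day_amount,
    t.row_count    = t.row_count    + d.day_count
when not matched then insert (key1, key2, total_amount, row_count)
    values (d.key1, d.key2, d.day_amount, d.day_count);

Done this way the aggregate is maintained in one pass over the new detail rows, which is usually far cheaper than a row-by-row 'update existing or insert new rows' action from DataStage.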
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
brunix
Premium Member
Posts: 25
Joined: Fri Aug 26, 2005 3:29 am

Post by brunix »

Hi all,
I have a big problem with my aggregators :D .

Since I opened this post, something has changed.

Now my job looks like this:

[img=http://img397.imageshack.us/img397/5893/nuovoimmaginebitmapul8.th.jpg]

I have 10 sorted sequential files feeding 10 Aggregators, and the output of each Aggregator inserts directly into the same table.

The problem is that, even though I sorted all the sequential files, the row counts on the input links to the Aggregators keep increasing while the output links to the table stay at zero.

After 10-12 million rows the job aborts with the anonymous message "Abnormal termination of stage Job0303G_Aggr_Voice..Aggr02 detected".
Aggr02 is the second of the 10 Aggregators.

If the sequential input is sorted, why doesn't the Aggregator send anything to its output link?

Please, I need help quickly, because we have to put this job into the production environment ASAP.

Thanks,
Brunix
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

In the Aggregator, did you specify the "Sort" and "Sort Order" properties for any of your columns? If you didn't, then the Aggregator stage has no way of knowing that the data is sorted and that it does not have to hold all the data in memory to detect group changes.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Right. You need to ensure that the Aggregator knows your input is sorted or it will 'sort' it itself. Again. And go boom when the volume is too high to hold all in memory. Never mind the fact that your job design will run all ten segments at the same time, compounding any memory issue.
-craig

"You can never have too many knives" -- Logan Nine Fingers
brunix
Premium Member
Posts: 25
Joined: Fri Aug 26, 2005 3:29 am

Post by brunix »

Yes,
I specified the sort fields and the sort order.
One thing, though: apart from the sort field and sort order, I marked every field as a key.
Could that be the problem?
I am trying again now, removing the key specification...
I will reply with the results in a few minutes...
Thanks,
Brunix
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You have to sort the data in a manner that actually supports the grouping being done or it will be ignored. And the stage will bust you if it finds you've lied about the order and abort with the ever appreciated 'row out of sequence' error. :wink:

You can tell you've got the sorting correct when rows flow into and out of the Aggregator stage as the job runs. If they go in but don't come out until everyone is in there, then you've missed something.
-craig

"You can never have too many knives" -- Logan Nine Fingers
brunix
Premium Member
Posts: 25
Joined: Fri Aug 26, 2005 3:29 am

Post by brunix »

:cry:
Dear friends,
bad news...
As you can see in the image below

[img=http://img365.imageshack.us/img365/6261/nuovoimmaginebitmapun7.th.jpg]

The job aborted again.
This time it is Aggr_01 that goes wrong.
The very strange thing is that after the abort the other nine Aggregators run through to the end, inserting rows into the table!
How is that possible?
I don't know what the solution could be.
Do you think that splitting this job into two jobs of 5 Aggregators each would resolve it?
Please, I trust in you, help me!
Brunix
brunix
Premium Member
Posts: 25
Joined: Fri Aug 26, 2005 3:29 am

Post by brunix »

Hi all,
I have resolved the problem.
I noticed that the Aggregator must carry at least the same fields on its output as on its input.
In my job I had (intentionally) left an unused field off the Aggregator's output link.
That is what caused the error.

Now I have put a Transformer between the Aggregator and the OCI stage, so the Aggregator has the same fields on its input and output links, and in the Transformer I simply don't map the unused field.

Everything now runs well and quickly.

Thanks to all,
Brunix