Performance issue while loading from flat file to Oracle table

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

prasad_chennai
Participant
Posts: 38
Joined: Wed Jul 18, 2007 12:23 am
Location: Chennai

Performance issue while loading from flat file to Oracle table

Post by prasad_chennai »

Hello Everyone,

I have a requirement to do an update-else-insert on records read from a flat file and loaded into an Oracle table.

The job flow is as below:
                             Oracle table (lookup)
                                  |
                                  |
                                  |
Flat file (21M records) ---> Transformer ---> Oracle table (update else insert)

My source is a flat file. I do a lookup against another Oracle table based on a condition, validate a special kind of record using stage variables, filter those records out, and then load the rest into the target table with an update-else-insert strategy.

The lookup key is derived like this: a source column holds a value such as SNJSCA110BW010110021LSW02/IOM-1, and I extract the portion up to the "/", i.e. "SNJSCA110BW010110021LSW02", which I use for the lookup. In stage variables I also validate that a particular string is present in that same source column.
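
A rough sketch of what the derivations look like (the link and column names below are only placeholders, not the actual names from the job):

    * Stage variable svLookupKey: portion of the source column before the "/"
    * e.g. "SNJSCA110BW010110021LSW02/IOM-1" -> "SNJSCA110BW010110021LSW02"
    Field(SrcIn.EQUIP_ID, "/", 1)

    * Stage variable svIsValid: check for a particular string in the same column
    * (the string "LSW" is only an illustration of the kind of check being done)
    If Index(SrcIn.EQUIP_ID, "LSW", 1) > 0 Then @TRUE Else @FALSE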

Performance has dropped drastically: the job is processing about 5 rows/sec. I have had indexes created on the target table as well, but it made no difference.

Could you please suggest how I can improve the performance?

Thanks in advance.
Prasad
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Make sure you know where your bottleneck is: the lookup or the target. And your reference lookup should be against a hashed file, not directly against a database table.
-craig

"You can never have too many knives" -- Logan Nine Fingers
prasad_chennai
Participant
Posts: 38
Joined: Wed Jul 18, 2007 12:23 am
Location: Chennai

Post by prasad_chennai »

The performance drops drastically as soon as rows reach the Transformer stage. Could it be something to do with the stage variables?
There is also a slowdown while reading from the file and from the lookup.

Thanks,
Prasad
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, it has everything to do with your reference lookup, I would wager. Each record does a singleton select across the network to your database. How efficient is your lookup SQL? Not nearly as efficient as a hashed file lookup would be. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

prasad_chennai wrote: The performance drops drastically as soon as rows reach the Transformer stage. Could it be something to do with the stage variables?
There is also a slowdown while reading from the file and from the lookup.
Prasad,

There are several possible factors behind the performance decline. As Craig pointed out, the first is the type of lookup you are performing. You do not want to be executing a direct database lookup unless absolutely necessary, as it queries the database for every row (or block of rows). For 21 million rows, that is a lot of network traffic.

How big is your reference table? If possible, create a hashed file from it (consider a 64-bit hashed file only if the data could exceed 2 GB in physical size). Otherwise, if you would like to persist with the database lookup, you could also load the sequential file into an intermediate table on the same database and do the join there.

Try these out and hope our suggestions point you in the right direction :D
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
prasad_chennai
Participant
Posts: 38
Joined: Wed Jul 18, 2007 12:23 am
Location: Chennai

Post by prasad_chennai »

Thanks for the inputs Craig & Vivek.

I am now using a hashed file between the lookup table and the Transformer, but there are many constraints and validations being performed.
The lookup itself is quite fast, around 12,000 rows/sec, but the source is reading at only 600 rows/sec, and after the Transformer the job is writing to the table at 400 rows/sec, even though I have increased the array size to 10000.

At this rate it will still take a long time to load the table.

Thx,
Prasad
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So, if you remove the target OCI stage and replace it with a flat file target, does the job run at 12K rows/sec? Those 'combo actions' can be dog slow for volume loads. Why not prequalify the action and run two links: a dedicated Insert link and a dedicated Update link?
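
If the existence check is done against a hashed file built from the target table's key columns, the two link constraints could look something like this (HashTgt is just a placeholder name for the reference link):

    * Insert link constraint: key not found in the target hashed file
    HashTgt.NOTFOUND

    * Update link constraint: key was found
    Not(HashTgt.NOTFOUND)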
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

prasad_chennai wrote: I am now using a hashed file between the lookup table and the Transformer, but there are many constraints and validations being performed.
The lookup itself is quite fast, around 12,000 rows/sec, but the source is reading at only 600 rows/sec, and after the Transformer the job is writing to the table at 400 rows/sec, even though I have increased the array size to 10000.

At this rate it will still take a long time to load the table.
Good, at least this is far better than the 5-6 rows/sec you were getting. Also, please bear in mind that you are doing "Update else Insert". This will take a lot of time on the database side, as you are talking about handling inserts/updates for 21 million rows.

However, it is good to have two dedicated streams, one for updates and the other for inserts. If your shop has a DBA bulk-load utility written, then you can load the inserts with that utility in a separate job. The issue is always with the updates.
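
As a very rough illustration (the script name, path and arguments below are purely hypothetical; only the DSExecute and DSLogFatal calls are standard DS BASIC), an after-job routine could hand the insert file over to such a utility:

    * Hypothetical after-job routine: pass the insert file to a shop bulk-load script
    Cmd = "/appl/scripts/bulk_load.sh /data/inserts.dat TARGET_TABLE"
    Call DSExecute("UNIX", Cmd, Output, RetCode)
    If RetCode <> 0 Then
       Call DSLogFatal("Bulk load failed: " : Output, "AfterJobBulkLoad")
    End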

Also, one thing you could do is introduce some parallelism into your job. That involves splitting the data stream into however many links you want and doing a "modulo" division inside the Transformer: generate a row id (or use the row number) for each row, take it modulo the number of partitions you want, and route the row to a separate link based on the result, using link constraints.
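
For example, with four output links and @INROWNUM as the row number, the link constraints might look like this (four is an arbitrary choice here):

    * Link 1 constraint
    Mod(@INROWNUM, 4) = 1
    * Link 2 constraint
    Mod(@INROWNUM, 4) = 2
    * Link 3 constraint
    Mod(@INROWNUM, 4) = 3
    * Link 4 constraint
    Mod(@INROWNUM, 4) = 0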

There are multiple ways to speed up the process, but the question is: how much time are you willing to invest in tuning this one job?

Hope this helps...
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

For 21 million rows I would split up the job. Work out which rows are inserts and which are updates, hold them in two different files, and then have two jobs handle the two files independently.
For updates, make sure your keys are indexed and the statistics are up to date.
For inserts, utilize the ORABulk stage.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.