Performance issue while loading from flat file to Oracle table

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

prasad_chennai
Participant
Posts: 38
Joined: Wed Jul 18, 2007 12:23 am
Location: Chennai

Performance issue while loading from flat file to Oracle table

Post by prasad_chennai »

Hello Everyone,

I have a requirement to do an update-else-insert on records read from a flat file and loaded into an Oracle table.

The job flow is as below:
                             Oracle table (lookup)
                                  |
                                  |
                                  |
Flat file (21M records) ---> Transformer ---> Oracle table (update else insert)

My source is a flat file. I do a lookup against another Oracle table based on a condition, validate a special kind of record using stage variables, filter those records out, and then load the rest into the target table with an update-else-insert strategy.

The lookup key is derived like this: a source column holds a value such as SNJSCA110BW010110021LSW02/IOM-1, and I extract the portion up to the "/", i.e. "SNJSCA110BW010110021LSW02", which I use for the lookup. In stage variables I also validate that a particular string is present in that same source column.
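
A rough sketch of what the derivations look like (the link and column names below are only placeholders, not the actual names from the job):

    * Stage variable svLookupKey: portion of the source column before the "/"
    * e.g. "SNJSCA110BW010110021LSW02/IOM-1" -> "SNJSCA110BW010110021LSW02"
    Field(SrcIn.EQUIP_ID, "/", 1)

    * Stage variable svIsValid: check for a particular string in the same column
    * (the string "LSW" is only an illustration of the kind of check being done)
    If Index(SrcIn.EQUIP_ID, "LSW", 1) > 0 Then @TRUE Else @FALSE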

Performance has dropped drastically: the job is processing about 5 rows/sec. I have had indexes created on the target table as well, but it made no difference.

Could you please suggest how I can improve the performance?

Thanks in advance.
Prasad
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Make sure you know where your bottleneck is: the lookup or the target. And your reference lookup should be against a hashed file, not directly against a database table.
-craig

"You can never have too many knives" -- Logan Nine Fingers
prasad_chennai
Participant
Posts: 38
Joined: Wed Jul 18, 2007 12:23 am
Location: Chennai

Post by prasad_chennai »

The performance drops drastically as soon as rows reach the Transformer stage. Could it be something to do with the stage variables?
There is also a slowdown while reading from the file and from the lookup.

Thanks,
Prasad
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, it has everything to do with your reference lookup, I would wager. Each record does a singleton select across the network to your database. How efficient is your lookup SQL? Not nearly as efficient as a hashed file lookup would be. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

prasad_chennai wrote: The performance drops drastically as soon as rows reach the Transformer stage. Could it be something to do with the stage variables?
There is also a slowdown while reading from the file and from the lookup.
Prasad,

There are several possible factors behind the performance decline. As Craig pointed out, the first is the type of lookup you are performing. You do not want to be executing a direct database lookup unless absolutely necessary, as it queries the database for every row (or block of rows). For 21 million rows, that is a lot of network traffic.

How big is your reference table? If possible, create a hashed file from it (consider a 64-bit hashed file only if the data could exceed 2 GB in physical size). Otherwise, if you would like to persist with the database lookup, you could also load the sequential file into an intermediate table on the same database and do the join there.

Try these out and hope our suggestions point you in the right direction :D
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
prasad_chennai
Participant
Posts: 38
Joined: Wed Jul 18, 2007 12:23 am
Location: Chennai

Post by prasad_chennai »

Thanks for the inputs Craig & Vivek.

I am now using a hashed file between the lookup table and the Transformer, but there are many constraints and validations being performed.
The lookup itself is quite fast, around 12,000 rows/sec, but the source is reading at only 600 rows/sec, and after the Transformer the job is writing to the table at 400 rows/sec, even though I have increased the array size to 10000.

At this rate it will still take a long time to load the table.

Thx,
Prasad
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So, if you remove the target OCI stage and replace it with a flat file target, does the job run at 12K rows/sec? Those 'combo actions' can be dog slow for volume loads. Why not prequalify the action and run two links: a dedicated Insert link and a dedicated Update link?
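
If the existence check is done against a hashed file built from the target table's key columns, the two link constraints could look something like this (HashTgt is just a placeholder name for the reference link):

    * Insert link constraint: key not found in the target hashed file
    HashTgt.NOTFOUND

    * Update link constraint: key was found
    Not(HashTgt.NOTFOUND)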
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

prasad_chennai wrote: I am now using a hashed file between the lookup table and the Transformer, but there are many constraints and validations being performed.
The lookup itself is quite fast, around 12,000 rows/sec, but the source is reading at only 600 rows/sec, and after the Transformer the job is writing to the table at 400 rows/sec, even though I have increased the array size to 10000.

At this rate it will still take a long time to load the table.
Good, at least this is far better than the 5-6 rows/sec you were getting. Also, please bear in mind that you are doing "Update else Insert". This will take a lot of time on the database side, as you are talking about handling inserts/updates for 21 million rows.

However, it is good to have two dedicated streams, one for updates and the other for inserts. If your shop has a DBA bulk-load utility written, then you can load the inserts with that utility in a separate job. The issue is always with the updates.
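
As a very rough illustration (the script name, path and arguments below are purely hypothetical; only the DSExecute and DSLogFatal calls are standard DS BASIC), an after-job routine could hand the insert file over to such a utility:

    * Hypothetical after-job routine: pass the insert file to a shop bulk-load script
    Cmd = "/appl/scripts/bulk_load.sh /data/inserts.dat TARGET_TABLE"
    Call DSExecute("UNIX", Cmd, Output, RetCode)
    If RetCode <> 0 Then
       Call DSLogFatal("Bulk load failed: " : Output, "AfterJobBulkLoad")
    End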

Also, one thing you could do is introduce some parallelism into your job. That involves splitting the data stream into however many links you want and doing a "modulo" division inside the Transformer: generate a row id (or use the row number) for each row, take it modulo the number of partitions you want, and route the row to a separate link based on the result, using link constraints.
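
For example, with four output links and @INROWNUM as the row number, the link constraints might look like this (four is an arbitrary choice here):

    * Link 1 constraint
    Mod(@INROWNUM, 4) = 1
    * Link 2 constraint
    Mod(@INROWNUM, 4) = 2
    * Link 3 constraint
    Mod(@INROWNUM, 4) = 3
    * Link 4 constraint
    Mod(@INROWNUM, 4) = 0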

There are multiple ways to speed up the process, but the question is: how much time are you willing to invest in tuning this one job?

Hope this helps...
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

For 21 million rows I would split up the job. Work out which rows are inserts and which are updates, hold them in two different files, and then have two jobs handle the two files independently.
For updates, make sure your keys are indexed and the statistics are up to date.
For inserts, utilize the ORABulk stage.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.