how to improve performance of server jobs

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

how to improve performance of server jobs

Post by jreddy »

Guys,

I haven't worked with jobs that populate more than a couple hundred thousand records in the past, so forgive the naive question: how long should a simple server job take that reads from a view (the view joins 4 tables, but when I run it in Toad it returns results in 7 minutes), returns about 2.8 million rows, goes through a transformer (where I also assign a surrogate key using an Oracle sequence), and populates the target table?

It's taking about 8 hours to run. I have done the following: set the Oracle 'rows per transaction' to 50,000 and the array size to 1,000. What else could I do to improve its performance (in reading from or writing to the database)?

I also had another job that had to load a 23 GB flat file; it took about 18 hours. Is this normal? :?

Please share your statistics with me. Thanks.
raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Re: how to improve performance of server jobs

Post by raju_chvr »

There is no fixed time a server job takes to run. It depends on many variables: your network, whether the DB server is doing other tasks at that moment, and how many other jobs you are running on the DS machine at the same time.

In your first scenario, I would suggest doing the join in DS: use OCI stages for the look-ups and dump the rows into hashed files before sending the look-up data into the Transformer. Write the output to a flat file. Then use ORABULK to load the data into Oracle instead of the regular ORAOCI or ODBC stages.
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

Yeah, actually there is not much happening on the database when the job runs, usually because it's after hours; for the same reason network traffic probably doesn't play much of a part. But about replacing ORAOCI with ORABULK, let me ask this: I am doing a lot of user-defined SQL on the target end. Can I still do that with ORABULK?

Also, I put the table joins in the view because each table has about a million rows on average and I didn't want to do a lookup in the DataStage job, given the massive table sizes; I ruled out the hashed-file option for the same reason. All I am really doing in that view is a join on 2 tables; the other 2 are only there for an EXISTS check, so I don't need stages for them in the DS job.

However, as I said earlier, the view returns data in TOAD in just 7 minutes, so I wouldn't believe it could be the culprit for the enormous time the job takes.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: how to improve performance of server jobs

Post by chulett »

jreddy wrote: when I run it in Toad it returns results in 7 minutes
Just an FYI, but TOAD timings can be a little misleading depending on exactly what you mean by 'run this view'. If you are doing a simple 'select *' from the view in the SQL Editor, the time it takes to return the initial data set in TOAD is not equivalent to extracting all of the rows. In essence you are getting just the first 'cursor full' of records when TOAD starts to display data. To fully time it, you'd need to drag the thumb/slider all the way to the bottom of the data grid and then wait for the last row to show up.

Also note that setting your array size too big can negatively impact performance. How much data is in an average record, byte-wise? Multiply that by 1,000 and you'll get the number of bytes you are trying to suck down your connection all at one time. I've yet to see any rule of thumb for setting this value; what I've done is experiment with various settings to see what gets me the best throughput.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Re: how to improve performance of server jobs

Post by kcbland »

jreddy wrote: how long should a simple server job take that reads from a view (the view joins 4 tables, but when I run it in Toad it returns results in 7 minutes)
This is invalid. Just running a query in Toad is not the same as spooling the data. If you do a "select count(*) from view" query, the database will optimize and cheat and not work through the data; it will do index scans and other such tricks to avoid doing a lot of work just to give you the count. In addition, you did not go through the motions of actually spooling the data across a network connection, which introduces runtime lag as well. To correctly benchmark this operation, you should have a DataStage job spool the data directly to a sequential text file, without any reference lookups or transformation. That will accurately tell you how long it takes DataStage to gather the data from the view.

In addition, not knowing your view SQL, you could just be looking at the first fetched set of result data. If there are points in the dataset that you haven't reached yet, you won't see that the view has performance issues. NEVER base a benchmark on the return of the first row; it's the LAST row that counts.
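If you also want a quick database-side check (just a sketch, and the view name below is made up), you can force Oracle to materialize every row with a CTAS in SQL*Plus and time that:

-- times the FULL result set, not just the first screenful Toad shows
-- (view and temp table names are hypothetical)
SET TIMING ON
CREATE TABLE view_bench_tmp AS SELECT * FROM my_view;
DROP TABLE view_bench_tmp;

The CTAS adds some write time of its own and leaves out the network transfer entirely, so treat it as a rough check only; the sequential-file job above is still the benchmark that counts.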

jreddy wrote: returns about 2.8 million rows, goes through a transformer (where I also assign a surrogate key using an Oracle sequence), and populates the target table.
This is unfortunate. You can probably eliminate this OCI lookup for every single row by simply getting the maximum surrogate key already assigned and passing it in as a job parameter. Then keep incrementing a stage variable that is initialized to that job parameter, and you reduce the load on the job. You should also just generate a pure insert file and then bulk load it.
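Roughly, the idea is this (the table, column, and variable names below are only placeholders):

-- run once before the job starts and pass the result in as a job parameter, e.g. StartKey
SELECT NVL(MAX(id), 0) AS start_key
FROM   target_table;

-- inside the job, a stage variable initialized to StartKey and derived as
-- svKey + 1 then replaces the per-row nextval-from-dual lookup

This only works if nothing else inserts into the table while your load runs; otherwise you are back to needing the sequence.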
jreddy wrote: It's taking about 8 hours to run. I have done the following: set the Oracle 'rows per transaction' to 50,000 and the array size to 1,000. What else could I do to improve its performance (in reading from or writing to the database)?
Fiddling with knobs won't cut it. You have a fundamental design choice to make: do you stay with your current design, or do you break it into smaller, tunable, modular components and use bulk loading?
jreddy wrote: I also had another job that had to load a 23 GB flat file; it took about 18 hours. Is this normal?
Normal is subjective. Unless you take advantage of job instantiation to divide and conquer massive files, unless you use bulk loading and design for it, and unless you modularize and focus job designs on mitigating outside influences so each task can run as fast as possible... you'll never get your massive volumes loaded quickly and efficiently, with audit trails and restart points.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

The record length is about 300 bytes on average, pretty small actually, but you are right about array size, though in this case it seems okay. Generally I try to use 500 for medium-sized files and 1,000 when I think it's a small record, byte-wise.

But apart from these 2 things, I read about buffer size and changed it to about 512 KB, which got some performance improvement. I'm still trying to see if there is something else that could improve it further, something you guys have experimented with and got a good performance improvement from.

Thanks for the input.
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

Yes, I agree that TOAD may have misled me (my query was the exact view definition, which I copied), but in the DataStage job data starts flowing in from the ORAOCI stage that reads from that view pretty soon, I mean in a couple of minutes and not hours; it's the processing that's taking so long.

About Ken's suggestion to replace the Oracle-sequence lookup with a job parameter + stage variable combination: I have actually used an ODBC lookup with user-defined SQL, writing a query like 'select key.nextval from dual' and mapping that to the ID on the target in the transformer. I am not so sure whether that is also a culprit. Now, to elaborate on what Ken suggested: how do we get the value the sequence has to start off from, from a before-job subroutine? How do we transfer that value to a job parameter? I shouldn't be expected to provide that Oracle sequence value to the job parameter at runtime, right? Please clarify for me at your convenience.

Also, all of you have suggested ORABULK instead of ORAOCI. I guess it's time I check it out as well, to see whether it allows the custom SQL that I have to run before dumping the data into the Oracle tables. I have no problem with changing my design if it gives an improvement.

About the suggestion to use a pure insert: yes, I am sorry, that was a typo in my message; I am using a pure insert and not an upsert.

Thanks.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

jreddy wrote: Yes, I agree that TOAD may have misled me (my query was the exact view definition, which I copied), but in the DataStage job data starts flowing in from the ORAOCI stage that reads from that view pretty soon, I mean in a couple of minutes and not hours; it's the processing that's taking so long.
NO. It's the time to get the last row from the query that matters, not the first row. Just because DataStage is streaming rows doesn't mean that all the rows have even been gathered by the database. Your assumption that transformation is the issue is false.
jreddy wrote: About Ken's suggestion to replace the Oracle-sequence lookup with a job parameter + stage variable combination: I have actually used an ODBC lookup with user-defined SQL, writing a query like 'select key.nextval from dual' and mapping that to the ID on the target in the transformer. I am not so sure whether that is also a culprit. Now, to elaborate on what Ken suggested: how do we get the value the sequence has to start off from, from a before-job subroutine? How do we transfer that value to a job parameter? I shouldn't be expected to provide that Oracle sequence value to the job parameter at runtime, right? Please clarify for me at your convenience.
Not a before-job routine. In your job control, before running the job, simply query the database to get the maximum surrogate key in the table and pass it in as a parameter. Using nextval requires you to use a sequence. This is not only inefficient, but if you switch to bulk loading via the DIRECT path you must have the surrogate key already in place in the dataset. Conventional loads have the ability to acquire the max surrogate in use or use a sequence. I prefer a DIRECT load, as it is unbelievably fast.
jreddy wrote: Also, all of you have suggested ORABULK instead of ORAOCI. I guess it's time I check it out as well, to see whether it allows the custom SQL that I have to run before dumping the data into the Oracle tables. I have no problem with changing my design if it gives an improvement.

About the suggestion to use a pure insert: yes, I am sorry, that was a typo in my message; I am using a pure insert and not an upsert.

Thanks.
If you're loading millions of rows and have never worked with a bulk loader, then it is time to learn. I recommend just experimenting with the ORABULK stage to see how a control file works. BUT you should not use this stage, as it is slow! You are better off writing the data to a sequential text file, using sqlldr directly, and writing your own control file.
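As a very rough sketch of what a hand-written control file looks like (every name, the delimiter, and the column list below are made up; yours will differ):

-- load_target.ctl
LOAD DATA
INFILE 'target_extract.dat'
APPEND
INTO TABLE target_table
FIELDS TERMINATED BY '|'
(id, cust_name, amount, load_date DATE "YYYY-MM-DD")

-- invoked along the lines of:
-- sqlldr userid=user/password@db control=load_target.ctl log=load_target.log direct=true

DIRECT=TRUE is where the speed comes from, but as noted above it means the surrogate keys must already be sitting in the file.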
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

Yes, I did work with the bulk loader before and never used it again, because I thought it was useless, in the sense that it was so slow. From all your suggestions I thought I might have been wrong and wanted to learn more about its options and features to improve performance. Anyway...

Yeah, I agree 100% that using sqlldr is much better, because a couple of years ago that's what I used for data migration. But you must understand that there is also some SQL processing I have to do just before the insert, and I need to be able to keep doing that. Also, the client I work for has no resource with technical expertise in ETL as such, so they have asked us to totally automate the whole process, which is why we're doing everything in DataStage itself.


Anyway, what I am doing right now is taking a subset of the data that I actually access in those joins (from the view) and generating hashed files from it, rather than building hashed files holding the millions of rows of the original tables they are based on, and making the input view just a single join between 2 tables.
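The reference extracts feeding those hashed files will be something along these lines (table, column, and filter names here are illustrative only), just the key plus the columns the job actually needs, restricted to the rows this load can hit:

-- trimmed reference extract for one hashed file (hypothetical names)
SELECT r.cust_id,
       r.cust_status
FROM   ref_customer r
WHERE  r.active_flag = 'Y';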

Thanks, guys, for the discussion, but let's keep posting new tips and tricks that greatly improve performance, be it design alternatives or stage preferences.
wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Post by wdudek »

Do the tables that you are inserting into have indexes on them? If so, drop these before the insert and recreate them afterwards. We move over 10 GB of data into Oracle nightly, and while we are definitely pushing the limits of our capacity to do this, we can get the job done during the night most of the time now; it has involved a tremendous amount of tuning, though.
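Something along these lines, with your own index names of course (the ones below are made up):

-- before the load
DROP INDEX target_table_ix1;

-- after the load completes
CREATE INDEX target_table_ix1 ON target_table (cust_id);

Rebuilding once at the end is usually far cheaper than maintaining the index row by row through a multi-million-row insert.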
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

We did drop all the indexes. The main culprit could be the referential integrity, check, primary key, and NOT NULL constraints on the target columns, though, which we cannot drop.