Data Rationalisation - Performance and Optimisation

Post questions here relating to DataStage Server Edition in such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Data Rationalisation - Performance and Optimisation

Post by ShaneMuir »

Hi Everyone,

Ok, this is pretty long-winded, but please stick with me.

I have been asked to see if it's possible to optimise a current job in order to make it run faster. Currently in Production the job takes approx 3 hours to complete.

Job Overview:
Records inserted into a database currently have overlapping date ranges, e.g.

Element Start Date End Date
A 01/08/2006 31/07/2007
A 01/01/2007 31/12/2007
A 01/01/2008 31/12/2008

The key of this data is Element and Start Date.

The job takes this data and removes the overlap by taking the lowest start date and highest end date, ONLY WHERE THERE IS AN OVERLAP. So the output is:

Element Start Date End Date
A 01/08/2006 31/12/2007
A 01/01/2008 31/12/2008

To accomplish this the job looks like this

Code: Select all

OraDB.
Tbl1 --> Trf1 --> SeqFile1 --> Sort1 --> Trf2 --Ins--> OraDB.Tbl1
                                         |  \
                                         |   \--Upd-->  OraDB.Tbl1
                                         |
                                     luOraDB.
                                     tbl1
Assumption: All references to OraDB Tbl1 refer to the same table (luTbl1 is a lookup of that table)

Process:
1. Data is read from the Oracle DB, sorted on Element asc, Start Date asc, End Date desc
2. Timestamp added at Trf1 (required in target table for other processing not related to this job)
3. Data loaded into Sequential file (mainly for disaster recovery should something go wrong)
4. Data is further sorted by the key so that the values are processed in order (especially where something has gone wrong and further data may have been appended to SeqFile1)
5. At luTbl1, Tbl1 is truncated to remove all existing data (which is now in SeqFile1). The query for luTbl1 selects Element, max(Start Date) and End Date (i.e. it pulls back the highest start date for a particular element; see the sketch after this list).

6. Trf2 then determines whether the record is an insert or an update to Tbl1 by comparing it to data which has been loaded into Tbl1 already.
i.e. The first record would be an insert as there is nothing in the table. Array size and Transaction size in the target DB are both 1, so the record is committed immediately. The second record will be compared to luTbl1, which will now have 1 record in it. Stage variables determine whether the incoming record is an insert or an update:
  • Insert = IsNull(lu.Element) OR incoming.START_DATE > lu.END_DATE
  • Update = Not(Insert) AND incoming.END_DATE > lu.END_DATE
If it's an insert, the incoming record is inserted as-is into OraDB.Tbl1.
If it's an update, the existing record is updated with the incoming expiry (end) date.
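
For reference, here is a minimal sketch of what the luTbl1 reference query is likely doing - the table/column names and the :1 bind for the lookup key are assumptions based on the description above, not the actual job SQL. For each element it returns the row with the latest start date, together with that row's end date.

Code: Select all

-- Hypothetical lookup SQL: latest start date (and its end date) per element
SELECT ELEMENT, START_DATE, END_DATE
FROM   TBL1 t
WHERE  ELEMENT    = :1                        -- key supplied by the reference link
AND    START_DATE = (SELECT MAX(START_DATE)
                     FROM   TBL1
                     WHERE  ELEMENT = t.ELEMENT)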

So using the data above:
1. The first record is inserted into Tbl1 (as a match won't be found on the element in the lookup because it's empty).
2. For the second record a match is found in the lookup on the element (A) with a Start_date (01/01/2007) < lu.End_date (31/07/2007), and the second record's end date (31/12/2007) is > the lookup end date (31/07/2007), therefore the end date in Tbl1 is updated and committed.
3. The last record's start date is > the lookup end date, so it is a new record and is inserted accordingly.

Ok, a long-winded explanation I know, but I am trying to be detailed here.
The question is, "Any ideas on ways of making this job run faster?"

The input table has approx 1.7 million records.
Actions taken to this point have been to increase the Max Rows and Max Open Files on the Sort stage to 50000 and 20 respectively. We have also added a partition stage between Trf1 and SeqFile1 to split the stream (ensuring that records with the same element follow the same stream by using modulus), repeating the current process on each stream to share the load. This shaved about 30 mins to an hour off processing.

My initial thoughts are that the slow speed is caused by the sort stages, but these are required to ensure the earliest start dates are loaded first. Is it possible that it is in fact caused at the DB level?

Anybody got any suggestions? You don't have to get too detailed - we have the personnel, we are just out of ideas. Or is this more of a 'hire a better expert' type of problem?

Thanks in advance for any suggestions

Regards
Shane Muir
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Data Rationalisation - Performance and Optimisation

Post by chulett »

ShaneMuir wrote:My initial thoughts are that the slow speed is caused by the sort stages, but these are required to ensure the earliest start dates are loaded first. Is it possible that it is in fact caused at the DB level?
That should be readily apparent... how much of the 3 hours are spent in the Sort stage? Once records start being delivered by the Sort stage, how much time do the Oracle operations take?

Your other performance killer is your choice of array and transaction size. Why the need to commit every record as it is written? The lookup? Ditch it. Consider this:

Use one OCI stage for both insert and update links. This makes all processing occur as one unit of work. Switch to a hashed file for the lookup, making sure it is not cached to memory. Include another link from Trf2 to another hashed file stage that points to the same hashed file as the lookup. Whenever you write a record to Oracle, also write it to the hashed file so it will be available in the lookup.
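
A rough sketch of what that revised layout might look like, keeping the stage names from the original diagram - the hashed file name HashTbl1 is purely illustrative:

Code: Select all

SeqFile1 --> Sort1 --> Trf2 ---Ins/Upd---> OraDB.Tbl1  (one OCI stage,
                       | |                              both links)
                       | |
                       | \---------------> HashTbl1    (write-back link,
                       |                                no write cache)
                       |
                  luHashTbl1
                  (lookup on the same hashed file, read cache disabled)

Because every row you write to Oracle is also written straight to the hashed file, the next row's lookup sees it immediately - which is what removes the need to commit every single row to the database.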

That should help the latter portion of your job... a lot.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Oh... and if you do this, open up both the Array and Transaction sizes. If your DB is configured to handle it, consider using a zero transaction size so that restarts don't become a complicated issue.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Re: Data Rationalisation - Performance and Optimisation

Post by ShaneMuir »

chulett wrote:
ShaneMuir wrote:My initial thoughts are that the slow speed is caused by the sort stages, but these are required to ensure the earliest start dates are loaded first. Is it possible that it is in fact caused at the DB level?
That should be readily apparent... how much of the 3 hours are spent in the Sort stage? Once records start being delivered by the Sort stage, how much time do the Oracle operations take?
I was probably a little wishy-washy with the statement above - I knew it was the sort stage that was the biggest hold-up. I was a little out with my 3 hours, it's actually 2.5, but of that it looks like about 2 hours is spent in the sort stage.
chulett wrote:Your other performance killer is your choice of array and transaction size. Why the need to commit every record as it is written? The lookup? Ditch it. Consider this:

Use one OCI stage for both insert and update links. This makes all processing occur as one unit of work. Switch to a hashed file for the lookup, making sure it is not cached to memory. Include another link from Trf2 to another hashed file stage that points to the same hashed file as the lookup. Whenever you write a record to Oracle, also write it to the hashed file so it will be available in the lookup.
Thanks for the idea. Will see how it goes and let everybody here know.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ok, more things to consider to help with the sorting. The Sort stage is good for small volumes but is pretty darn slow for anything remotely considered 'large'. Let your operating system do the sort for you, and leverage it via the Sequential file stage you already have in the job.

Remove the Sort stage and enable the Filter option in the Sequential File stage. For the Filter command, pass the file you've just landed to the UNIX 'sort' command with the appropriate key definitions to sort it appropriately. The stage will then read standard out - the output of the sort command - as if it were a physical file and feed sorted data to the rest of your job. Actually, your landed file goes in the Filename box and the sort command in the Filter box and the stage will automatically handle the redirection of standard in and standard out.

Hard to predict the effect of this, but I've generally seen a 10x increase in sort speed over the use of the dedicated stage. The best way to get this right is to land the file and manually execute the sort command, redirecting the output to another filename. Keep at it until the output is correct, then transfer the sort command into your job. I don't remember the specifics (a search should turn them up as this has been discussed before) but you'd usually include the options to force 'maximum memory' and to redirect temp files off to a specific location - like /tmp.
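
Purely as an illustration - the pipe delimiter, the column positions (Element, Start Date, End Date as the first three fields) and the temp directory are assumptions, so adjust them to your actual landed file. Note too that character dates only sort chronologically if they are stored in a format like YYYY-MM-DD:

Code: Select all

# Filter command: reads the landed file on standard in, writes sorted rows
# to standard out. Keys: Element asc, Start Date asc, End Date desc.
sort -t '|' -k1,1 -k2,2 -k3,3r -T /tmp

The 'maximum memory' option varies by platform (e.g. -y on some older Unixes), so check your local sort man page before adding it.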

Good luck! :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Ahh of course. Completely slipped my mind that we could sort it in Unix first.

Thanks again for your help on this Craig. I hope I can repay the favour one day (if not to you then to others on this forum).
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You can also go to the Tracing tab when submitting a job run request and ask for statistics to be gathered for each active stage in the job. This can yield detailed timing and therefore allow you to further diagnose the "hot spots" within your design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Another piece of advice: split up your job. Modularize it. It might not speed things up, but it will sure make the jobs more readable, and restartability will be easier too. Plus you will get a big help from the UNIX sort.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Just an update:

It turns out it wasn't the sort stage but rather the database load that was taking the majority of the time. Implementing the hashed file solution has cut loading time in half. We also set the transaction size to 0.

We have also tested with the UNIX sort and it does give about a threefold increase in speed, taking only 2-3 mins now.

One question though regarding the below statement

Oh... and if you do this, open up both the Array and Transaction sizes. If your DB is configured to handle it, consider using a zero transaction size so that restarts don't become a complicated issue.
Will increasing the array size mean that some records could be missed for an update because they both come through in the same chunk of records? Or is each record written to the hashed file and available immediately for reference, regardless of the array size used in the database?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ShaneMuir wrote:Will increasing the array size mean that some records could be missed for an update because they both come through in the same chunk of records? Or is each record written to the hashed file and available immediately for reference, regardless of the array size used in the database?
The latter. The hashed file writes happen immediately; the only thing that might cause an issue with that is if you have write caching turned on.

The Array Size affects the OCI stage only, bundling the rows into a 'packet' based on the array size before actually making the trip to the database. This happens without affecting the behaviour of the rest of the job and stages.

Something to be aware of is Reject Row handling with an Array Size of anything other than 1. With it set to 1, everything functions as you'd expect. Increase the number and the behaviour changes. The number of rows noted as rejected and the record values in the job's log will be correct, but what is actually captured down the reject link will be wrong. From my testing, the last row in the array was always what came down the reject link, regardless of the number of records in the array that ended up rejected. :shock:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

chulett wrote:
ShaneMuir wrote:Will increasing the array size mean that some records could be missed for an update because they both come through in the same chunk of records? Or is each record written to the hashed file and available immediately for reference, regardless of the array size used in the database?
The latter. The hashed file writes happen immediately; the only thing that might cause an issue with that is if you have write caching turned on.

The Array Size affects the OCI stage only, bundling the rows into a 'packet' based on the array size before actually making the trip to the database. This happens without affecting the behaviour of the rest of the job and stages.
Thanks for clearing that up for me. That always seemed the logical conclusion, but I thought I would check.
chulett wrote: Something to be aware of is Reject Row handling with an Array Size of anything other than 1. With it set to 1, everything functions as you'd expect. Increase the number and behaviour changes. The number of rejects noted as being rejected and the record values in the job's log will be correct but what is actually captured down the reject link will be wrong. From my testing, the last row in the array was always what was rejected, regardless of the number of records in the array that ended up rejected. :shock:
I remember reading that somewhere else, and there were suggestions that setting the update action to update/insert means the rejects are handled as they should be. Anyway, basically, as I require the REJECTEDCODE output link variable to be correct, I should continue to use an array size of 1.

Thanks everybody for your input into this. It is most appreciated.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Just for everybody's information.

The changes suggested (i.e. use hashed files and UNIX sorts) have sped the job up from 3 hours to 35 mins. (In initial testing anyway, but it's a very promising sign!! :D )

Thanks again to everybody who put forward suggestions.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Excellent. Was wondering how this turned out...
-craig

"You can never have too many knives" -- Logan Nine Fingers