Ok, this is pretty long-winded, but please stick with me.
I have been asked to see if it's possible to optimise a current job in order to make it run faster. Currently in Production the job takes approx. 3 hours to complete.
Job Overview:
Records inserted into a database currently have overlapping date ranges, eg:
Element  Start Date  End Date
A        01/08/2006  31/07/2007
A        01/01/2007  31/12/2007
A        01/01/2008  31/12/2008
The key of this data is Element and Start Date.
The job takes this data and removes the overlap by taking the lowest start date and highest end date, ONLY WHERE THERE IS AN OVERLAP. So the output is:
Element  Start Date  End Date
A        01/08/2006  31/12/2007
A        01/01/2008  31/12/2008
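To make the intended result concrete, here is a minimal Python sketch of the same merge rule (lowest start date, highest end date, only where ranges overlap). This is just an illustration of the logic, not the DataStage job itself; the function name and data layout are my own.

```python
from datetime import date

def merge_overlaps(rows):
    """rows: list of (element, start, end) tuples, in any order.
    Returns the rows with overlapping ranges per element collapsed."""
    # Sort by element, start asc, end desc -- mirrors the job's sort order
    rows = sorted(rows, key=lambda r: (r[0], r[1], -r[2].toordinal()))
    merged = []
    for elem, start, end in rows:
        if merged and merged[-1][0] == elem and start <= merged[-1][2]:
            # Overlap with the previous range: extend to the highest end date
            prev = merged[-1]
            merged[-1] = (elem, prev[1], max(prev[2], end))
        else:
            # No overlap (or new element): start a new output row
            merged.append((elem, start, end))
    return merged

data = [
    ("A", date(2006, 8, 1), date(2007, 7, 31)),
    ("A", date(2007, 1, 1), date(2007, 12, 31)),
    ("A", date(2008, 1, 1), date(2008, 12, 31)),
]
print(merge_overlaps(data))
# -> [('A', 2006-08-01, 2007-12-31), ('A', 2008-01-01, 2008-12-31)]
```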
To accomplish this the job looks like this:
OraDB.Tbl1 --> Trf1 --> SeqFile1 --> Sort1 --> Trf2 --Ins--> OraDB.Tbl1
                                                |  \
                                                |   \--Upd--> OraDB.Tbl1
                                                |
                                           luOraDB.Tbl1
Process:
1. Data is read from the Oracle DB, sorted on Element asc, Start Date asc, End Date desc
2. Timestamp added at Trf1 (required in target table for other processing not related to this job)
3. Data is loaded into a Sequential file (mainly for disaster recovery should something go wrong)
4. Data is further sorted by the key so that the values are processed in order (especially where something has gone wrong and further data may have been appended to SeqFile1)
5. At luTbl1, Tbl1 is truncated to remove all existing data (which is now in SeqFile1). The query for luTbl1 selects Element, max(Start Date), End Date (ie it pulls back the row with the highest Start Date for a particular element)
6. Trf2 then determines if the record is an insert or an update to Tbl1, by comparing it to data which has been loaded into Tbl1 already.
ie the first record would be an insert, as there is nothing in the table. Array size and Transaction size in the target DB are both 1, so each record is committed immediately. The second record is then compared to luTbl1, which will now have 1 record in it. Stage variables determine whether the incoming record is an insert or an update:
- Insert = IsNull(lu.Element) OR incoming.START_DATE > lu.END_DATE
- Update = Not(Insert) AND incoming.END_DATE > lu.END_DATE
If it's an update, the record in Tbl1 is updated with the incoming end date.
So using the data above:
1. The first record is inserted into Tbl1 (as a match won't be found on the element in lookup as its empty)
2. For the second record a match is found in the lookup on the element (A), with a Start Date (01/01/2007) < lu.End Date (31/07/2007), and the second record's End Date (31/12/2007) > the lookup End Date (31/07/2007); therefore the End Date in Tbl1 is updated and committed
3. The last record's Start Date is > the lookup End Date, so it is a new record and is inserted accordingly
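The three steps above can be mocked up in a few lines of Python. This is only a sketch of the Trf2 stage-variable logic, not the actual job: `lookup` stands in for luTbl1 (the highest-start-date row already committed per element), and all names are illustrative.

```python
from datetime import date

def classify(lookup, elem, start, end):
    """Decide insert vs update for one incoming row, mimicking Trf2.
    lookup maps element -> (start, end) of the committed row."""
    lu = lookup.get(elem)            # None plays the role of IsNull(lu.Element)
    if lu is None or start > lu[1]:  # lu[1] corresponds to lu.END_DATE
        lookup[elem] = (start, end)  # insert: new row becomes the lookup row
        return "insert"
    if end > lu[1]:
        lookup[elem] = (lu[0], end)  # update: push out the end date only
        return "update"
    return "discard"                 # fully contained in the existing range

lookup = {}
rows = [
    ("A", date(2006, 8, 1), date(2007, 7, 31)),
    ("A", date(2007, 1, 1), date(2007, 12, 31)),
    ("A", date(2008, 1, 1), date(2008, 12, 31)),
]
for elem, start, end in rows:
    print(elem, start, "->", classify(lookup, elem, start, end))
# -> insert, update, insert (matching steps 1-3 above)
```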
Ok, a long-winded explanation I know, but I am trying to be detailed here.
The question is, "Any ideas on ways of making this job run faster?"
The input table has approx 1.7 million records.
Actions taken to this point: we increased Max Rows and Max Open Files on the sort stage to 50000 and 20 respectively. We also added a partition stage between Trf1 and SeqFile1 to split the stream (using modulus so that records with the same element follow the same stream) and repeated the current process on each stream to share the load. This shaved about 30 minutes to an hour off processing.
My initial thought is that the slow speed is caused by the sort stages, but these are required to ensure the earliest start dates are loaded first. Is it possible that it is in fact caused at the DB level?
Anybody got any suggestions? You don't have to get too detailed - we have the personnel, we are just out of ideas. Or is this more of a "hire a better expert" type of problem?
Thanks in advance for any suggestions
Regards
Shane Muir