
Better way to take care of restartability while loading

Posted: Wed May 02, 2007 3:48 am
by Ratan Babu N
Hi,

In my job I am reading from a dataset and loading it into a table.
The source consists of 2 million records, and I have only inserts since the data is transactional. In order to take care of restartability (say the job abends in the middle after loading 1 million records), which of the following is the better approach?

i) Performing a left outer join with the warehouse table, then filtering out the existing records and inserting only the new records that were not inserted in the previous run.

ii) Using the upsert option, so that the existing records will be updated and the new records will be inserted.

Please let me know if there is any better option (performance-wise).
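
For reference, option (i) expressed in plain SQL would look roughly like this; the table, dataset and key names are illustrative, not taken from the actual job (in DataStage the same logic would typically be a Join stage followed by a filter):

    -- Sketch of option (i): insert only the rows that are not already in the warehouse.
    INSERT INTO warehouse_table (txn_id, txn_date, amount)
    SELECT s.txn_id, s.txn_date, s.amount
    FROM   staging_ds s
    LEFT OUTER JOIN warehouse_table w
           ON w.txn_id = s.txn_id
    WHERE  w.txn_id IS NULL;        -- keep only the rows missing from the target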

Posted: Wed May 02, 2007 3:57 am
by kumar_s
Upsert will be much slower if your target table keeps growing.
You can always load the data in a certain order (sorted) based on some key, so that each time you load, you can find the maximum key value already in the target and load from that point in the source.
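
A minimal sketch of this restart-from-the-maximum-key idea, assuming a suitable ordering key exists and using illustrative names:

    -- 1. Find how far the previous (aborted) run got.
    SELECT MAX(txn_id) AS max_loaded_id
    FROM   warehouse_table;

    -- 2. Re-run the load, picking up only the rows beyond that point
    --    (the source is read sorted on the same key).
    INSERT INTO warehouse_table (txn_id, txn_date, amount)
    SELECT s.txn_id, s.txn_date, s.amount
    FROM   staging_ds s
    WHERE  s.txn_id > :max_loaded_id;   -- value from step 1, passed in as a parameter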

Posted: Wed May 02, 2007 4:14 am
by Ratan Babu N
kumar_s wrote: Upsert will be much slower if your target table keeps growing.
You can always load the data in a certain order (sorted) based on some key, so that each time you load, you can find the maximum key value already in the target and load from that point in the source.
But I don't have a surrogate key in my table, and no column in my source records carries a running number. Since upsert is slower, can I go for the option of joining with the warehouse table (the table consists of millions of records)?

Posted: Wed May 02, 2007 7:07 am
by ivannavi
Add a date column to your target table and fill it with the load date during the load. If the load aborts in the middle, you can easily identify the records to be deleted. After deletion you can run the job again.

You could try to automate this process by adding a delete job in the sequence before the load job (do not checkpoint the run). If the date column is indexed, then under normal circumstances it will take only a few moments for the delete to realise there is nothing to delete.

All this applies if your load is a daily increment, not more than once a day.
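
A minimal sketch of the delete such a clean-up job would run, assuming a load_date column that is filled during the load (names are illustrative):

    -- Remove any rows left behind by an aborted run before loading again.
    DELETE FROM warehouse_table
    WHERE  load_date = CURRENT DATE;    -- or the specific run date passed in as a parameter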

Posted: Wed May 02, 2007 6:46 pm
by kumar_s
If the job is just for the load and your data is only a couple of million rows, you can commit after all the transactions, i.e. set the transaction size to '0', so that if your job aborts in the middle you can restart from scratch.
If you need to maintain row-level commits, you can have a batch number for each run, which makes the job simpler: you just delete the records with the current batch number.
If you cannot follow either approach, you can fall back on the upsert.
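
The batch-number variant is the same clean-up pattern, keyed on the run's batch number instead of a date (illustrative names):

    -- Remove only the rows written by the aborted run.
    DELETE FROM warehouse_table
    WHERE  batch_no = :current_batch_no;   -- batch number assigned to this run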
By the way, how would you insert the rows that are not already in the target using approach 1? Using any cursors?

Posted: Thu May 03, 2007 1:32 am
by ivannavi
Hey, this is great news. I knew about the flashback query where you could specify the time interval (how far in the past). I didn't know you could use the SCN and flash back the table directly.
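
For anyone following along, the SCN-based flashback being discussed looks roughly like this in Oracle 10g and later (table name and SCN are illustrative; FLASHBACK TABLE requires row movement to be enabled):

    -- Flashback query: read the table as it was at an earlier point in time.
    SELECT * FROM warehouse_table
    AS OF TIMESTAMP SYSTIMESTAMP - INTERVAL '1' HOUR;

    -- Flash the table itself back to an SCN captured before the load started.
    ALTER TABLE warehouse_table ENABLE ROW MOVEMENT;
    FLASHBACK TABLE warehouse_table TO SCN 1234567;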

But how about this:
ORA-12801: error signaled in parallel query server P011
ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$" too small
or this:
ORA-12801: error signaled in parallel query server P000
ORA-08180: no snapshot found based on specified time
What is your experience? Can I expect my DBA to set things up such that flashback is possible after many millions of records have been changed/inserted?

Posted: Thu May 03, 2007 1:59 am
by nick.bond
In theory there is no limit to the number of changes to a table that you can flash back, but for this to be possible Oracle needs to write a log of all the changes to the rollback segment, so there is a space issue. As long as your DBA can increase the parameters enough, you will be OK. Off the top of my head I don't know what they are, but a quick Google for Oracle Flashback will tell you. There is one for time and one for space. I imagine your problem at the moment will be space.
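
For what it's worth, the two settings are presumably the undo retention period (the time side) and the size of the undo tablespace (the space side); purely as an illustration, with an arbitrary value:

    -- Keep undo data long enough to flash back across the whole load window.
    ALTER SYSTEM SET UNDO_RETENTION = 86400;   -- seconds; value is illustrative
    -- The space side is simply how large the undo tablespace is allowed to grow.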

If you are only performing inserts, you may be better off going for a solution similar to the one suggested earlier: add either a load_date column or an etl_id column to the table that stores an identifier for all the records inserted in a particular run. You can then use various methods to remove these records after an abort, two of which would be:
  • Have an after-job subroutine call either a script or a routine that runs a delete statement, taking the table name and etl_id/load_date as parameters. You can tick a box so these after-job subroutines run only if the job fails.

  • Have a branch in the sequence that you enter when the job aborts, and perform the delete there.
One thing to remember with both of these is that if the job aborts because database connectivity goes down, the delete most likely won't work either, so manual intervention would be required, which is an argument for having a check at the start of the next run.

Posted: Fri May 04, 2007 12:38 am
by Ratan Babu N
The database is DB2 UDB 8.x. Is there any flashback option in DB2?

If we go for the option of having a date column or etl_id column (suggested by nick.bond) to identify the records for a particular run and delete them if the job aborts in the middle, is it a good idea to perform a delete operation on a table of 100 million records that is growing by 2 to 4 million?

Posted: Fri May 04, 2007 12:58 am
by nick.bond
Ratan Babu N wrote: The database is DB2 UDB 8.x. Is there any flashback option in DB2?

If we go for the option of having a date column or etl_id column (suggested by nick.bond) to identify the records for a particular run and delete them if the job aborts in the middle, is it a good idea to perform a delete operation on a table of 100 million records that is growing by 2 to 4 million?
In terms of DB2 flashback-type functionality, I don't know. Try Googling it.

As long as the field is indexed, the delete shouldn't be too much of a problem. It will take a while, but hopefully you don't expect this to happen too often.

Another thing to consider is partitioning the table by this field, so that you can simply delete the whole partition.
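
A rough sketch of both options, with illustrative names (note that table/range partitioning arrived in DB2 9 and may not be available on DB2 UDB 8.x):

    -- Option A: index the run identifier so the clean-up delete doesn't scan the whole table.
    CREATE INDEX ix_wh_load_date ON warehouse_table (load_date);

    -- Option B (only on databases with range partitioning, e.g. DB2 9+):
    -- partition by the run identifier and discard an aborted run's partition, e.g.
    -- ALTER TABLE warehouse_table DETACH PARTITION p_20070504 INTO TABLE aborted_rows;
    -- DROP TABLE aborted_rows;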

Posted: Fri May 04, 2007 1:31 am
by Ratan Babu N
As we are maintaining 7 years of data, I feel partitioning on this column is not a good idea; deleting the data after creating a non-unique index on this column is the better approach (hoping this won't occur too frequently).

Re: Better way to take care of restartability while loading

Posted: Mon Jun 11, 2007 11:34 pm
by dearali
We used to capture the job's last run time (DSJobStartTimestamp) and the job name in a hashed file, and used a routine to pass it as a parameter in the WHERE clause the next time the job runs, so that it loads only incremental data. I think the same logic can be applied here as well.
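
A minimal sketch of the incremental extract this gives you (column and parameter names are illustrative; the stored timestamp would be passed into the job as a parameter):

    -- Pull only the rows changed since the previous successful run of this job.
    SELECT *
    FROM   source_table
    WHERE  last_update_ts > :LAST_RUN_TIMESTAMP;  -- previous run's DSJobStartTimestamp, read from the hashed file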

regards
Mahammad Ali