Better way to handle restartability while loading

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Better way to handle restartability while loading

Post by Ratan Babu N »

Hi,

In my job I am reading from a dataset and loading it into a table.
The source contains 2 million records, and since the data is transactional I have only inserts. To take care of restartability (say the job abends in the middle after loading 1 million records), which of the following is the better approach?

i) Performing a left outer join with the warehouse table, filtering out the existing records, and inserting only the new records that were not inserted in the previous run.

ii) Using the upsert option, so that existing records are updated and new records are inserted.

Please let me know if there is any better option (performance-wise).
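For reference, option (i) amounts to an insert-only anti-join against the warehouse. A rough SQL sketch of the idea (the names SRC_STAGE, WAREHOUSE_TBL and TXN_ID are made up for illustration; in the job itself this would be a Join stage plus a Filter/Transformer ahead of the target stage):

    -- Option (i): insert only the rows not already present in the warehouse.
    -- SRC_STAGE, WAREHOUSE_TBL and TXN_ID are hypothetical names.
    INSERT INTO WAREHOUSE_TBL (TXN_ID, TXN_DATE, AMOUNT)
    SELECT s.TXN_ID, s.TXN_DATE, s.AMOUNT
    FROM   SRC_STAGE s
           LEFT OUTER JOIN WAREHOUSE_TBL w
                        ON w.TXN_ID = s.TXN_ID
    WHERE  w.TXN_ID IS NULL;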
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Upsert will be much slower if your target table keeps growing.
You can always load the data in a certain order (sorted) on some key, so that each time you load you can find the maximum key value already present in the target and continue loading the source from that point.
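A minimal sketch of that idea in SQL, assuming a monotonically increasing key (here a made-up column TXN_SEQ, with SRC_STAGE and WAREHOUSE_TBL also hypothetical); the source would be sorted on the same key inside the job:

    -- Find the high-water mark already loaded, then continue from there.
    -- TXN_SEQ, SRC_STAGE and WAREHOUSE_TBL are hypothetical names.
    INSERT INTO WAREHOUSE_TBL (TXN_SEQ, TXN_DATE, AMOUNT)
    SELECT s.TXN_SEQ, s.TXN_DATE, s.AMOUNT
    FROM   SRC_STAGE s
    WHERE  s.TXN_SEQ > (SELECT COALESCE(MAX(TXN_SEQ), 0)
                        FROM   WAREHOUSE_TBL);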
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Post by Ratan Babu N »

kumar_s wrote: Upsert will be much slower if your target table keeps growing.
You can always load the data in a certain order (sorted) on some key, so that each time you load you can find the maximum key value already present in the target and continue loading the source from that point.
But I don't have a surrogate key in my table, and no column in my source records carries a running number. Since upsert is slower, can I go for the option of joining with the warehouse table (the table consists of millions of records)?
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

Add a date column to your target table and fill it with the load date during the load. If the load aborts in the middle, you can easily identify the records to be deleted; after deleting them you can run the job again.

You could automate this by adding a delete job in the sequence before the load job (do not checkpoint that run). If the date column is indexed, then under normal circumstances it will take only a few moments for the delete to discover there is nothing to delete.

All this applies if your load is a daily increment, run no more than once a day.
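A minimal sketch of the cleanup the delete job would run, assuming a hypothetical LOAD_DT column and the run date arriving as a job parameter (#LoadDate# is a made-up parameter name):

    -- Remove anything a previously aborted attempt of today's load committed.
    -- WAREHOUSE_TBL, LOAD_DT and #LoadDate# are hypothetical names.
    DELETE FROM WAREHOUSE_TBL
    WHERE  LOAD_DT = '#LoadDate#';

In normal runs the indexed lookup finds no matching rows and the statement returns almost immediately.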
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If the job is just for the load and your data is only a million or so rows, you can commit after the whole transaction (transaction size 0), so that if the job aborts in the middle you can restart from scratch. If you need to maintain row-level commits, you can add a batch number for each run, which keeps the job simple: after an abort you just delete the records carrying the current batch number.
If you cannot follow either approach, you can fall back on the upsert.
By the way, how would you insert only the rows that are not already in the target using approach (i)? Using any cursors?
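A sketch of the batch-number variant, with a made-up BATCH_NO column and run identifier:

    -- Before re-running batch 20070611, throw away whatever the aborted
    -- attempt of that batch managed to commit (row-level commits).
    -- WAREHOUSE_TBL and BATCH_NO are hypothetical names.
    DELETE FROM WAREHOUSE_TBL
    WHERE  BATCH_NO = 20070611;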
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

Hey, this is great news. I knew about the flashback query where you can specify a time interval (how far in the past). I didn't know about the SCN and flashing back the table directly.

But how about this:
ORA-12801: error signaled in parallel query server P011
ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$" too small
or this:
ORA-12801: error signaled in parallel query server P000
ORA-08180: no snapshot found based on specified time
What is your experience? Can I expect my DBA to set things up such that flashback is possible after many millions of records are changed/inserted?
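For anyone following the thread, the two flavours being discussed look roughly like this in standard Oracle SQL (table name and SCN value are made up):

    -- Flashback query: read the table as it was at a point in the past.
    SELECT *
    FROM   warehouse_tbl AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR);

    -- Flash back the table itself to an SCN captured before the load started
    -- (requires row movement to be enabled on the table, Oracle 10g onwards).
    ALTER TABLE warehouse_tbl ENABLE ROW MOVEMENT;
    FLASHBACK TABLE warehouse_tbl TO SCN 123456789;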
nick.bond
Charter Member
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

In theory there is no limit to the number of changes to a table that you can flash back, but for this to be possible Oracle needs to log all the changes to the undo (rollback) segments, so there is a space issue. As long as your DBA can increase the relevant parameters enough you will be OK. Off the top of my head I don't know exactly what they are, but a quick Google for Oracle Flashback will tell you; there is one for time and one for space. I imagine your problem at the moment is space.
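For what it's worth, the two knobs usually meant here are the undo retention period (time) and the size of the undo tablespace (space); something along these lines, with made-up values and datafile name, the final numbers being for the DBA to judge:

    -- Keep undo for at least 4 hours (the time dimension).
    ALTER SYSTEM SET UNDO_RETENTION = 14400;

    -- Make sure the undo tablespace can actually hold that much (the space dimension).
    ALTER DATABASE DATAFILE '/u01/oradata/undotbs01.dbf' RESIZE 8G;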

If you are only performing inserts, you may be better off with a solution similar to the one suggested earlier: add either a load_date column or an etl_id column to the table to store an identifier for all the records inserted in a particular run. You can then use various methods to remove these records after an abort, two of which would be (see the sketch after this list):
  • Have an after-job subroutine call a script or routine that runs a delete statement, taking the table name and etl_id/load_date as parameters. You tick a box so the after-job subroutine only runs if the job fails.
  • Have a branch in the Sequence that you enter when the job aborts, and perform the delete there.
One thing to remember with both of these is that if the job aborts because database connectivity goes down, the delete most likely won't work either, so manual intervention would be required; that is an argument for having a check at the start of the next run.
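A hedged sketch of the delete that either the after-job routine or the abort branch would issue, with #TargetTable# and #EtlId# standing in for values passed as parameters (made-up names):

    -- Fired only when the load job aborts; removes the partial run.
    -- #TargetTable#, ETL_ID and #EtlId# are hypothetical names/parameters.
    DELETE FROM #TargetTable#
    WHERE  ETL_ID = #EtlId#;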
Regards,

Nick.
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Post by Ratan Babu N »

The database is DB2 UDB 8.x. Is there any flashback option in DB2?

If we go for the option of a date column or etl_id column (suggested by Nick Bond) to identify the records for a particular run, and delete them if the job aborts in the middle, is it a good idea to perform a delete on a table of 100 million records that is growing by 2 to 4 million per run?
nick.bond
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

Ratan Babu N wrote: The database is DB2 UDB 8.x. Is there any flashback option in DB2?

If we go for the option of a date column or etl_id column (suggested by Nick Bond) to identify the records for a particular run, and delete them if the job aborts in the middle, is it a good idea to perform a delete on a table of 100 million records that is growing by 2 to 4 million per run?
In terms of DB2 flashback-type functionality, I don't know. Try Googling it.

As long as the field is indexed, the delete shouldn't be too much of a problem; it will take a while, but hopefully you don't expect this to happen too often.

Another thing to consider is partitioning the table by this field; then you can just delete the whole partition.
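A hedged sketch of both suggestions, using made-up names; note that table (range) partitioning with DETACH PARTITION only arrived in DB2 9, not 8.x:

    -- Non-unique index so the cleanup delete doesn't scan the full 100M rows.
    CREATE INDEX IX_WHS_LOAD_DT ON WAREHOUSE_TBL (LOAD_DT);

    -- On DB2 versions with range partitioning, an aborted run's partition
    -- can be rolled out in one operation instead of deleted row by row.
    ALTER TABLE WAREHOUSE_TBL
      DETACH PARTITION P20070611 INTO TABLE WAREHOUSE_TBL_ABORTED;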
Regards,

Nick.
Ratan Babu N
Participant
Posts: 34
Joined: Tue Dec 27, 2005 12:13 am

Post by Ratan Babu N »

As we are maintaining 7 years of data, I feel partitioning on this column is not a good idea; deleting the data with the help of a non-unique index on this column is the better approach (hoping this won't occur too frequently).
dearali
Participant
Posts: 1
Joined: Mon Jun 11, 2007 8:53 pm

Re: Better way to handle restartability while loading

Post by dearali »

We used to capture the last run time of the job (DSJobStartTimestamp) and the job name in a hashed file, and used a routine to pass it as a parameter in the WHERE clause the next time we ran, so that only the incremental data is loaded. I think the same logic can be applied here as well.
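A sketch of how that captured timestamp typically ends up in the source query, assuming it is surfaced as a job parameter (#LastRunTimestamp# and the table/column names are made up):

    -- Pull only rows created since the previous successful run.
    -- SRC_TABLE, TXN_TS and #LastRunTimestamp# are hypothetical names.
    SELECT *
    FROM   SRC_TABLE
    WHERE  TXN_TS > '#LastRunTimestamp#';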

Regards,
Mahammad Ali