Insert records from where it last failed
Is it possible for a parallel job to resume inserting records into a database from the point where it last failed, without looking up the target table or re-updating rows that have already been loaded?
What if the job aborted after inserting and committing 900,000 records, leaving 100,000 rows still to be inserted?
Does "Running from point of failure" achieve this?
This question has been raised many times, but never with a satisfactory answer.
Thanks,
DSX User.
Hi Ray,
Would this be a possible solution for eliminating re-insertion / re-updating when a job loading a huge volume fails?
Job 1:
1. Partition the source data into separate datasets using round-robin partitioning, preferably one per available node so the split is even.
Job 2:
This job runs in a loop, reading the datasets one after another and loading the data into the target. The dataset name is written to a temporary file on every iteration. If the job fails, we resume only from the last failed file, so data that has already been loaded is never re-updated (see the sketch after this post).
Could you suggest whether this will work?
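Not the poster's actual implementation, just a minimal sketch of how the Job 2 resume loop could be driven from a wrapper script. The dataset names DS1..DS4, the marker path /tmp/last_loaded.txt, and the load_dataset() placeholder (which would really invoke the load job, e.g. through dsjob) are all assumptions for illustration:

```python
import os

DATASETS = ["DS1", "DS2", "DS3", "DS4"]   # produced once by Job 1
MARKER = "/tmp/last_loaded.txt"           # remembers the last dataset loaded OK

def load_dataset(name: str) -> None:
    """Placeholder for the real load, e.g. 'dsjob -run' with the dataset
    name passed to Job 2 as a job parameter."""
    print(f"loading {name} ...")

def main() -> None:
    # On a rerun, read the marker so datasets already committed are skipped.
    done = ""
    if os.path.exists(MARKER):
        with open(MARKER) as f:
            done = f.read().strip()
    start = DATASETS.index(done) + 1 if done in DATASETS else 0

    for name in DATASETS[start:]:
        load_dataset(name)            # an abort here leaves the marker untouched
        with open(MARKER, "w") as f:  # advance the marker only after success
            f.write(name)

    os.remove(MARKER)                 # clean run: the next load starts from DS1

if __name__ == "__main__":
    main()
```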
Harish
We run this in a sequence with checkpointing enabled.
Job 1 runs only once per load, so the data extraction from the source happens only once.
Job 2 then runs in a loop.
For example, if DS1, DS2, DS3 and DS4 were created, and DS1 and DS2 were loaded but the job fails during the DS3 load, we only have to reload DS3 and then DS4.
DS3 has to be reloaded from the beginning in any case.
We save the reloading time for DS1 and DS2.
This approach is worth considering only when the data volume is very large.
Harish
Is it always going to insert?
Do you have a surrogate key or an increasing sequence number in the table? If so, get the maximum value before the load and put a filter between the dataset and the connector with the condition key > max (sketched below).
Also, which database are you loading, and is the table partitioned?
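A hedged sketch of the max-key idea, using sqlite3 purely as a stand-in for the real database connector; the table and column names (target_tbl, id) are illustrative assumptions:

```python
import sqlite3

def max_loaded_key(conn) -> int:
    # COALESCE covers the empty-table case on the very first load.
    row = conn.execute("SELECT COALESCE(MAX(id), 0) FROM target_tbl").fetchone()
    return row[0]

def rows_still_to_load(rows, watermark: int):
    # The equivalent of a Filter/Transformer constraint: key > max
    return (r for r in rows if r["id"] > watermark)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target_tbl (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO target_tbl VALUES (?, ?)", [(1, "a"), (2, "b")])

    incoming = [{"id": 1, "payload": "a"},
                {"id": 2, "payload": "b"},
                {"id": 3, "payload": "c"}]
    wm = max_loaded_key(conn)                   # 2: rows 1 and 2 are skipped
    for r in rows_still_to_load(incoming, wm):
        conn.execute("INSERT INTO target_tbl VALUES (?, ?)",
                     (r["id"], r["payload"]))
    conn.commit()
```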
Thanks,
Prasanna
harishkumar.upadrasta wrote: "If for example DS1, DS2, DS3 and DS4 were created and if DS1 and DS2 were loaded and the job fails during the DS3 load, then we have to reload only DS3 and then DS4."
Yes, but you still need a mechanism to find out which records from DS3 were already loaded to the database and which were not. So you either have to query the target and use change capture, or you have to "remember" the last sort-key value loaded successfully to the database and load only values higher than that (see the sketch below).
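A rough sketch of the "remember the last sort-key value" variant, assuming the input is sorted ascending on a hypothetical key column, batches are committed in key order, and a small state file survives between runs; none of these names come from the thread:

```python
import os

STATE = "/tmp/high_water.txt"   # hypothetical path; survives between runs

def read_high_water() -> int:
    if os.path.exists(STATE):
        with open(STATE) as f:
            return int(f.read().strip())
    return 0                    # first run: nothing loaded yet

def write_high_water(key: int) -> None:
    with open(STATE, "w") as f:
        f.write(str(key))

def load_sorted(rows, commit_batch, batch_size: int = 1000) -> None:
    """rows must be sorted ascending on 'key'; commit_batch(batch) stands in
    for the real insert-and-commit against the target."""
    high_water = read_high_water()
    batch = []
    for row in rows:
        if row["key"] <= high_water:   # committed on an earlier run: skip
            continue
        batch.append(row)
        if len(batch) == batch_size:
            commit_batch(batch)
            write_high_water(batch[-1]["key"])  # advance only after the commit
            batch = []
    if batch:
        commit_batch(batch)
        write_high_water(batch[-1]["key"])
```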
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
You don't mention the volume of data, and that plays a part in designing the solution.
You could always create a dataset that contains ONLY the KEYS of the data being inserted.
When the job loads, use those keys as a lookup against the table: if the key exists, ignore the row; if it does not, perform an insert (a sketch follows).
Of course, I would much rather examine why your job fails often enough that you need to build this kind of logic into it.
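A short sketch of the keys-only lookup approach; the fetch of existing keys and the insert itself are placeholders, since in DataStage this would be a Lookup stage routing unmatched rows to the insert link:

```python
from typing import Dict, Iterable, Iterator, Set

def filter_new_rows(rows: Iterable[Dict], existing: Set) -> Iterator[Dict]:
    for row in rows:
        if row["key"] in existing:
            continue           # key exists in the target: ignore the row
        yield row              # key missing: candidate for insert

if __name__ == "__main__":
    existing_keys = {1, 2}     # stand-in for the keys-only dataset/lookup
    incoming = [{"key": 1}, {"key": 2}, {"key": 3}]
    for row in filter_new_rows(incoming, existing_keys):
        print("insert", row)   # only {'key': 3} reaches the insert
```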