Insert records from where it last failed
Is it possible for a parallel job to resume inserting records into a database from the point where it last failed, without looking up the target table or re-updating rows that have already been loaded?
What if the job aborted after inserting and committing 900,000 records, leaving 100,000 rows still to be inserted?
Does "Running from point of failure" achieve this?
This question has been raised many times, but never with a satisfactory answer.
Thanks,
DSX User.
Hi Ray,
Would this be a possible solution for eliminating re-insertion / re-updating when a job loading a huge volume fails?
Job 1:
1. Partition the source data into separate datasets using round-robin partitioning, preferably one per available node so the split is even.
Job 2:
This job runs in a loop, reading the datasets one after another and loading the data into the target. The dataset name is written to a temporary file on every iteration. If the job fails, we resume only from the last failed file, so data that has already been loaded is never re-updated (see the sketch after this post).
Could you suggest whether this will work?
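Not the poster's actual implementation, just a minimal sketch of how the Job 2 resume loop could be driven from a wrapper script. The dataset names DS1..DS4, the marker path /tmp/last_loaded.txt, and the load_dataset() placeholder (which would really invoke the load job, e.g. through dsjob) are all assumptions for illustration:

```python
import os

DATASETS = ["DS1", "DS2", "DS3", "DS4"]   # produced once by Job 1
MARKER = "/tmp/last_loaded.txt"           # remembers the last dataset loaded OK

def load_dataset(name: str) -> None:
    """Placeholder for the real load, e.g. 'dsjob -run' with the dataset
    name passed to Job 2 as a job parameter."""
    print(f"loading {name} ...")

def main() -> None:
    # On a rerun, read the marker so datasets already committed are skipped.
    done = ""
    if os.path.exists(MARKER):
        with open(MARKER) as f:
            done = f.read().strip()
    start = DATASETS.index(done) + 1 if done in DATASETS else 0

    for name in DATASETS[start:]:
        load_dataset(name)            # an abort here leaves the marker untouched
        with open(MARKER, "w") as f:  # advance the marker only after success
            f.write(name)

    os.remove(MARKER)                 # clean run: the next load starts from DS1

if __name__ == "__main__":
    main()
```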
Harish
We run this in a sequence with checkpointing enabled.
Job 1 runs only once per load, so the data extraction from the source happens only once.
Job 2 then runs in a loop.
For example, if DS1, DS2, DS3 and DS4 were created, and DS1 and DS2 were loaded but the job fails during the DS3 load, we only have to reload DS3 and then DS4.
DS3 has to be reloaded from the beginning in any case.
We save the reloading time for DS1 and DS2.
This approach is worth considering only when the data volume is very large.
Harish
Is it always going to insert?
Do you have a surrogate key or an increasing sequence number in the table? If so, get the maximum value before the load and put a filter between the dataset and the connector with the condition key > max (sketched below).
Also, which database are you loading, and is the table partitioned?
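A hedged sketch of the max-key idea, using sqlite3 purely as a stand-in for the real database connector; the table and column names (target_tbl, id) are illustrative assumptions:

```python
import sqlite3

def max_loaded_key(conn) -> int:
    # COALESCE covers the empty-table case on the very first load.
    row = conn.execute("SELECT COALESCE(MAX(id), 0) FROM target_tbl").fetchone()
    return row[0]

def rows_still_to_load(rows, watermark: int):
    # The equivalent of a Filter/Transformer constraint: key > max
    return (r for r in rows if r["id"] > watermark)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target_tbl (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO target_tbl VALUES (?, ?)", [(1, "a"), (2, "b")])

    incoming = [{"id": 1, "payload": "a"},
                {"id": 2, "payload": "b"},
                {"id": 3, "payload": "c"}]
    wm = max_loaded_key(conn)                   # 2: rows 1 and 2 are skipped
    for r in rows_still_to_load(incoming, wm):
        conn.execute("INSERT INTO target_tbl VALUES (?, ?)",
                     (r["id"], r["payload"]))
    conn.commit()
```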
Thanks,
Prasanna
harishkumar.upadrasta wrote: "If for example DS1, DS2, DS3 and DS4 were created and if DS1 and DS2 were loaded and the job fails during the DS3 load, then we have to reload only DS3 and then DS4."
Yes, but you still need a mechanism to find out which records from DS3 were already loaded to the database and which were not. So you either have to query the target and use change capture, or you have to "remember" the last sort-key value loaded successfully to the database and load only values higher than that (see the sketch below).
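A rough sketch of the "remember the last sort-key value" variant, assuming the input is sorted ascending on a hypothetical key column, batches are committed in key order, and a small state file survives between runs; none of these names come from the thread:

```python
import os

STATE = "/tmp/high_water.txt"   # hypothetical path; survives between runs

def read_high_water() -> int:
    if os.path.exists(STATE):
        with open(STATE) as f:
            return int(f.read().strip())
    return 0                    # first run: nothing loaded yet

def write_high_water(key: int) -> None:
    with open(STATE, "w") as f:
        f.write(str(key))

def load_sorted(rows, commit_batch, batch_size: int = 1000) -> None:
    """rows must be sorted ascending on 'key'; commit_batch(batch) stands in
    for the real insert-and-commit against the target."""
    high_water = read_high_water()
    batch = []
    for row in rows:
        if row["key"] <= high_water:   # committed on an earlier run: skip
            continue
        batch.append(row)
        if len(batch) == batch_size:
            commit_batch(batch)
            write_high_water(batch[-1]["key"])  # advance only after the commit
            batch = []
    if batch:
        commit_batch(batch)
        write_high_water(batch[-1]["key"])
```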
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
You don't mention the volume of data, and that plays a part in designing the solution.
You could always create a dataset that contains ONLY the KEYS of the data being inserted.
When the job loads, use those keys as a lookup against the table: if the key exists, ignore the row; if it does not, perform an insert (a sketch follows).
Of course, I would much rather examine why your job fails often enough that you need to build this kind of logic into it.
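A short sketch of the keys-only lookup approach; the fetch of existing keys and the insert itself are placeholders, since in DataStage this would be a Lookup stage routing unmatched rows to the insert link:

```python
from typing import Dict, Iterable, Iterator, Set

def filter_new_rows(rows: Iterable[Dict], existing: Set) -> Iterator[Dict]:
    for row in rows:
        if row["key"] in existing:
            continue           # key exists in the target: ignore the row
        yield row              # key missing: candidate for insert

if __name__ == "__main__":
    existing_keys = {1, 2}     # stand-in for the keys-only dataset/lookup
    incoming = [{"key": 1}, {"key": 2}, {"key": 3}]
    for row in filter_new_rows(incoming, existing_keys):
        print("insert", row)   # only {'key': 3} reaches the insert
```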