want to know the count of last record processed

sumitsaraswat
Participant
Posts: 5
Joined: Mon Jan 24, 2005 11:53 pm

want to know the count of last record processed

Post by sumitsaraswat »

Hi,
We want to incorporate restartability into our job, i.e. the job should resume loading from where it left off.

For example, I am using a Transformer stage to do a lookup and some transformations and then load into an OCI stage. If the database goes down for some reason, I want the Transformer to continue from where it left off; the lookup takes a long time and I don't want to start from the beginning.
I also want to capture the number of records processed up to the point the database went down (say a count like "250 records processed"). I will store this information in my audit table.

Any ideas on how to do this?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

The problem with your request is the concept of a COMMIT setting. If you're committing every 10K rows, blowing up on row 19876 rolls you back to row 10000. So you are either going to take a performance penalty by committing every row, or complicate your restart.
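As a rough illustration of that arithmetic (a quick DataStage BASIC sketch, using just the numbers above), the number of safely committed rows and the restart point work out as:

      CommitSize    = 10000
      FailedRow     = 19876
      RowsCommitted = INT(FailedRow / CommitSize) * CommitSize   ;* = 10000
      RestartFrom   = RowsCommitted + 1                          ;* = 10001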

But furthering this idea: if you have an open SELECT from a source database feeding into a target table and it blows up, how are you going to restart and ignore the rows already processed? Row count only matters if your data is always ORDER BY'd and you can guarantee you get the exact same rows in the exact same order; only then can you skip ahead. Even then, you'll incur re-SELECTing all the same data and throwing it away as it hits your first DS Transformer.

I can keep going on about this. The idea is to SELECT and capture all source data to a staging file, such as a text file. This preserves your source data set and allows any restart to begin from here. Then, transform your data and build a load ready set of INSERT-only and UPDATE-only load files. Now, your INSERTS are inserted and can be restarted however needed, using INSERT else UPDATE SQL. Since your data will always be inserts it will take this route first, and only update in the event of a restart. You can put skip logic here to ignore rows prior to a certain rowcount if necessary. Your UPDATES can be handled the same way.
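A minimal sketch of that skip logic, assuming a job parameter (here called RESTART_FROM_ROW, a made-up name) that holds the last safely loaded row number and is 0 on a normal run, would be a Transformer output-link constraint like:

      @INROWNUM > RESTART_FROM_ROW

On a clean run every row passes; on a restart only the rows after the recorded point reach the load files.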
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
sumitsaraswat
Participant
Posts: 5
Joined: Mon Jan 24, 2005 11:53 pm

Post by sumitsaraswat »

"You can put skip logic here to ignore rows prior to a certain rowcount if necessary."
What would that skip logic actually be?
There are two cases:
1. In the first case I am transforming and keying the records and putting them into a staging table.
2. In the second case I am checking for duplicates and putting only the unique records into the target table.

I need the skip logic for both cases, as I am doing a lookup in both, and it is not efficient to repeat that lookup for already-processed records just because the database went down in between.

I want the process to start from where it left off.

Even if I commit, say, after every 1,000 records and the database goes down after 1,100 records have been processed, my first 1,000 records are safe; only the 100 after that are not loaded because the database went down.
My question is: how will my process know which record to start processing from (in this case the 1,001st record)?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

You are going to write either every row number of your input stream (a BAD IDEA unless your data is explicitly ordered, per my previous comments) or just the last row number to a secure location, such as a hash file. In the event of failure this hash file will contain either all row numbers written up to the failure, or just the last one. On restart you can reference this hash file and either throw away rows found in it, matching on row number, or throw away all rows whose row number is less than the one in the hash file.
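A rough sketch of the "last rownumber" variant, written as the body of a DataStage BASIC transform routine with two arguments, HashFilePath and RowNum (the routine, its arguments and the record key are only illustrative, not a standard API):

      COMMON /RestartRow/ RestartF, Initialised
      If Unassigned(Initialised) Then Initialised = @FALSE
      If Not(Initialised) Then
         OPENPATH HashFilePath TO RestartF ELSE
            Call DSLogFatal("Cannot open ":HashFilePath, "SaveLastRow")
         END
         Initialised = @TRUE
      End
      WRITE RowNum TO RestartF, "LAST_ROW"   ;* keep only the most recent row number
      Ans = RowNum                           ;* pass the row number straight through

Called from an output column derivation with @INROWNUM as the second argument, it records the last row number sent down the link. Bear in mind it records rows that left the Transformer, not rows the database actually committed, which is the commit-point problem described earlier.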

I tried to illuminate the problems with your approach in general, as I feel it is not a good choice. By following the approach I outlined, you allow yourself to NOT CARE about the row number if all you are doing is loading a file of pure inserts. This means you can use the tools built into a bulk loader to manage a restarted load. You really should read Ralph Kimball's ETL Toolkit book; he covers this concept well in a chapter about landing data during the ETL process to facilitate restarts.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
talk2shaanc
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Restart Logic

Post by talk2shaanc »

If I have understood the requirement properly, take a case where 1 million records are coming from a source, say a flat file. While loading into a table the load breaks at, say, record 25,000, with a commit point set every 20K records. The next time the job is restarted, it should start from record 20K+1. I think he is not concerned whether the target is a simple insert or an insert/update; he just wants to start reading the file from record 20K+1 and do the processing.
Approach 1: This can be done very easily using SQL*Loader (the bulk load stage in DS). Things to be done:
a. Every time you invoke sqlldr, create a physical log file with a timestamp and some unique ID indicating that it belongs to today's load. The sqlldr log contains load statistics such as "records read", "records loaded" and "records rejected".
b. Before starting sqlldr, go through that log and calculate "records processed" = "records loaded" + "records rejected". Mind you, do not take "records read", or you will end up with a value of 25K.
c. When starting sqlldr, set the value of the SKIP option to "records processed" (a rough sketch follows below).
sqlldr will then read records from 20K+1 onwards.
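A rough sketch of step c in DataStage BASIC, for the case where you drive sqlldr yourself (for example from a before-job or job-control routine) rather than through the bulk load stage. It assumes the loaded and rejected counts have already been pulled out of the previous log, and DbUser, DbPassword and the file names are placeholders:

      SkipCount = RowsLoaded + RowsRejected            ;* "records processed"
      Cmd  = "sqlldr userid=":DbUser:"/":DbPassword
      Cmd := " control=load_target.ctl log=load_target.log"
      Cmd := " skip=":SkipCount                        ;* resume after the rows already handled
      Call DSExecute("UNIX", Cmd, CmdOutput, RetCode)
      If RetCode <> 0 Then
         Call DSLogWarn("sqlldr returned ":RetCode, "RestartBulkLoad")
      End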
Approach 2:
Step 1: Move the source file into a hash file, sorted, in a pre-process (say a separate job). Assign a proper key to identify each record uniquely.
Step 2: Read from the hash file and load into your target. At the same time, keep deleting the records already loaded into the target from the same hash file.
Step 3: Add both jobs to a job sequence with checkpoint restart.
On the first run, say the 1 million records are loaded into the hash file by the first job. The second job in the sequence then starts and, say, fails after 20K. The hash file will now hold 1 million minus 20K records and the target will have 20K. When you restart, the first job will not be invoked; the second job is invoked directly and sees only the 1 million minus 20K remaining records in the hash file.

Approach 3:
a. Write an after/before-job routine which reads the statistics of each link (input or output) in your job and stores them in a hash file (a sketch follows after this list).
b. Call the routine in the same job you want to restart.
c. In the Transformer stage, read the source file and pass on only those records whose @INROWNUM is greater than the "input link count" read back from the file we created (the first time it should be zero, so a little extra logic is required there).
Points to note: every run should insert a new record for each link.
When reading the file in the Transformer stage, an aggregate (sum) should be used to get the total records processed through a link.
Drawback: you cannot avoid reading all the records from the beginning. What you are doing here is reading every record, but passing on to the next stage only those that have not yet been processed.
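A rough sketch of part (a) as an after-job subroutine (standard arguments InputArg and ErrorCode; the stage, link and hash file names are placeholders, and this simplified version keeps a single record per job rather than one per run, so no summing is needed when it is read back):

$INCLUDE DSINCLUDE JOBCONTROL.H
      JobName   = DSGetJobInfo(DSJ.ME, DSJ.JOBNAME)
      LinkCount = DSGetLinkInfo(DSJ.ME, "TfmLoad", "ToTarget", DSJ.LINKROWCOUNT)
      OPENPATH "/data/restart/LinkCounts" TO AuditF ELSE
         Call DSLogWarn("Cannot open restart hash file", "SaveLinkCount")
         ErrorCode = 0
         RETURN
      END
      WRITE LinkCount TO AuditF, JobName    ;* overwritten on every run
      ErrorCode = 0                         ;* 0 = let the job finish normally

If the job aborts before the after-job routine runs, the same DSGetLinkInfo call can instead be made from a controlling job or sequence after attaching to the job with DSAttachJob.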

A restart option always involves a bit of a trade-off in performance, so you need to find an optimum commit point, and you should test all three approaches to find the right one for your case.
I hope the logic described above works for you.