Job Performance

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

shrey3a
Premium Member
Posts: 234
Joined: Sun Nov 21, 2004 10:41 pm

Job Performance

Post by shrey3a »

Hi,

I have a job which processes 9 million rows and takes 12 hours to complete.

Whenever the job fails for any reason we have to restart it from the beginning. I am staging the data, but that is of no use if the job breaks after, say, 5 million rows; there is no option but to start the process from the beginning.

I'm looking for an option/design where I can restart the job from the point where it broke, i.e. if it broke at 5 million rows I should be able to resume processing from the last record it processed.

Thanks in Advance!
Regards,
Munish
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Use multiple job instances and divide the source data into N groups. Use N instances to process the data in roughly 1/N of the time, where ideally N is the number of CPUs available. Replace all database reference lookups with hashed-file lookups. If a single instance fails, re-run just that instance.
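The splitting idea above can be sketched as follows. This is an illustrative Python sketch, not DataStage BASIC; in a Server job you would typically achieve the same thing with a Transformer constraint such as MOD(@INROWNUM, N) = instance number on each job instance.

```python
def partition_rows(rows, n):
    """Deal rows round-robin into n groups, one group per job instance."""
    groups = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        groups[i % n].append(row)   # analogous to MOD(@INROWNUM, N)
    return groups

rows = [f"row{i}" for i in range(10)]
groups = partition_rows(rows, 4)
# each of the 4 "instances" now handles roughly a quarter of the rows
```

If one group's load fails, only that group's rows need to be re-run, which is the restartability benefit Ken describes.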
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It's conceptually easy to design - you have to keep track of how far you have successfully processed. You also have to design so as to be able to process beginning from a given point.

Quite a few techniques exist. It's easiest if your overall design includes staging areas after extraction and before loading. That way, you can do the transformation part just once. And I recommend sequential files wherever possible for staging; they are the fastest.
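The "keep track of how far you have successfully processed" idea can be sketched like this (an illustrative Python sketch, assuming a hypothetical checkpoint file name; the same pattern can be built in a Server job with a hashed file or a small control file):

```python
import os

CHECKPOINT = "job_checkpoint.txt"  # hypothetical checkpoint file

def read_checkpoint():
    """Return the number of rows already processed by a previous run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip() or 0)
    return 0

def process(rows, batch=2):
    done = read_checkpoint()
    for i, row in enumerate(rows):
        if i < done:
            continue                  # skip rows a previous run completed
        # ... transform/load the row here ...
        if (i + 1) % batch == 0:      # persist progress periodically
            with open(CHECKPOINT, "w") as f:
                f.write(str(i + 1))
```

On restart, the job skips straight past the checkpointed rows instead of reprocessing from row one.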
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xli
Charter Member
Posts: 74
Joined: Fri May 09, 2003 12:31 am

Post by xli »

If a single job takes 12 hours to process 9 million rows, I think you should do something to improve its performance first.

Since you have stored your data in a staging area, I guess your job has many complex lookups. Try to reduce the number of lookups, and use hashed files for the lookups you keep.
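The idea behind a hashed-file lookup can be illustrated in Python (names here are invented for the example): load the reference data once into an in-memory key/value structure, then resolve each row locally instead of issuing a database query per row.

```python
# Stand-in for a one-off extract of the reference table
reference_rows = [("C1", "Gold"), ("C2", "Silver")]

# A dict plays the role of the hashed file: keyed, in-memory access
lookup = dict(reference_rows)

def enrich(row):
    """Attach the looked-up value without a round trip to the database."""
    cust, amount = row
    return (cust, amount, lookup.get(cust, "UNKNOWN"))
```

One load of the reference table replaces millions of per-row database calls, which is where most of the 12 hours usually goes.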

xli
talk2shaanc
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

Well, I have done a bit of R&D on this, and the fastest way I found was the following. The logic is for a fact load, but it can be modified and adopted for dimensions.

In JobA I read from a sequential file; in the Transformer stage I do error checking, data validation and data massaging, and then, instead of loading into the Oracle table, I write to a staging area (a sequential file):

seq ---> transformer ---> staging seq file

In JobB, after completion of JobA, I use sqlldr with DIRECT=TRUE to load the staging sequential file into the Oracle table.

In the routine that invokes sqlldr, I added some more logic:
1. I always create the log and bad files with a timestamp, and before running JobA I clear these log files.
2. Before running sqlldr, the routine loops through the log files, reads records_loaded + records_rejected for the same file, and uses the total in the SKIP option.

This way I have control over restarting the job midway.

I tested the logic on my laptop with 720,000 records; the total load time was 20 minutes: approximately 17 minutes for JobA and 2.40 minutes for JobB.
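The SKIP calculation described in step 2 can be sketched in Python as follows (hedged: the log line formats below follow typical SQL*Loader output, but may vary by Oracle version):

```python
import re

def compute_skip(log_text):
    """Sum loaded + rejected + discarded rows from a sqlldr log.

    The total becomes the SKIP= value for the rerun, so sqlldr
    resumes after the last record the failed run dealt with.
    """
    total = 0
    patterns = [
        r"(\d+)\s+Rows successfully loaded\.",
        r"Total logical records rejected:\s+(\d+)",
        r"Total logical records discarded:\s+(\d+)",
    ]
    for pat in patterns:
        m = re.search(pat, log_text)
        if m:
            total += int(m.group(1))
    return total

log = """\
5000000 Rows successfully loaded.
Total logical records rejected:   12
Total logical records discarded:   3
"""
# compute_skip(log) -> 5000015
```

The BASIC routine in the next post implements the same scan with Oconv masks instead of regular expressions.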
Shantanu Choudhary
talk2shaanc
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

In the Job Sequence:

JobA ----> JobB (actually a routine). Here is the routine if you need it.


RestartabilitySqlldr(ORACLE_HOME,UserName,DataBaseName,UserPassword,ControlFilePath,ControlFileName,DataFilePath,DataFileName,LogFilePath,LogFileName,BadFilePath,BadFileName,Parallel,SkipUnusableIndexes,SkipIndexMaintenence,DirectLoad,CommitPoint,ErrorLimit) <---parameter passed to routine
-----------------------------------------------------------------------------------
RoutineName = 'RestartabilitySqlldr'

deffun ExecSqlldr(A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19) CALLING "DSU.ExecSqlldr"

Ans = ''
Skip = 0
Call DSLogInfo('Starting Routine ':RoutineName, TimeDate())
VarTotalRead = 0
Var = 0
i = 1

Call DSLogInfo(RoutineName:' - Reading Through The Log File To Get The SKIP Value: ', TimeDate())
Command = 'dir /b ':LogFilePath:LogFileName:'*.log'
Call DSExecute("DOS", Command, Output, SystemReturnCode)

If Trim(Output) = 'File Not Found' Then
   Skip = 0
End Else
   Loop
      If Trim(Output<i>) <> '' Then
         OpenSeq LogFilePath:Output<i> To FileVar Then
            * Initialise the per-file counters
            RowsLoaded = 0
            RowsRead = 0
            RowsRejected = 0
            RowsDiscarded = 0
            Loop
               ReadSeq OneLine From FileVar Then
                  * MC/N strips the digits to identify the line; MCN extracts the count
                  If Trim(Oconv(OneLine,'MC/N')) = 'Rows successfully loaded.' Then
                     RowsLoaded = Oconv(OneLine,'MCN')
                  End
                  If Trim(Oconv(OneLine,'MC/N')) = 'Total logical records rejected:' Then
                     RowsRejected = Oconv(OneLine,'MCN')
                  End
                  If Trim(Oconv(OneLine,'MC/N')) = 'Total logical records discarded:' Then
                     RowsDiscarded = Oconv(OneLine,'MCN')
                  End
               End Else
                  * End of this log file: accumulate its counts into the running total
                  VarTotalRead = Var + RowsDiscarded + RowsRejected + RowsLoaded
                  Var = VarTotalRead
                  Goto ExitPoint
               End
            Repeat
         End Else
            Call DSLogFatal(RoutineName:' - Unable To Open The Log File', TimeDate())
            Ans = -1
         End
ExitPoint:
         CloseSeq FileVar
      End Else
         Goto FinalExitPoint
      End
      i = i + 1
   Repeat
FinalExitPoint:
   Skip = VarTotalRead
End

Call DSLogInfo(RoutineName:' - ':Skip:' Records Will Be Skipped', TimeDate())
TimeStamp = Oconv(TimeDate()[' ',1,1]:Oconv(Iconv(TimeDate()[' ',2,1],"D"),"DMDY[2,2,4]"),"MCN")

Call DSLogInfo('Calling Routine ExecSqlldr - ':TimeDate(), RoutineName)
Output = ExecSqlldr(ORACLE_HOME,UserName,DataBaseName,UserPassword,ControlFilePath,ControlFileName,DataFilePath,DataFileName,LogFilePath,LogFileName:'_':TimeStamp,BadFilePath,BadFileName:'_':TimeStamp,Parallel,SkipUnusableIndexes,SkipIndexMaintenence,DirectLoad,CommitPoint,ErrorLimit,Skip)
Call DSLogInfo('Finished Routine ExecSqlldr - ':TimeDate(), RoutineName)

If Output = 0 Then
   Call DSLogInfo('Completed ':RoutineName:' Successfully - ', TimeDate())
End Else
   Call DSLogFatal(RoutineName:' Generated Error. Return Code: ':Output, TimeDate())
End

-------------------------------------------------------------------------------
The routine above uses another routine, "ExecSqlldr"; below is the code for it.

ExecSqlldr(ORACLE_HOME,UserName,DataBaseName,UserPassword,ControlFilePath,ControlFileName,DataFilePath,DataFileName,LogFilePath,LogFileName,BadFilePath,BadFileName,Parallel,SkipUnusableIndexes,SkipIndexMaintenence,DirectLoad,CommitPoint,ErrorLimit,Skip)
-----------------------------------------------------------------------



RoutineName = 'ExecSqlldr'

* Build the full sqlldr command line from the routine parameters
Command = ORACLE_HOME:'/bin/sqlldr USERID=':UserName:'/':UserPassword:'@':DataBaseName:' CONTROL=':ControlFilePath:ControlFileName:' DATA=':DataFilePath:DataFileName:' LOG=':LogFilePath:LogFileName:'.log':' BAD=':BadFilePath:BadFileName:'.bad':' PARALLEL=':Parallel:' SKIP_UNUSABLE_INDEXES=':SkipUnusableIndexes:' SKIP_INDEX_MAINTENANCE=':SkipIndexMaintenence:' DIRECT=':DirectLoad:' ROWS=':CommitPoint:' ERRORS=':ErrorLimit:' SKIP=':Skip:' SILENT=HEADER,FEEDBACK'

Call DSLogInfo(RoutineName:' - Starting Sqlldr ':TimeDate(), RoutineName)
Call DSExecute("DOS", Command, Output, SystemReturnCode)

If SystemReturnCode = 0 Then
   Ans = 'Sqlldr Successfully Completed'
End Else
   Ans = SystemReturnCode
   Call DSLogInfo('Sqlldr Did Not Complete Successfully, Return Code = ':SystemReturnCode, RoutineName)
   Call DSLogFatal('ERROR: ':Output, RoutineName)
End
Shantanu Choudhary
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

A small correction to the first code... I don't know how that happened, but in some places there should be a reference to the index, "Output<i>"; after pasting it in the thread I can see only Output. :(

Replace::
if Trim(Output) <> '' Then
OpenSeq LogFilePath:Output To FileVar Then

With:
if Trim(Output<i>) <> '' Then
OpenSeq LogFilePath:Output<i> To FileVar Then
Shantanu Choudhary
talk2shaanc
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

It's lost/vanished again :evil: ... it's testing my patience...

Code:

Replace::
if Trim(Output) <> '' Then
               OpenSeq LogFilePath:Output To FileVar Then

With:
if Trim(Output<i>) <> '' Then
               OpenSeq LogFilePath:Output<i> To FileVar Then
Shantanu Choudhary
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Some tips -
1.) Use intermediate sequential files
2.) Use Unix named pipes in sequential file stage
3.) Use Direct path and disable indexes
4.) Use IPC stage
5.) Use sqlldr or API with commit size
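Tips 3 and 5 can be sketched together: a direct-path sqlldr command built from parameters, with ROWS= as the commit size. This is an illustrative Python sketch; the function name and file names are hypothetical, and it only assembles the same kind of command line the ExecSqlldr routine above builds in BASIC.

```python
def build_sqlldr_cmd(user, pwd, db, ctl, data, log, bad,
                     direct="TRUE", rows=50000, errors=100):
    """Assemble a direct-path SQL*Loader command with a commit size."""
    return (f"sqlldr USERID={user}/{pwd}@{db} CONTROL={ctl} DATA={data} "
            f"LOG={log} BAD={bad} DIRECT={direct} ROWS={rows} ERRORS={errors}")

cmd = build_sqlldr_cmd("scott", "tiger", "orcl", "fact.ctl",
                       "stg_fact.dat", "fact.log", "fact.bad")
```

With DIRECT=TRUE and indexes disabled, the load bypasses conventional-path SQL processing, which is why it is so much faster than row-by-row inserts from the job.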