We have a simple DataStage job that reads from a file and writes to a file. It normally runs in about 4-5 minutes, but in recent weeks it has quite often taken more than 45 minutes.
Starting Job name ..
main_program: orchgeneral: loaded
main_program: orchsort: loaded
main_program: orchstats: loaded
06/14/2016 7:00:32 am main_program: APT configuration file: /appl/infoserver/Server/Configurations/2node_Admin.apt
06/14/2016 7:45:32 am main_program: This step has 8 datasets: It runs 10 processes on 2 nodes...
xfm_data,1: APT_PMPlayer: new Player running, PID = 18,153,982, spawned from Section Leader, PID = 22,413,500
We involved the DataStage admins and Unix admins. After a lot of email threads, the conclusion was that it is a disk I/O issue: the job takes a long time to initiate or start up. How can we optimize this, and how can we avoid it in the future?
What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?
Mike wrote:What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?
The files being read and written are about 20 MB each. No Before-job subroutine.
The delay timings are consistent, and CPU and memory usage are normal. Out of 28 disks, 5 are exceptionally highly utilized (85-95%) during a certain period.
Please share your thoughts on diagnosing this issue.
Thank you.
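As a quick way to confirm which disks are hot during the slow window, something like the following could be run on the server. This is only a sketch: iostat column layout differs between AIX and Linux, so the column position of the busy percentage is an assumption you may need to adjust.

```shell
# Hedged sketch: print disks whose busy percentage is 85% or more.
# Assumes an iostat format where the device name is column 1 and the
# busy figure (%tm_act on AIX) is column 2 -- adjust for your platform.
iostat -d 5 3 | awk '$2 + 0 >= 85 { print $1, $2 "% busy" }'
```

Running this repeatedly across the delay window would show whether the same 5 disks are always the hot ones.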
With your design using the lookup stage, I would check if you have enough physical memory to support the reference data being preloaded to memory (assuming you are doing a normal lookup). If you exhaust physical memory, using swap space will generate a lot of disk IO and slow down your throughput significantly.
The other thing to realize... physical memory is a shared resource among all of the processes running on your server. You need to monitor resources while your job is in the midst of one of its slow runs.
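One way to check the swap theory is to capture paging statistics across the slow window and scan them afterwards. A sketch under assumptions: the log filename is illustrative, and the awk column positions assume the Linux vmstat layout where si/so are columns 7 and 8 (on AIX, the pi/po columns sit elsewhere).

```shell
# Capture a sample every 10 seconds for an hour around the morning run.
vmstat 10 360 > /tmp/vmstat_morning.log

# Afterwards, count samples that showed any swap-in/swap-out activity.
# Sustained nonzero values here mean physical memory was exhausted.
awk 'NR > 2 && ($7 + 0 > 0 || $8 + 0 > 0) { n++ }
     END { print n + 0, "samples with swap activity" }' /tmp/vmstat_morning.log
```

If the count is consistently zero during the slow runs, swap is probably not the culprit and attention can shift back to the disks themselves.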
I have graphs for CPU/memory (normal usage), and they show high disk usage on some disks during the delayed runs. Is there a built-in tool in the IBM Information Server suite for monitoring resources? We have involved IBM support, but I would like to analyse this on my end as well.
The fork lookup is a potential problem with buffering and deadlocks. All of the reference data for a normal lookup needs to be preloaded into memory. Replace the Lookup stage with a Join stage for the more typical fork-join design pattern.
Perhaps your Transformer stage can take on the transformation requirement that you thought was easier to do in the Lookup stage. It is the source for both the stream and reference links, so it would seem to have all of the necessary data.
If all 4 of your locations for scratch and resource are located on the same file system, you're going to get some high disk usage on that file system under heavy loads.
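To verify whether that is the case here, the resource and scratch paths can be pulled out of the APT configuration file named in the job log and checked against their mount points. This is a sketch: the grep/awk field positions assume the usual APT syntax of lines like `resource scratchdisk "/path" {}`.

```shell
# Extract every disk/scratchdisk path from the APT config, then show
# which filesystem each path is mounted on. If df reports the same
# mount point for all of them, the IO is concentrated on one filesystem.
grep -E 'resource (disk|scratchdisk)' /appl/infoserver/Server/Configurations/2node_Admin.apt |
  awk '{ print $3 }' | tr -d '"' | sort -u |
  while read -r p; do df -P "$p" | tail -1; done
```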
I'm still mostly old school when it comes to analyzing resource usage. I like watching an interactive nmon window along with having a Director monitor window open.
It sounds like your admins have already captured the relevant data for you.
The issue is with DataStage starting this job. The actual execution is fast, but the delay occurs right after the APT configuration file read event. The lookup operation has not even started at that point.
We have been facing this issue for a couple of months (a few times a month). No version upgrade; we are still on 8.7. The same job runs fine in production at other times with no issues. It is only at certain times (repeatedly during the morning runs) that we see this delay.
From your last description, I would guess that it is related to the load on your server during those morning runs.
One possible cause of a long Startup time is an overloaded server.
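A quick way to test the overload theory is to compare the run-queue length during the morning window against the CPU count. This sketch assumes sysstat's `sar -q` log format, where runq-sz is column 2; the threshold of one runnable process per CPU is just a rule of thumb, not a hard limit.

```shell
# Flag sar -q intervals where more processes were runnable than there
# are CPUs -- a rough sign the server was overloaded at that time.
NCPU=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
sar -q -s 06:45:00 -e 07:50:00 |
  awk -v n="$NCPU" 'NR > 3 && $2 + 0 > n { print $1, "runq=" $2 }'
```

If the flagged intervals line up with the 7:00-7:45 gap in the job log above, the "Startup time" is really queueing time behind other workloads.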
Also keep in mind that "Startup time" is a misleading label. Think of it as more of a bucket of time that was not able to be allocated elsewhere.
I once observed that Startup time included the time it took Netezza to update statistics after a table load. It takes a bit of time to analyze a table with over 1B rows. When I turned that off in the job, all of my extra Startup time disappeared (and the table statistics update actually ran after all of the job's other processing was complete... so it was definitely not startup time).