Disk IO issue - Longer Job run time
Posted: Wed Jun 22, 2016 5:22 pm
by Developer9
Hi,
We have a simple DataStage job that reads from a file and writes to a file. It normally runs in about 4-5 minutes, but quite a few times in recent weeks it has taken more than 45 minutes.
Director Log:
Code: Select all
Starting Job name ..
main_program: orchgeneral: loaded
main_program: orchsort: loaded
main_program: orchstats: loaded
06/14/2016 7:00:32 am main_program: APT configuration file: /appl/infoserver/Server/Configurations/2node_Admin.apt
06/14/2016 7:45:32 am main_program: This step has 8 datasets:It runs 10 processes on 2 nodes...
xfm_data,1: APT_PMPlayer: new Player running, PID = 18,153,982, spawned from Section Leader, PID = 22,413,500
We involved the DS admins and Unix admins (many email threads) and came to the conclusion that it is a disk IO issue: the job takes a long time to initiate/start up. How can we optimize this, and how can we avoid it in the future? Please advise.
Posted: Wed Jun 22, 2016 7:20 pm
by Mike
What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?
Mike
Posted: Thu Jun 23, 2016 8:13 am
by UCDI
Is it something simple like a nearly full disk? Passing 90% full, disks can do strange things.
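A quick way to spot filesystems past a threshold from the shell (a diagnostic sketch; the default of 90 mirrors the figure above, and the threshold can be passed as a script argument):

```shell
# Print any mounted filesystem above a utilization threshold (default 90%).
# df -P forces single-line POSIX output so awk sees consistent columns.
threshold=${1:-90}
df -P | awk -v t="$threshold" '
    NR > 1 { u = $5; sub(/%/, "", u); if (u + 0 > t) print $6 " is " u "% full" }'
```

Run it on the DataStage host around the time of a slow run; any hit on a resource or scratch filesystem is worth chasing.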
Posted: Thu Jun 23, 2016 3:35 pm
by PaulVL
sequential file or dataset?
Are the two nodes in the apt file on the same host or different ones?
Posted: Tue Jun 28, 2016 8:10 am
by Developer9
@PaulVL, here is the job design:
Code: Select all
Seq File -----> XFM -----------> LKP -----> Dataset (Output)
                 |                ^
                 |________________| (Reference link)
No complex transformations in the transformer (just some constraints on date fields) on the links to the LKP stage.
Both nodes are on the same host. Here is what we have in the config file:
Code: Select all
main_program: APT configuration file: Application/infoserver/Server/Configurations/2node_Projectname.apt
{
node "node1"
{
fastname "etlpra"
pools ""
resource disk "/DataStageProjects/Projectname/resource1" {pools ""}
resource disk "/DataStageProjects/Projectname/resource2" {pools ""}
resource scratchdisk "/DataStageProjects/Projectname/scratch1" {pools ""}
resource scratchdisk "/DataStageProjects/Projectname/scratch2" {pools ""}
}
node "node2"
{
fastname "etlpra"
pools ""
resource disk "/DataStageProjects/Projectname/resource2" {pools ""}
resource disk "/DataStageProjects/Projectname/resource1" {pools ""}
resource scratchdisk "/DataStageProjects/Projectname/scratch2" {pools ""}
resource scratchdisk "/DataStageProjects/Projectname/scratch1" {pools ""}
}
}
Posted: Tue Jun 28, 2016 8:27 am
by Developer9
Mike wrote:What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?
Mike
@Mike
Read/write size is about 20 MB each. No Before-job subroutine.
The delay timings are consistent. CPU and memory are normal. Out of 28 disks, 5 are exceptionally highly utilized (85-95%) during a certain period.
Please share your thoughts on diagnosing this issue.
Thank you
Posted: Tue Jun 28, 2016 8:27 am
by Mike
You didn't answer any of my questions...
With your design using the lookup stage, I would check if you have enough physical memory to support the reference data being preloaded to memory (assuming you are doing a normal lookup). If you exhaust physical memory, using swap space will generate a lot of disk IO and slow down your throughput significantly.
The other thing to realize... physical memory is a shared resource among all of the processes running on your server. You need to monitor resources while your job is in the midst of one of its slow runs.
Mike
Posted: Tue Jun 28, 2016 8:32 am
by PaulVL
TMPDIR is the variable used to define the location where that memory swap file will be written. If it is blank then /tmp is used.
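To make that location explicit, you can set TMPDIR in the environment (e.g. in dsenv or as a job-level environment variable). A sketch, reusing a scratch path from the config posted above; verify there is free space on it first:

```shell
# Redirect the temp/swap-file area away from /tmp.
# The path is the scratch directory from the APT config earlier in the
# thread -- an example, adjust to your own layout.
TMPDIR=/DataStageProjects/Projectname/scratch1
export TMPDIR
# With TMPDIR unset or empty, /tmp is used.
echo "TMPDIR=${TMPDIR:-/tmp}"
```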
Posted: Tue Jun 28, 2016 8:37 am
by Developer9
@Mike,
I have graphs for CPU/memory (normal usage) and high disk usage on some disks during the delayed runs. Is there a built-in tool in the IBM Information Server suite to monitor resources? We have involved IBM support, but I would like to analyze it on my end as well.
Thank you
Posted: Tue Jun 28, 2016 8:48 am
by Mike
I guess we were typing at the same time...
Your data volumes seem insignificant.
The fork lookup is a potential problem with buffering and deadlocks. All of the reference data for a normal lookup needs to be preloaded to memory. Replace the lookup stage with a join stage for the more typical fork join design pattern.
Perhaps your transformer can take on the transformation requirement that you think was easier for the lookup. It is the source for both stream and reference links, so it would seem it has all of the necessary data.
If all 4 of your locations for scratch and resource are located on the same file system, you're going to get some high disk usage on that file system under heavy loads.
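One way to check whether those four locations share a filesystem (a small helper sketch; `fs_of` is our own name, not a DataStage tool):

```shell
# fs_of: print the device (filesystem) backing a given path.
fs_of() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# On the DataStage host, run fs_of on each resource/scratch path from the
# APT config; identical devices mean the nodes contend for one filesystem:
#   fs_of /DataStageProjects/Projectname/resource1
#   fs_of /DataStageProjects/Projectname/scratch1
# Demonstrated here on /tmp, which exists everywhere:
fs_of /tmp
```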
Mike
Posted: Tue Jun 28, 2016 8:53 am
by Mike
I'm still mostly old school when it comes to analyzing resource usage. I like watching an interactive nmon window along with having a Director monitor window open.
It sounds like your admins have already captured the relevant data for you.
Mike
Posted: Tue Jun 28, 2016 10:19 am
by Developer9
@Mike,
The issue is DataStage starting this job. The actual execution is fast, but the delay comes right after the APT configuration file read event, before the lookup operation has even started.
Code: Select all
main_program: Startup time, 25:34; production run time, 0:01.
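For triage it can help to pull that pair of numbers out of the Director logs and spot the runs where startup dominates. A parsing sketch (the log-line format is copied from the entry above; `parse_startup` is our own helper name):

```shell
# Extract startup vs production time from a Director log line of the form:
#   main_program: Startup time, 25:34; production run time, 0:01.
parse_startup() {
    sed -n 's/.*Startup time, \([0-9:]*\); production run time, \([0-9:]*\).*/startup=\1 production=\2/p'
}

echo "main_program: Startup time, 25:34; production run time, 0:01." | parse_startup
# -> startup=25:34 production=0:01
```

Piped over a set of collected logs, this makes the "slow morning runs" pattern easy to see at a glance.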
Posted: Tue Jun 28, 2016 3:10 pm
by PaulVL
Is this a NEW slowness or has it always been around?
Did you recently update your version of DataStage?
Is this a first time execution of this job in this environment?
Posted: Tue Jun 28, 2016 3:27 pm
by Developer9
We have been facing this issue for a couple of months (quite a few times a month). No version upgrade; we are still on 8.7. The same job runs fine in production at other times with no issues. It is only at certain times (repeatedly during the morning runs) that we see the delay.
Posted: Tue Jun 28, 2016 5:12 pm
by Mike
From your last description, I would guess that it is related to the load on your server during those morning runs.
One possible cause of a long Startup time is an overloaded server.
Also keep in mind that "Startup time" is a misleading label. Think of it as more of a bucket of time that was not able to be allocated elsewhere.
I once observed that Startup time included the time it took Netezza to update statistics after a table load. It takes a bit of time to analyze a table with over 1B rows. When I turned that off in the job, all of my extra Startup time disappeared (and the statistics update actually ran after all of the job's other processing was complete... so definitely not Startup time).
Mike