Disk IO issue - Longer Job run time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Disk IO issue - Longer Job run time

Post by Developer9 »

Hi,

We have a simple DataStage job that reads from a file and writes to a file. It normally runs in about 4-5 minutes, but in recent weeks it has quite a few times taken more than 45 minutes.

Director Log:

Code:

Starting Job name ..
main_program: orchgeneral: loaded
main_program: orchsort: loaded
main_program: orchstats: loaded
06/14/2016 7:00:32 am main_program: APT configuration file: /appl/infoserver/Server/Configurations/2node_Admin.apt
06/14/2016 7:45:32 am main_program: This step has 8 datasets: It runs 10 processes on 2 nodes...
xfm_data,1: APT_PMPlayer: new Player running, PID = 18,153,982, spawned from Section Leader, PID = 22,413,500
We involved the DS admins and Unix admins (lots of email threads :) ) and came to the conclusion that it is a disk IO issue: the job takes a long time to initiate or start up. How do we optimize this? How do we avoid it in the future?

Please advise; any ideas are welcome.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?

Mike
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

Is it something simple like a nearly full disk? Once past 90% full, disks can do strange things.
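
A quick way to rule that out (a sketch; the path is an example, point it at whatever file systems back your project):

Code:

# Free space on the file system (Linux; on AIX use df -g)
df -h /DataStageProjects
# Inodes can fill up too, with similar symptoms (Linux)
df -i /DataStageProjects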
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Sequential file or dataset?

Are the two nodes in the apt file on the same host or different ones?
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

@PaulVL, here is the job design:

Code:

Seq File -----> XFM -----------> LKP -----> Dataset (Output)
                 |                ^
                 +----------------+  (reference link)
No complex transformations in the transformer (just some constraints on date fields) on the links to the LKP stage.

They are on the same host. Here is what we have in the config file:

Code:

main_program: APT configuration file: Application/infoserver/Server/Configurations/2node_Projectname.apt
{
        node "node1"
        {
                fastname "etlpra"
                pools ""
                resource disk "/DataStageProjects/Projectname/resource1" {pools ""}
                resource disk "/DataStageProjects/Projectname/resource2" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch1" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch2" {pools ""}
        }

        node "node2"
        {
                fastname "etlpra"
                pools ""
                resource disk "/DataStageProjects/Projectname/resource2" {pools ""}
                resource disk "/DataStageProjects/Projectname/resource1" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch2" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch1" {pools ""}
        }
}
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

Mike wrote:What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?

Mike
@Mike:

Code:

Read/write: ~20 MB each. No before-job subroutine.
Delay timings are consistent. CPU and memory are normal. Out of 28 disks, 5 are exceptionally highly utilized (85-95%) during a certain period.
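
For anyone who wants to reproduce the disk numbers, sampling along these lines works (Linux iostat shown; AIX uses different flags, e.g. iostat -D):

Code:

# Extended per-device stats: 5-second interval, 3 samples;
# the %util column shows how busy each disk is
iostat -x 5 3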
Please share your thoughts on diagnosing this issue.

Thank you
Last edited by Developer9 on Tue Jun 28, 2016 8:27 am, edited 1 time in total.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

You didn't answer any of my questions...

With your design using the lookup stage, I would check if you have enough physical memory to support the reference data being preloaded to memory (assuming you are doing a normal lookup). If you exhaust physical memory, using swap space will generate a lot of disk IO and slow down your throughput significantly.

The other thing to realize... physical memory is a shared resource among all of the processes running on your server. You need to monitor resources while your job is in the midst of one of its slow runs.
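
For example, something like this during one of the slow runs will show whether you are paging (vmstat exists on both AIX and Linux; the column names differ):

Code:

# Sample memory and paging activity every 5 seconds
vmstat 5
# Linux: watch the si/so (swap-in/swap-out) columns.
# AIX:   watch the pi/po columns.
# Sustained non-zero values mean the box is paging, and the
# normal lookup may be exhausting physical memory.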

Mike
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

TMPDIR is the variable used to define the location where that memory swap file will be written. If it is blank then /tmp is used.
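
For example (a sketch; the path is a placeholder, pick a file system with free space and IO headroom):

Code:

# In $DSHOME/dsenv, or as a project/job-level environment variable
TMPDIR=/DataStageProjects/Projectname/tmp
export TMPDIR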
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

@Mike,

I have graphs for CPU/memory (normal usage) and see high disk usage on some disks during the delayed runs. Is there a built-in tool in the IBM Information Server suite for monitoring resources? We have involved IBM support, but I would like to analyze this on my end as well.

Thank you
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I guess we were typing at the same time...

Your data volumes seem insignificant.

The fork lookup is a potential problem with buffering and deadlocks. All of the reference data for a normal lookup needs to be preloaded to memory. Replace the lookup stage with a join stage for the more typical fork join design pattern.
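
Roughly this shape (a sketch, keeping your stage names; note that the join stage needs both inputs sorted and partitioned on the join key):

Code:

Seq File -----> XFM -----------> Join -----> Dataset (Output)
                 |                ^
                 +----------------+  (reference link, sorted on key)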

Perhaps your transformer can take on the transformation requirement that you think was easier for the lookup. It is the source for both stream and reference links, so it would seem it has all of the necessary data.

If all 4 of your locations for scratch and resource are located on the same file system, you're going to get some high disk usage on that file system under heavy loads.
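
That is easy to check (the paths below are taken from your posted config):

Code:

# If these all report the same file system, scratch and
# resource IO will contend with each other under load
df /DataStageProjects/Projectname/resource1 \
   /DataStageProjects/Projectname/resource2 \
   /DataStageProjects/Projectname/scratch1 \
   /DataStageProjects/Projectname/scratch2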

Mike
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I'm still mostly old school when it comes to analyzing resource usage. I like watching an interactive nmon window along with having a Director monitor window open.
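
If you want to try the same (nmon key bindings from memory; check the help in your version):

Code:

nmon        # start interactive mode, then press:
#   c - CPU utilization
#   m - memory and paging
#   d - disk IO per device
#   t - top processes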

It sounds like your admins have already captured the relevant data for you.

Mike
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

@Mike,

The issue is with DataStage starting this job. The actual execution is fast, but the delay comes right after the APT config file read event. The lookup operation has not even started at that point.

Code:

main_program: Startup time, 25:34; production run time, 0:01.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Is this a NEW slowness or has it always been around?

Did you recently update your version of DataStage?

Is this a first time execution of this job in this environment?
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

We have been facing this issue for a couple of months (quite a few times a month). No version upgrade; we are still on 8.7. The same job runs fine in production at other times with no issues. It is only at certain times (repeatedly on the morning runs) that we have this problem.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

From your last description, I would guess that it is related to the load on your server during those morning runs.

One possible cause of a long Startup time is an overloaded server.

Also keep in mind that "Startup time" is a misleading label. Think of it as more of a bucket of time that was not able to be allocated elsewhere.

I once observed that Startup time included the time it took Netezza to update statistics after a table load. It takes a bit of time to analyze a table with over 1B rows. When I turned that off in the job, all of my extra Startup time disappeared (and the table statistics were actually updated after all of the job's other processing was complete... so definitely not Startup time).

Mike