Disk IO issue - Longer Job run time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Disk IO issue - Longer Job run time

Post by Developer9 »

Hi,

We have a simple DataStage job that reads from a file and writes to a file. It normally runs in about 4-5 minutes, but in recent weeks it has quite a few times taken more than 45 minutes.

Director Log:

Code:

Starting Job name ..
main_program: orchgeneral: loaded
main_program: orchsort: loaded
main_program: orchstats: loaded
06/14/2016 7:00:32 am main_program: APT configuration file: /appl/infoserver/Server/Configurations/2node_Admin.apt
06/14/2016 7:45:32 am main_program: This step has 8 datasets: It runs 10 processes on 2 nodes...
xfm_data,1: APT_PMPlayer: new Player running, PID = 18,153,982, spawned from Section Leader, PID = 22,413,500
We involved the DS admins and Unix admins (lots of email threads :) ) and came to the conclusion that it is a disk IO issue: the job takes a long time to initiate or start up. How do we optimize this? How do we avoid it in the future?

Please advise; any ideas are welcome.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?

Mike
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

Is it something simple like a nearly full disk? Once past 90% full, disks can do strange things.
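
A quick way to rule that out (a sketch; the path is an example, point it at whatever file systems back your project):

Code:

# Free space on the file system (Linux; on AIX use df -g)
df -h /DataStageProjects
# Inodes can fill up too, with similar symptoms (Linux)
df -i /DataStageProjects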
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Sequential file or dataset?

Are the two nodes in the apt file on the same host or different ones?
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

@PaulVL, here is the job design:

Code:

Seq File -----> XFM -----------> LKP -----> Dataset (Output)
                 |                ^
                 +----------------+  (reference link)
No complex transformations in the transformer (just some constraints on date fields) on the links to the LKP stage.

They are on the same host. Here is what we have in the config file:

Code:

main_program: APT configuration file: Application/infoserver/Server/Configurations/2node_Projectname.apt
{
        node "node1"
        {
                fastname "etlpra"
                pools ""
                resource disk "/DataStageProjects/Projectname/resource1" {pools ""}
                resource disk "/DataStageProjects/Projectname/resource2" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch1" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch2" {pools ""}
        }

        node "node2"
        {
                fastname "etlpra"
                pools ""
                resource disk "/DataStageProjects/Projectname/resource2" {pools ""}
                resource disk "/DataStageProjects/Projectname/resource1" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch2" {pools ""}
                resource scratchdisk "/DataStageProjects/Projectname/scratch1" {pools ""}
        }
}
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

Mike wrote:What is the size of the file being read? What is the size of the file being written? What other stages are in your job design? What other processes are running on the server at the same time? Are you doing anything in a Before-job subroutine?

Mike
@Mike:

Code:

Read/write: ~20 MB each. No before-job subroutine.
Delay timings are consistent. CPU and memory are normal. Out of 28 disks, 5 are exceptionally highly utilized (85-95%) during a certain period.
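
For anyone who wants to reproduce the disk numbers, sampling along these lines works (Linux iostat shown; AIX uses different flags, e.g. iostat -D):

Code:

# Extended per-device stats: 5-second interval, 3 samples;
# the %util column shows how busy each disk is
iostat -x 5 3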
Please share your thoughts on diagnosing this issue.

Thank you
Last edited by Developer9 on Tue Jun 28, 2016 8:27 am, edited 1 time in total.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

You didn't answer any of my questions...

With your design using the lookup stage, I would check if you have enough physical memory to support the reference data being preloaded to memory (assuming you are doing a normal lookup). If you exhaust physical memory, using swap space will generate a lot of disk IO and slow down your throughput significantly.

The other thing to realize... physical memory is a shared resource among all of the processes running on your server. You need to monitor resources while your job is in the midst of one of its slow runs.
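
For example, something like this during one of the slow runs will show whether you are paging (vmstat exists on both AIX and Linux; the column names differ):

Code:

# Sample memory and paging activity every 5 seconds
vmstat 5
# Linux: watch the si/so (swap-in/swap-out) columns.
# AIX:   watch the pi/po columns.
# Sustained non-zero values mean the box is paging, and the
# normal lookup may be exhausting physical memory.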

Mike
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

TMPDIR is the variable used to define the location where that memory swap file will be written. If it is blank then /tmp is used.
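
For example (a sketch; the path is a placeholder, pick a file system with free space and IO headroom):

Code:

# In $DSHOME/dsenv, or as a project/job-level environment variable
TMPDIR=/DataStageProjects/Projectname/tmp
export TMPDIR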
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

@Mike,

I have graphs for CPU/memory (normal usage) and see high disk usage on some disks during the delayed runs. Is there a built-in tool in the IBM Information Server suite for monitoring resources? We have involved IBM support, but I would like to analyze this on my end as well.

Thank you
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I guess we were typing at the same time...

Your data volumes seem insignificant.

The fork lookup is a potential problem with buffering and deadlocks. All of the reference data for a normal lookup needs to be preloaded to memory. Replace the lookup stage with a join stage for the more typical fork join design pattern.
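
Roughly this shape (a sketch, keeping your stage names; note that the join stage needs both inputs sorted and partitioned on the join key):

Code:

Seq File -----> XFM -----------> Join -----> Dataset (Output)
                 |                ^
                 +----------------+  (reference link, sorted on key)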

Perhaps your transformer can take on the transformation requirement that you think was easier for the lookup. It is the source for both stream and reference links, so it would seem it has all of the necessary data.

If all 4 of your locations for scratch and resource are located on the same file system, you're going to get some high disk usage on that file system under heavy loads.
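
That is easy to check (the paths below are taken from your posted config):

Code:

# If these all report the same file system, scratch and
# resource IO will contend with each other under load
df /DataStageProjects/Projectname/resource1 \
   /DataStageProjects/Projectname/resource2 \
   /DataStageProjects/Projectname/scratch1 \
   /DataStageProjects/Projectname/scratch2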

Mike
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I'm still mostly old school when it comes to analyzing resource usage. I like watching an interactive nmon window along with having a Director monitor window open.
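
If you want to try the same (nmon key bindings from memory; check the help in your version):

Code:

nmon        # start interactive mode, then press:
#   c - CPU utilization
#   m - memory and paging
#   d - disk IO per device
#   t - top processes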

It sounds like your admins have already captured the relevant data for you.

Mike
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

@Mike,

The issue is with DataStage starting this job. The actual execution is fast, but the delay comes right after the APT config file read event. The lookup operation has not even started at that point.

Code:

main_program: Startup time, 25:34; production run time, 0:01.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Is this a NEW slowness or has it always been around?

Did you recently update your version of DataStage?

Is this a first time execution of this job in this environment?
Developer9
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Post by Developer9 »

We have been facing this issue for a couple of months (quite a few times a month). No version upgrade; we are still on 8.7. The same job runs fine in production at other times with no issues. It is only at certain times (repeatedly on the morning runs) that we have this problem.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

From your last description, I would guess that it is related to the load on your server during those morning runs.

One possible cause of a long Startup time is an overloaded server.

Also keep in mind that "Startup time" is a misleading label. Think of it as more of a bucket of time that was not able to be allocated elsewhere.

I once observed that Startup time included the time it took Netezza to update statistics after a table load. It takes a bit of time to analyze a table with over 1B rows. When I turned that off in the job, all of my extra Startup time disappeared (and the table statistics were actually updated after all of the job's other processing was complete... so definitely not Startup time).

Mike