Delays between Job Sequences / Calling Next Job

bmouton · Post by **bmouton** » Wed Apr 08, 2009 12:28 pm

I'm getting inconsistent performance times with Datastage on a recurring basis for the same volume of data being processed.

Brief Description:
1 Project - Multiple sequences executing SEQUENTIALLY.

The first run completed in 12 minutes (overall data processed 200 MB). The second run in 12 minutes.

Three hours later, less data being processed (50 MB), the Master Job Sequence ran 32 minutes.

Datastage is the only process running on the machine (apart from the OS).

I know this is extremely vague. I have no idea where to start looking.

From one cycle to the next, we can see as much as a two hour difference.

We have bounced Datastage, DB2 9.1 (Repository), the Linux Server multiple times. Sometimes, DS flies and other times it crawls like a snail ...

I have searched IBM's website and DSXchange to see if anyone has encountered this type of issue.

We have been monitoring CPU, Memory, and IO on both the Datastage Server and the DB2 9.5 Server. When the cycle runs fast or slow, the CPU, Memory, and IO are the SAME.

Help!

ray.wurlod · Post by **ray.wurlod** » Wed Apr 08, 2009 12:30 pm

What do the jobs do? In particular do they access data over a network? Have you checked that the network might be the bottleneck, because everyone's downloading videos on it?

bmouton · Post by **bmouton** » Wed Apr 08, 2009 12:45 pm

We are running fiber channel to our SAN ... We have been monitoring activity on the network and the SAN. There are no spikes in either when this inconsistency occurs.

It's not a bandwidth issue. We are 2 GIG-E fiber ... That's not the problem ...

And the network has limited sharing. We have noticed that the problem occurs RANDOMLY. There does not appear to be an issue with the network or the SAN.

kcbland · Post by **kcbland** » Wed Apr 08, 2009 12:48 pm

By saying ALL jobs are slower, we can't easily help pinpoint problems. If we could talk about a specific job then we can focus in on exact issues.

For example, if a job runs twice in the same day, processes different volumes, but has longer runtimes for the smaller volumes, maybe we could talk about the profile of the data. Maybe the larger volumes were more inserts then updates and loaded quicker. However, smaller volumes but more updates can take much longer.

If you see an across the board degradation and can't explain why, that points to hardware more than data profile. A simple job that reads and writes between sequential files should operate at a consistent pace given excess cpu resources (notice I said pace and not time). If that type of job ran at a different pace then you should investigate your disks - your processes could be starving for data or having issues writing out its data.

I recommend focus on a few simple jobs and use those to measure your performance differential. A simple job that extracts a table and dumps to a file without much transformation/lookup logic is a great example to measure if there's network traffic issues dumping out the data. Another example is the seq-->xfm--->seq type job to point out cpu/disk issues. If you have complicated jobs that mix database with transformation and more database loading you're in a nearly impossible position to troubleshoot without breaking down the jobs.

ray.wurlod · Post by **ray.wurlod** » Wed Apr 08, 2009 12:56 pm

My reason for asking about remote data is that the variability might be in the load on database servers.

bmouton · Post by **bmouton** » Wed Apr 08, 2009 1:26 pm

OK ... Here are more specifics:

Same data volume. Same exact data set ...

I run the Master Job Sequence that calls a set of job sequences SEQUENTIALLY.

1st Run - 12 minutes
Empty the staging tables, and Data mart tables.

2nd Run - 12 minutes
Empty the staging tables, and Data mart tables.

3rd Run - 12 minutes
Empty the staging tables, and Data mart tables.

15th Run - 35 minutes
Empty the staging tables, and Data mart tables.

16th Run - 35 minutes
Empty the staging tables, and Data mart tables.

22nd Run - 18 minutes
Empty the staging tables, and Data mart tables.

So ... during those times we monitored network traffic, database traffic, datastage traffic, CPU, Memory, and IO for all related boxes.

No apparent issues.

Are there settings in Datastage that default to a "Datastage Adjusted" (e.g. automatically managed by DS) that are not part of the normal install? In DB2 9.5 running on SLES10, that we could not use automatic memory management. Additionally, we had to manually configure CPU speed (I'd never done that in my career).

So ... It doesn't seem to be a job thing ... It's in Datastage somewhere ...

While going out and creating test jobs sounds like a wonderful thing, I have done that ... And I am unable to duplicate the same problem ...

chulett · Post by **chulett** » Wed Apr 08, 2009 2:14 pm

Clarify something here. Your subject says "Delays between Job Sequences / Calling Next Job" and yet you never mention anything about that in your posts, just mentioning total run time for the sequence of jobs. So... when the overall sequence goes from 12 minutes to 35 minutes for the exact same data, are the individual jobs taking incrementally longer to process that data? Or, as per your subject, do the individual jobs all run in approximately the same time and the delay is all in-between jobs?

bmouton · Post by **bmouton** » Wed Apr 08, 2009 2:28 pm

Fair enough ...

We have been monitoring through Director ...

It is unfortunate that we cannot repeat the same "lag" between jobs / job sequences.

For example, the Master Job Sequence will start, then a minute later the first job in the Sequence will start. It will run for 10 seconds, then the next job may or may not start immediately.

The problem is that sometimes they fire off as soon as the predecessor completes. Other times, it "stalls" for a minute to up to 10 minutes before starting the next job in the sequence.

chulett · Post by **chulett** » Wed Apr 08, 2009 2:44 pm

First thing to check would be the number of entries in the Project's &PH& "phantom" directory. Large or out of control numbers there could induce a processing lag. If needed, clear out anything 2+ days old and see if that helps.

bmouton · Post by **bmouton** » Wed Apr 08, 2009 2:56 pm

Craig,

Not to sound stupid ... Should I bring down DS? or simply stopping running the jobs????

Thanks!! Very Helpful Info!

bmouton · Post by **bmouton** » Wed Apr 08, 2009 3:08 pm

Got the command ...

We'll see how it goes now ...

Thanks!

chulett · Post by **chulett** » Wed Apr 08, 2009 10:00 pm

Sorry for the late response. No need to bring down DataStage and the "2+ days old" criteria was to avoid effecting any running jobs.

bmouton · Post by **bmouton** » Thu Apr 09, 2009 8:19 am

That cleaned out the phantom files ...

Unfortunately. no improvement ...

We are thinking about updating the statistics on the xmetadb to see if that would help.

We are also removing our temporary hash files that we use in the process to see if that helps ...

bmouton · Post by **bmouton** » Fri Apr 10, 2009 10:05 am

Help!!!

So ... We have been tracking the times that the jobs run themselves ... The jobs fly ... In and out in seconds ...

The issue appears (but not certain) in Datastage ... We have approximately 80 jobs that all run sequentially ...

All of the jobs were being checkpointed ... We removed 80% of the checkpoints and no improvements ...

Any ideas on what cause the jobs to pause before calling the next job?

What is even more strange is that we have 5 identical Datastage Cycles (Job Sequences and jobs) pointing to its respective file system and database. The smaller database (and volume of data) that runs takes longer than the database that have 10 times more data.

I'm reaching for straws here ... Could it be bufferpool size in the DB2 9.1 Xmetadb repository? SLES 10 not working well with DS?

Please help!!!

ray.wurlod · Post by **ray.wurlod** » Fri Apr 10, 2009 10:27 am

We're with you but also clutching at straws.

What has your official support provider had to say?

Alas this is a scenario that would be difficult to reproduce - it would need comparable hardware but also the long time period over which to degrade the elapsed times.

Have you been generating operational metadata? (This is collected into the XMETA database - maybe the fact that those tables are increasing in size and need to manage their tablespace is part of this problem.) Note that this is only another straw - I can not say for certain.

I suspect that, whatever the solution is, it will need to be found by someone expert actually poking around in your system. Things can appear different on the other side of the glass.