There was a code install today that, at first glance, seemed to coincide with the problem. However, after investigating some logs, we found runtimes AFTER the install that were normal; the slow runs began about 45 minutes later.
We run all of our DataStage jobs from Unix scripts. These scripts call a home-grown function called run_dsjob that acts as a wrapper around dsjob and handles all of the logging of data sent back from DataStage while the job is running. This function has not changed in months. All it does is call dsjob, either a) to retrieve information about the job and its logs, or b) to actually kick off the DataStage job.
When we watch the job in Director, the runtime appears to be normal (although we did see a few slow start-ups). On the other hand, if we watch the logs on the Unix side, it takes forever for the information shown in Director to be written to our logs.
The very first thing our function does is use dsjob to pull the list of parameters for the job. The function then iterates through that list, pulling details about each parameter. This is strictly informational and happens before the job ever starts. Here are the commands:
Code:
params=`dsjob -lparams $ds_project $ds_job`
for param in $params
do
    dsjob -paraminfo $ds_project $ds_job $param
done
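To help localize where the time is going, here is a minimal sketch of a timing wrapper. The time_cmd name is our invention, not part of run_dsjob; the idea is just to time each external call individually, so you can tell whether dsjob itself stalls or the surrounding script does.

```shell
#!/bin/sh
# Hypothetical helper: run a single command, discard its output, and
# report elapsed wall-clock seconds plus the return code.
time_cmd() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "elapsed=$((end - start)) rc=$rc cmd=$1"
    return $rc
}

# Example usage (assumes $ds_project and $ds_job are set as in the
# script above):
#   time_cmd dsjob -lparams $ds_project $ds_job
#   time_cmd dsjob -paraminfo $ds_project $ds_job SomeParam
```

Wrapping each dsjob invocation this way would show whether the 35-minute runs are spent inside a few long dsjob calls or spread evenly across many of them.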
Another portion of the function gathers information AFTER the job completes. This uses a home-grown adaptation of dsjob for pulling job logs faster, called dsjob_fast_log_extract. That program has not changed in ten months or more, and it is slow whether called from the function or directly from the command line.
So far, it looks like the slowdown occurs when DataStage is sending data back to the calling process - log data, for example. As I mentioned, the actual runtime of the job (as shown in Director) seems normal. We made a modest improvement in overall runtime (i.e., wall-clock time) by reducing the amount of logging that was returned. We had several debugging options set up for a previous issue, which left hundreds of extra lines in the logs that had to be retrieved. Removing those improved the runtime because fewer log records had to be pulled from DataStage into our Unix logs.
One final observation was the runtime of an orchadmin command to delete a dataset. Normally this takes 1-2 seconds. In this test, it took over 2 minutes.
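The orchadmin case suggests a second sketch: a wrapper that flags any run exceeding an expected ceiling, so anomalies like the 2-minute delete get logged automatically. The warn_if_slow name and the 5-second limit are our inventions for illustration; only standard shell is assumed.

```shell
#!/bin/sh
# Hypothetical helper: run a command and warn on stderr when it takes
# longer than an expected limit (in seconds).
warn_if_slow() {
    limit=$1; shift
    start=$(date +%s)
    "$@"
    rc=$?
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$limit" ]; then
        echo "WARN: '$1' took ${elapsed}s (limit ${limit}s)" >&2
    fi
    return $rc
}

# Example usage (the dataset path is illustrative):
#   warn_if_slow 5 orchadmin delete /data/datasets/example.ds
```

Sprinkling this around the orchadmin and dsjob calls would build a timestamped record of exactly which operations degrade, and when, which could then be correlated with whatever the System Admins find.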
So, those are the symptoms we are seeing. We have checked for errors on our disks and have not found any in the recent past. We have watched iostat, vmstat, and netstat and found no issues with our I/O. When all of these issues occurred, the affected jobs were literally the ONLY things running on a 16-CPU system (IBM P5-series) that can normally process 20-30 million records and write them to a DB2 table in a little over a minute. The slow job imported only a few hundred thousand records, yet some runs took up to 35 minutes.
System load during the tests was 0.3 or less - extremely low, since nothing else was running. So neither system load nor I/O was overloaded, and there doesn't appear to be anything wrong with the disks.
Tomorrow, our System Admins will research whether any software or configuration changes were made during this window. At this point, we are not aware of any.
Has anyone run into this before? Can you think of anything else we can check?
Any help would be greatly appreciated!
Brad.