Problems with dsjob running VERY slow

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Problems with dsjob running VERY slow

Post by bcarlson »

We are having a strange slow down in our production environment where dsjob seems to be returning data VERY slowly.

There was a code install today that, at first glance, seemed to coincide with the slowdown. However, after investigating some logs, we found runtimes AFTER the install that were normal; about 45 minutes later the jobs started running slow.

We run all of our DataStage jobs from Unix scripts. These scripts call a home-brewed function called run_dsjob that simply acts as a wrapper around dsjob and handles all of the logging of data sent back from DataStage while the job is running. This function has not changed in months. All it does is call dsjob, either a) to retrieve information about the job and its logs, or b) to actually kick off the DataStage job.
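
For context, here is a stripped-down sketch of what run_dsjob does. This is illustrative only - the real function has far more logging and error handling, and the function layout and log path here are made up; the dsjob flags are the standard -run/-wait/-jobstatus/-logsum ones.

Code:

#!/bin/ksh
# Illustrative sketch only - the real run_dsjob does far more logging
# and error handling. Assumes dsjob is on the PATH.
run_dsjob()
{
  ds_project=$1
  ds_job=$2
  logfile=$3

  # Kick off the job and wait for it to finish; -jobstatus makes dsjob's
  # exit code reflect the job's finishing status.
  dsjob -run -wait -jobstatus $ds_project $ds_job >> $logfile 2>&1
  rc=$?

  # Pull a summary of the job's log entries back into the Unix log.
  dsjob -logsum $ds_project $ds_job >> $logfile 2>&1

  return $rc
}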

When we watch the job in Director, the runtime appears to be normal (although we did get a few slow start ups). On the other hand, if we watch the logs on the Unix side, it takes forever for the information in Director to be written to our logs.

The very first thing that our function does is use dsjob to pull a list of parameters for the job. Then the function iterates through that list pulling details about each parm. This is strictly informational, and happens before the job ever starts. Here are the commands:

Code:

# Get the list of parameter names for the job...
params=`dsjob -lparams $ds_project $ds_job`

# ...then pull the details for each one.
for param in $params
do
  dsjob -paraminfo $ds_project $ds_job $param
done
If I watch this writing to the Unix log file, I can see every character show up on screen and grab a coffee between each one. On the other hand, if I run the exact same commands from the command line, they run fine and fast.
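
For what it's worth, the difference is easy to quantify - just time the identical commands in both contexts. Nothing DataStage-specific here, only the shell's time keyword:

Code:

# Time the list command itself...
time dsjob -lparams $ds_project $ds_job > /dev/null

# ...and each -paraminfo call individually, to see where the seconds go.
for param in `dsjob -lparams $ds_project $ds_job`
do
  echo $param
  time dsjob -paraminfo $ds_project $ds_job $param > /dev/null
done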

Another portion of the function gathers information AFTER the job completes. It uses a home-grown adaptation of dsjob called dsjob_fast_log_extract, built to pull job logs faster; it has not changed in 10 months or more. This program is slow whether it is run from the function or from the command line.

So far, it looks like the slowdown occurs when DataStage is sending data back to the calling process - log data, for example. As I mentioned, the actual runtime of the job (in Director) seems more normal. We made a modest improvement in overall runtime (i.e. the wall-clock time a user sees) by reducing the amount of logging that was returned. We had several debugging options set up for a previous issue, which left hundreds of extra lines in the logs to retrieve. Removing those improved the runtime because fewer log records had to be pulled from DataStage into our Unix logs.
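
If anyone wants to cap how much is pulled back without changing the job's logging, dsjob's log-summary command can filter by type and limit the number of entries. The flag names below are from the dsjob usage text as I remember it, so double-check them on your release:

Code:

# Pull only warnings and only the most recent 100 entries instead of
# everything. -type and -max are -logsum options, but confirm the exact
# usage (dsjob -logsum) on your release.
dsjob -logsum -type WARNING -max 100 $ds_project $ds_job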

One final observation was the runtime of an orchadmin command to delete a dataset. Normally this takes 1-2 seconds. In this test, it took 2+ minutes.
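
The check itself is nothing more than timing the delete. The dataset path below is made up, and we happen to use the rm form of orchadmin - adjust for however you delete datasets:

Code:

# Time the dataset delete. Normally 1-2 seconds for us; during the
# slowdown it ran for over 2 minutes. The dataset path is illustrative.
time orchadmin rm /work/datasets/cust_extract.ds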

So, those are the symptoms we are seeing. We have checked for errors on our disks and have not found any in the recent past. We have watched iostat, vmstat, and netstat and found no issues with our I/O. When all of these issues occurred, these tests were literally the ONLY thing running on a 16-CPU system (IBM P5-series) that can process 20-30 million records and write them to a DB2 table in a little over a minute. The slow job imported only a few hundred thousand records, and some runtimes took up to 35 minutes.

System load during the tests was 0.3 or less - extremely low, since nothing else was running. So neither system load nor I/O was overloaded, and there doesn't appear to be anything wrong with the disks.
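
For reference, these are roughly the checks we ran - all standard AIX/Unix tools, intervals and counts chosen arbitrarily:

Code:

uptime        # load average - stayed at 0.3 or less during the tests
vmstat 5 3    # CPU, memory and paging, three 5-second samples
iostat 5 3    # per-disk throughput and busy percentages
netstat -i    # interface errors and collisions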

Tomorrow, we will be having our System Admins research if any software or config changes were made during this time. At this point, we are not aware of any.

Has anyone run into this before? Can you think of anything else we can check?

Any help would be greatly appreciated!

Brad.
Cr.Cezon
Participant
Posts: 101
Joined: Mon Mar 05, 2007 4:59 am
Location: Madrid

Post by Cr.Cezon »

Hello Brad,

DataStage uses several resources to run a job; it may be that the internal DSEngine files have grown very large or become corrupted.

Try the following:
-Clear the status file of the job you are running
-Clear the log of the job you are running
-Clear out the &PH& directory

If that doesn't improve performance, try:
-Rebuilding the repository indices
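
For the &PH& directory, something like this from the Unix side should work (the project path and the 7-day cutoff are only examples - check them on your system before deleting anything):

Code:

# Run from the project directory on the server. The ampersands must be
# escaped for the shell, and the 7-day cutoff is only an example.
cd /path/to/your/project
find ./\&PH\& -type f -mtime +7 -exec rm {} \;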

Regards,
Cristina
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Are the logs that you are retrieving huge? In other words, how often do you purge them? It takes longer to retrieve and go through 7 days of logs than 2 days of logs.
It might also just be a shell issue. As you said, if you run the dsjob -lparams command from the command line it's fast, but within the script it takes forever. Test that out by creating a small script that just generates numbers and see how fast it runs.
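Something as trivial as this would do (purely illustrative):

Code:

#!/bin/ksh
# Trivial script with no DataStage calls at all. If even this crawls,
# the problem is the shell or environment, not dsjob.
i=0
while [ $i -lt 1000 ]
do
  echo $i
  i=`expr $i + 1`
done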
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If it's only startup that's slow, it might be that there are thousands of old entries in the &PH& subdirectory. You can safely delete all of them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Quick update on what's going on...

The problem disappeared that evening. We had slow runtimes from about 1:30 in the afternoon until between 7 and 8 pm that evening. Then everything was running fine, and it hasn't happened since. As far as our SLAs (service level agreements) go, that is good news. However, it is very disconcerting to realize we still don't know why it was happening.

As for the postings y'all have made, I appreciate the insight and I'll attempt to address answers here....
1. Our logs are pretty small. Since we run everything from Unix, all of our DataStage logs are captured in text files at runtime, so we purge the logs after 1 day to keep them as small as possible.

2. During the slowdown, we repeatedly removed all logs for our test job, just to ensure that the jobs had as little to return as possible. And I did test other jobs in other projects and had the same issue with all of them.

3. Not sure what &PH& is - can you elaborate? I do a lot of the DS Admin work here, but haven't exactly been trained to do so and haven't figured everything out yet (give me another day or two :wink: )

4. I assume "Rebuild Repository Indices" is referring to "DS.REINDEX ALL"? We thought of doing that, but then had the same problem occurring in multiple projects. That made us think the issue was outside of DataStage and more Unix-server/environment related. (If we do try it, a rough sketch of how we'd run it from the Unix side is after this list.)
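The rough plan for the reindex, if we go that route - syntax from memory, so please correct me if it's wrong, and obviously with no jobs running in the project at the time:

Code:

# Rough sketch only - verify against your install before running, and
# make sure no jobs are active in the project.
. $DSHOME/dsenv
cd /path/to/your/project
$DSHOME/bin/uvsh "DS.REINDEX ALL"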
We are looking into some other options right now. We have narrowed down the timeframe for when the slowdown appeared to start: it happened sometime between 1:30 and 2:15. We are trying to find out what could have happened to the system during that timeframe.

We know for sure that at the same time, our Sys Admins were doing some testing of VCS (Veritas Cluster Server) failovers. One of our techs said that bus resets (hmm, not sure what those are) could possibly affect it. He is researching this now.
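
If anyone else hits this, a couple of things we are asking the Sys Admins to pull - assuming a fairly standard VCS install, so the command and log path may differ on yours:

Code:

# Cluster and service-group state at a glance.
hastatus -sum

# The VCS engine log should show exactly when the failover testing ran,
# which we can line up against our 1:30-2:15 window. Path may differ.
tail -200 /var/VRTSvcs/log/engine_A.log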

So, the problem is gone (for now), but we are still puzzled. If anyone thinks of anything else, I am open to suggestions!

Thanks!

Brad.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Every background process ("phantom") that DataStage runs redirects its output to a subdirectory called &PH& in the project directory on the server. If &PH& contains thousands of files, job startup can be negatively impacted.
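
A quick way to check whether that is your situation (run from the project directory, escaping the ampersands):

Code:

# How many phantom output files have piled up, and how old are they?
ls ./\&PH\& | wc -l          # total file count
ls -lt ./\&PH\& | tail -5    # the oldest entries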

The VCS reconfiguration might very well have caused the symptoms you observed, particularly if your I/O was using channels affected by the VCS work. I have seen similar things in the past - one classic was some large batch updates into Oracle that no one bothered to tell the ETL team about. So a job that ran fine most of the time sat and waited for three hours while the batch updates held table-level locks. And, needless to say, the SLA for having the DW ready for start-of-business was not met on those occasions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.