Jobs "Hanging"

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Jobs "Hanging"

Post by tonystark622 »

I have been having intermittent problems with some DataStage jobs "hanging" on my production system. It appears that a link going into a transformer will finish, but the output link will never finish. All of these jobs read flat files, do lookups against hash files, and write a flat file for output. None of them write to the same files.

I noticed tonight that at the same time I had one job "hung" like this, another job was also "hung". This job reads a flat file, transforms it to XML and writes it to another flat file. The stages had all finished (all links were green in Designer with Show Performance Statistics on), but the after-job routine wouldn't run. These jobs stayed this way for over 20 minutes.

I finally stopped the first job and was dealing with getting it reset and ready to run again when I noticed that the second job had finished normally. It seemed to me as if I "freed" or "unlocked" something when I stopped the first job, and that allowed the second job to continue and finish normally. These jobs do _not_ have anything to do with each other and do _not_ access the same flat files or hash files.

I also wanted to add that when these jobs "hang", I stop them, reset them, kill any leftover processes from these jobs, and restart them. Most of the time they run normally and finish fine the second time.

I also had Row Buffering enabled. I have turned it off on all these jobs and recompiled them in the last week or so, but I'm still getting this problem. I have also checked my free disk space and it looks like I have plenty free on /tmp, in the project directory (sort puts some temporary files there), and in my data directories.

Any ideas anyone? This has been killing our production system lately and I'm at the point where I can't let our production jobs run unattended.

Thanks,
Tony
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Tony - how are you managing your Phantom directory? I've found that you can see issues like this when there are a large number of entries in the &PH& directory under any given project.

You can try a couple of things. Short term: with no jobs running, log on to the Administrator and issue a CLEAR.FILE &PH& command against the affected Project. Long term: build yourself a script, scheduled in something like cron, that motors into that directory and deletes any files over a certain number of days old, and run it every day. I generally keep a rolling 2 or 3 days in there at most, but it really depends on the volume of jobs you have running on a daily basis.
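To make the long-term idea concrete, here is a bare-bones sketch of that kind of sweep. The project path, the 3-day retention and the cron schedule are placeholders only, not anything specific to Tony's system, so adjust them before using anything like this.

#!/bin/sh
# Nightly &PH& sweep (sketch) - placeholder path and retention, adjust to suit.
# Schedule it daily in cron, for example:  0 2 * * * /path/to/clean_ph.sh

PHDIR='/your/project/path/&PH&'    # full path to the project's &PH& directory

# Delete phantom output files more than 3 days old; leave hidden marker files alone.
find "$PHDIR" -type f ! -name '.*' -mtime +3 -exec rm -f {} \;

The CLEAR.FILE route is still the quick fix; the script is just there so the directory never builds up to that point again.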
-craig

"You can never have too many knives" -- Logan Nine Fingers
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Thanks, Craig. I'll look into it. I generally delete these by hand and haven't even thought about it for a while. Keep your fingers crossed. This is killing me. :(

Tony
Neil C
Participant
Posts: 46
Joined: Sun Apr 06, 2003 8:59 pm
Location: Auckland, New Zealand

Post by Neil C »

Craig, could you define 'large' please? We do not clean up &PH& (so far!) and I find that we have about 4,000 files in this directory. I would assume that this could be causing us some grief?

Thanks,
Neil Courtney
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Does the after-job routine attempt to do anything with the target files of the job?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Neil C wrote: Craig, could you define 'large' please?
Not really. :oops: Large is a relative term and would vary from system to system. I'm just saying that, once a job has completed, that phantom entry has served its purpose and is no longer required. It is a Good Practice to help keep their numbers under control and this is generally handled by regular sweeps of old phantoms via a cron-type script.

On the off chance that it could cause you some grief, I'd suggest clearing yours out - if not on a regular basis - then at least once in a while when you think of it.

Now I'm curious if Tony ever figured out what his particular issue actually was. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
Klaus Schaefer
Participant
Posts: 94
Joined: Wed May 08, 2002 8:44 am
Location: Germany

Post by Klaus Schaefer »

Tony, are you still having this problem? What is your exact environment (Solaris 8? DS 7.1r1?)? If so, there is a specific issue with that Solaris release which leads to problems similar to the ones you describe.

Please contact Ascential support.

Klaus
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Yes and no. My environment is: HP-UX 11.11, DataStage v7.1.0 (the original release, not one of the patch releases since then).

I'm working with Ascential support now. Here's what I believe to be the main problems:

1) There seems to be an issue (read: bug) with Row Buffering that causes intermittent job "hangs" under some circumstances.

2) I had a problem where I turned off Row Buffering on my jobs and it didn't actually turn off. We were able to determine that Row Buffering was still enabled because (a) I had a process for each active stage instead of one process for the whole job, and (b) I was getting "ipc_getnext()" and "timeout waiting for mutex" errors when I stopped this job after a "hang". Re-enabling Row Buffering, compiling, then disabling it and compiling again seemed to resolve this issue.

3) On one of the jobs that was experiencing "hangs", I have several cascaded transformers (8 or 9) with a sort stage near the beginning of the string (XF->Sort->XF->XF->XF->XF->XF->XF->XF->hash_file->XF->Seq file). When row buffering was enabled on this job, I would get continuous count updates on all the links between transformers after the sort stage as the job ran. After disabling row buffering, the row count incremented in real time on the link going into the sort stage, but didn't change on any of the other links up to the hash file until all of these transformers were done. That made it look like the job was still "hanging" every time I ran it. In fact, the job is working and finishes OK if I let it run long enough (the job takes about 50 minutes to run).

4) Some of my jobs read from /dev/null to initialize a hash file at the beginning of a job and right after that I write to /dev/null. I'm doing this to keep anything else in the job from happening until the hash file is initialized. Looks kinda like this:
Seq File (/dev/null)->XF->hash file (clear before writing)->XF->Seq File (/dev/null in, data file out)->XF (and the rest of the job continues from here). I did this because the hash file is written to from multiple places and I wanted to clear it before anything else happens. The support folks suspect that there may be an issue reading from and writing to /dev/null within the same process now that we've turned row buffering off. I have rewritten these jobs to initialize the hash file from a transformer only, with no read from /dev/null.

5) I had the "Enable hashed file cache sharing" Job Property turned on in some of these jobs. I have disabled it for now as a test.

6) I changed the configuration parameters MFILES from 50 to 120 and T30FILE from 200 to 300, then regened and restarted DataStage (rough outline below).
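For anyone who needs to make the same change, a rough outline of how those engine tunables are normally adjusted on a 7.x Unix install follows; the $DSHOME location, file names and utilities are assumptions about a standard engine layout, so check your own release before copying any of it.

# Sketch only - assumes a standard DataStage 7.x engine under $DSHOME.
cd $DSHOME
. ./dsenv                  # pick up the engine environment
bin/uv -admin -stop        # stop the engine, with no jobs running

# Edit uvconfig so the two tunables read:
#   MFILES  120
#   T30FILE 300
vi uvconfig

bin/uvregen                # regenerate the engine configuration
bin/uv -admin -start       # restart the engine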

I think that's where I'm at for now. I'm still testing these jobs. I had one "hang" yesterday when run from our main job sequencer job. It ran OK twice after that when I ran it manually. I've incorporated some of the changes listed above into this job, so I'm going to test it more today.

The support folks have been great!

Tony
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA

Creating/Initializing files and tables...

Post by chucksmith »

To initialize files (your number 4), you do not need an input stage. Just link a Transformer to the output file stage (sequential, hash, OCI, etc.).

In the transform, create a stage variable and set its value to @FALSE.

In the output link constraint, set it to stage variable = @TRUE.

As you can see, the constraint will never be true, so no rows will be written, but any initialization conditions, like clear file before writing or create file will be executed.

For hash files, this type of job can be used to create them in a directory, and then, as an after-job routine, a VOC pointer can be created (with my CreateVocPtr routine) so subsequent jobs can access the hash file as if it had been created in the project.
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Thanks, Chuck.

Yep, logically that's pretty much what I did (I used a different test, but same thing).

Tony