Datastage Parallel Job Hang

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

epitts88
Participant
Posts: 8
Joined: Tue Nov 08, 2011 8:16 am

Datastage Parallel Job Hang

Post by epitts88 »

Hi,

We have an issue in our environment whereby parallel jobs randomly "hang". By "hang", I mean that in Director the jobs stay in a running state. The only way to resolve the issue is to restart the IIS services and re-run the jobs.

When the jobs hang, the osh.exe processes are still present on the engine server, yet CPU and RAM usage are low. In the Director log, it's as if the log is sat waiting for a message to be returned from one of the stages. Interestingly, when we stop and start the IIS services, the missed message is written to the log, but the job then obviously aborts because the services have been stopped.
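The symptom described here (osh processes alive but the log no longer advancing) can be detected automatically rather than by watching Director. A minimal watchdog sketch in Python; the log path, stall threshold, and the way you determine that the osh process is alive are all placeholders for your environment:

```python
import os
import time

def job_looks_hung(log_path, stall_seconds, process_alive):
    """Return True when the job's log file has not been written to for
    `stall_seconds` even though its osh process is still alive -- the
    combination of symptoms described above."""
    if not process_alive:
        return False  # job finished or aborted; not a hang
    age = time.time() - os.path.getmtime(log_path)
    return age > stall_seconds
```

A scheduled task could run this every few minutes and alert when it returns True, which would at least remove the need to spot hung jobs by eye.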

Generally, the hang occurs on one of our generic jobs that has an invocation id and often occurs on an Oracle Enterprise Stage, although it has occurred on other stages.

We have raised a PMR with IBM regarding this, but as of yet we have not managed to find a resolution. We are able to reproduce the error in one of our test environments when we set a loop of jobs running.

Our environment runs on Windows Server 2003 with Information Server 8.0.1.2 on a 4-node configuration. We have also tested on a 2-node configuration, which experienced the same issues. IBM believe it is a resource issue and are looking at performing a tuning exercise on the Desktop Heap registry setting.
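For readers unfamiliar with that setting: on Windows, desktop heap sizes are controlled by the SharedSection values in the registry value below. The third number is the non-interactive desktop heap, which is the one that matters for services such as the DataStage engine. The sizes shown are purely illustrative; the actual values should come from IBM's tuning guidance:

```
Key:   HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems
Value: Windows
       ...SharedSection=1024,3072,512...
                         ^    ^    ^
                         |    |    non-interactive desktop heap (KB)
                         |    interactive desktop heap (KB)
                         system-wide heap (KB)
```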

We have had this problem for months now and would really appreciate any help we can receive through DSXchange to help us resolve this issue.

Many thanks,
Elliot
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Welcome aboard.

Usually the first thing I look for in a "hang" situation is database locks or deadlocks. Work with your DBA on this. Don't forget the XMETA database.
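One concrete way to act on this advice: on Oracle 10g and later, blocking relationships are exposed in the BLOCKING_SESSION column of V$SESSION. Below is a minimal sketch of the check itself, written against plain Python data so it runs without a database; the dict keys mirror the V$SESSION columns, and the rows are illustrative:

```python
# Each dict mirrors one row of Oracle's V$SESSION. In a live check the rows
# would come from a query such as:
#   SELECT sid, blocking_session FROM v$session
#   WHERE  blocking_session IS NOT NULL;
def find_blockers(sessions):
    """Map each blocked SID to the SID holding the lock it is waiting on."""
    return {s["sid"]: s["blocking_session"]
            for s in sessions
            if s.get("blocking_session") is not None}

rows = [
    {"sid": 101, "blocking_session": None},  # idle session
    {"sid": 205, "blocking_session": 101},   # waiting on 101's lock
]
```

If the result is empty while the job is hung, database locking is unlikely to be the cause.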
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
epitts88
Participant
Posts: 8
Joined: Tue Nov 08, 2011 8:16 am

Post by epitts88 »

Hi Ray,

Thank you for the response. We have access to view the sessions in the database and could not see any locks that had caused the latest hang. I also had a look at the XMETA database and, although I don't have any experience using the Health Center, there was no mention of any locks on the XMETA database.

The next time I get a hanging situation in our test environment I will ask one of our DBAs to take a look as well to confirm there are no deadlocks or locks.

As soon as I stopped the "hung" job in Director, the missed message was output. The message was a simple read from a file stating how many records were imported.

Many thanks,
Elliot
epitts88
Participant
Posts: 8
Joined: Tue Nov 08, 2011 8:16 am

Post by epitts88 »

Hi,

Does anyone have any suggestions for my query, please? I have now run a loop of jobs on one node, as people have suggested doing this to rule out locking situations. However, I am still getting hangs whereby the osh processes are still present on the server but the Director log has not updated.
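For anyone wanting to repeat the one-node test: the usual approach is to point APT_CONFIG_FILE at a single-node configuration file. A minimal sketch, in the standard parallel configuration file format; the server name and paths are placeholders for your environment:

```
{
    node "node1"
    {
        fastname "your_engine_server"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}
```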

It's as if a message has been missed or not picked up in time for the job to move on to the next stage.

Any help would be greatly appreciated.

Thanks,
Elliot
suse_dk
Participant
Posts: 93
Joined: Thu Aug 11, 2011 6:18 am
Location: Denmark

Post by suse_dk »

Please describe the job design of the generic job....
_________________
- Susanne
epitts88
Participant
Posts: 8
Joined: Tue Nov 08, 2011 8:16 am

Post by epitts88 »

Hi,

It doesn't always fail on the generic job. The latest hang (which I have left in a hanging state whilst I get the DBA to check for locks) is an extract job which writes to three sequential files: one for the header, one for the detail records, and one for the trailer record.

The basic job outline is as follows:

Code: Select all

rowGenerator1 --> Transformer1 --> Sequential File1

oracleEnterprise --> Transformer2 --> Sequential File2
                          |
                          |
                   Shared Container
                          |
                          |
                      Copy Stage
                          |
                          |
rowGenerator2 --> Lookup Stage --> Transformer3 --> Sequential File3
(To be clear: Transformer2 links to the Shared Container, which links to the Copy Stage, which in turn links to the Lookup Stage.)


I did not design the job, it was designed by our development team and passed on to us, Applications Support. However, we now do most of the ETL changes and development work.

Looking at the Monitor in Director, it looks like the first row generator through to Sequential File1 has completed (1 row, 0 rows/sec). The Oracle stage through to Sequential File2 has also completed (11552 rows, 550 rows/sec), as has the transformer to the Shared Container (11552 rows, 550 rows/sec).

It looks like it has "hung" from the Shared Container to the Copy stage, or from rowGenerator2 to the lookup (0 rows, 0 rows/sec).

Thanks,
Elliot
suse_dk
Participant
Posts: 93
Joined: Thu Aug 11, 2011 6:18 am
Location: Denmark

Post by suse_dk »

...and what is happening within the shared container?
_________________
- Susanne
epitts88
Participant
Posts: 8
Joined: Tue Nov 08, 2011 8:16 am

Post by epitts88 »

The basic outline is this:

Code: Select all

Input --> Aggregator
              |
              |
rowGenerator --> Join
                  |
                  |
  Output <-- Transformer --> Dataset
                  |
                  |
           Sequential File
(To be clear: the aggregator and the row generator both feed the Join; the Join feeds the Transformer, which has three output links: the container Output, a dataset, and a sequential file.)

The Input link carries two columns (COUNT and a dummy). It feeds an aggregator that counts the records and writes the result to a column called OutputRows.

A row generator is linked to a Join stage, which is also linked to the aggregator. The join outputs the Dummy row from the row generator together with the OutputRows count from the aggregator.

The join then feeds the transformer, which has three output links. The first is the Output of the container (the OutputRows count). The second writes to the sequential file, which stores details such as the filename and counts; it also stores values obtained through a parameter set (values taken from the database and held in a .txt file). The final link writes to a dataset that stores any error codes and descriptions, so that in the event of an error we can load the details to an errors table.
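In plain terms, the container reduces the input to a single row count, joins that count onto one generated row, and fans the result out to three outputs. A minimal Python sketch of that logic; the column name OutputRows follows the post, while everything else here is illustrative:

```python
def shared_container(input_rows):
    """Mimic the container: aggregate the input to a row count, join the
    count onto a single generated row, and fan out to three outputs."""
    output_rows = len(input_rows)       # Aggregator: count records -> OutputRows
    generated = {"Dummy": 1}            # rowGenerator: one dummy row
    joined = dict(generated, OutputRows=output_rows)  # Join stage
    # Transformer: three output links
    container_output = {"OutputRows": joined["OutputRows"]}
    audit_record = {"OutputRows": joined["OutputRows"],
                    "Filename": "example.txt"}        # sequential file link
    error_record = {"ErrorCode": None, "ErrorDesc": None}  # dataset link
    return container_output, audit_record, error_record
```

Because the join always receives exactly one generated row, the container always emits exactly one output row, whatever the input volume.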

Thanks,
Elliot
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

epitts88 wrote:Again, the formatting will have gone so see description.
The "code" tags preserve whitespace and thus the formatting of your ASCII art. See? :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
epitts88
Participant
Posts: 8
Joined: Tue Nov 08, 2011 8:16 am

Post by epitts88 »

Thanks, I'll use that in the future!