Parallel Job Hang - Requesting delayed metadata

dthomas949
Participant
Posts: 7
Joined: Tue Oct 27, 2009 3:38 am

Parallel Job Hang - Requesting delayed metadata

Post by dthomas949 »

This has become quite a large problem for us - any help would be appreciated...

Information Server 8.0.1 Fix Pack 1
System : Linux version 2.6.9-42.ELsmp (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-2)) #1 SMP Wed Jul 12 23:32:02 EDT 2006

Basically we have two main sequences driving the load. The first extracts data by calling a remote Java utility that reads a master journal file, creates XML from it using a defined batch size (100,000), and then uses XML Input stages (a very large number of them) to parse the data and route it to the appropriate dataset.
The second sequence, upon receiving an indicator that the first has finished a batch, loads the data from the datasets into DB2 tables. This is done with very minimal transformation, if any at all (in many cases it is just dataset -> db2). There are about 30 of these parallel jobs in total, but no more than 5 run in parallel at any given time - 6 of them write to a local database and the other 24 or so write to a remote database.

The issue that we have been seeing repeatedly is that one of these parallel jobs (97% of the time it's the same one) will hang in a "Running" state and remain that way indefinitely (we've seen up to 2 full days before killing it). It cannot be stopped by issuing a stop command from Director or by using dsjob. This particular job is just dataset -> table, the table is on the remote database, and usually the full 100,000 rows are being moved.
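
For reference, a sketch of the command-line checks involved (the project name ESReporting is an assumption, taken from the dataset paths later in this thread; substitute your own):

dsjob -jobinfo ESReporting rdstLoadJurMainTxn   # reports Job Status: RUNNING, indefinitely
dsjob -stop ESReporting rdstLoadJurMainTxn      # accepted, but the job never actually stops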

The log on a hung run appears as follows:
0 STARTED Sun Jun 6 19:16:59 2010
Starting Job rdstLoadJurMainTxn. (...)
1 INFO Sun Jun 6 19:17:00 2010
Environment variable settings: (...)
2 INFO Sun Jun 6 19:17:00 2010
Parallel job initiated
3 INFO Sun Jun 6 19:17:01 2010
main_program: IBM WebSphere DataStage Enterprise Edition 8.0.1.4668 (...)
4 INFO Sun Jun 6 19:17:01 2010
main_program: orchgeneral: loaded (...)
5 INFO Sun Jun 6 19:17:01 2010
main_program: Requesting delayed metadata.
6 INFO Sun Jun 6 19:17:03 2010
main_program: APT configuration file: /opt/ibm/InformationServer/Server/Configurations/default.apt (...)
7 INFO Sun Jun 6 19:17:03 2010
jurMainTxnDB2,0: Logging delayed metadata.
8 INFO Sun Jun 6 19:17:04 2010
jurMainTxnDB2,0: Requesting delayed metadata.

Checking the table reveals that the rows were inserted, but the "Transaction committed as part of link close processing." message never gets logged. We run this identical design at many different sites but very rarely see this issue; here it is happening very consistently.
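
The check itself is just a simple count against the staging table, e.g. from the DB2 command line:

db2 "SELECT COUNT(*) FROM DB2ADMIN.JUR_MAIN_TXN_STG"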

We are using a 2-node configuration and we have tried the following solutions:
- auto-purging log files frequently (every 2 days)
- clearing ALL logs nightly
- decreasing our defined batch size

One issue worth noting: we are running DB2 V9.5, and the table belonging to this job is the only one of our 250+ tables using a generated key in its definition. I'm not sure if there is any correlation, but it is an odd coincidence...

CREATE TABLE "DB2ADMIN"."JUR_MAIN_TXN_STG" (
"JUR_TXN_ID" BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY (
START WITH +0
INCREMENT BY +1
MINVALUE +0
MAXVALUE +9223372036854775807
NO CYCLE
CACHE 20
NO ORDER ) ,
"JUR_ID" BIGINT NOT NULL WITH DEFAULT 0 ,
"TXN_ID" BIGINT NOT NULL WITH DEFAULT 0 )
COMPRESS YES
IN "TS_MISC_T" INDEX IN "IS_MISC_T" ;

*Added After Original Post* The array size and transaction size are both set to 10,000 in the DB2 stage - are those acceptable values?
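
For context on those numbers: as we understand the stage, array size is the number of rows sent per insert call and transaction size is the number of rows per commit, so a full 100,000-row batch works out to 100,000 / 10,000 = 10 commits.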

Thanks a lot
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Nothing looks incorrect at first read. The hung job you listed seems to have run for only a couple of seconds - or was the job terminated at some point later than Sunday the 6th?

If you change that one job to an array and commit size of 50 or 100, does the error persist?

Is there a correlation between the job hanging and the size of the job run's data?
dthomas949
Participant
Posts: 7
Joined: Tue Oct 27, 2009 3:38 am

Post by dthomas949 »

Hi - thanks for the reply.

That is how the log looks when the job is hung - it remains that way until the job process is killed from the command line (it does not respond to stop requests). We recently tried setting DS_NO_PROCESS_METADATA = 1, but that didn't seem to help, as we experienced the hang again two nights ago.

Thanks,
DJ
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

How many processes are running for that job? I would try debugging it with $APT_PM_SHOW_PIDS set to true, identifying the process for DB2 and using the UNIX "truss -p {pid}" command (strace on Linux) to see what system calls, if any, the process is making.
Also, see if your DB2 DBA can trace the process status.
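
A minimal sketch of that debugging session (the PID and database name are placeholders; the db2 commands assume a DBA-authorized session on the target instance):

export APT_PM_SHOW_PIDS=True               # player PIDs are then written to the job log
ps -ef | grep osh                          # locate the player process for the DB2 stage
strace -p <pid>                            # Linux equivalent of truss; shows pending system calls
db2 list applications show detail          # DBA side: state of each DB2 connection
db2 get snapshot for locks on <dbname>     # look for lock waits on the staging table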
dthomas949
Participant
Posts: 7
Joined: Tue Oct 27, 2009 3:38 am

Post by dthomas949 »

I'm losing some of your posts due to premium content. The issue we're having with debugging is that this only occurs on production systems, to which we need to request access. Because of the sporadic nature of this failure, there is no guarantee that the hang will even occur while our access is enabled and we are monitoring.

One thing I did notice, however, while gathering logs recently: we have been operating under the impression that we were using a 2-node configuration, but it turns out otherwise - we are using a 3-node configuration:

main_program: APT configuration file: /opt/ibm/InformationServer/Server/Configurations/default.apt
{
node "node1"
{
fastname "mirdsa1"
pools ""
resource disk "/db2logs/esr_data/dsdata/ESReporting/datasets" {pools ""}
resource scratchdisk "/db2logs/esr_data/dsdata/ESReporting/scratch" {pools ""}
}
node "node2"
{
fastname "mirdsa1"
pools ""
resource disk "/db2logs/esr_data/dsdata/ESReporting/datasets" {pools ""}
resource scratchdisk "/db2logs/esr_data/dsdata/ESReporting/scratch" {pools ""}
}
node "node3"
{
fastname "mirdsa1"
pools ""
resource disk "/db2logs/esr_data/dsdata/ESReporting/datasets" {pools ""}
resource scratchdisk "/db2logs/esr_data/dsdata/ESReporting/scratch" {pools ""}
}
}.

I ran the same job in the development environment, where we are using only 1 node, to get the processes - here is what I got:

dsadm 18638 18442 3 02:49 ? 00:00:00 phantom DSD.RUN CopyOfrdstLoadJurMainTxn 0/0/1/0/0
dsadm 18682 1 0 02:49 ? 00:00:00 /bin/sh RT_SC884/OshExecuter.sh R DUMMY -f RT_SC884/OshScript.osh -monitorport 13400 -pf RT_SC884/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252 -string_fields
dsadm 18683 18682 1 02:49 ? 00:00:00 /opt/ibm/InformationServer/Server/PXEngine/bin/osh -f RT_SC884/OshScript.osh -monitorport 13400 -pf RT_SC884/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252 -string_fields
dsadm 18684 18638 2 02:49 ? 00:00:00 phantom DSD.OshMonitor CopyOfrdstLoadJurMainTxn 18683 MSEVENTS.FALSE
dsadm 18698 1 0 02:49 ? 00:00:00 /opt/ibm/InformationServer/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag xxsi814 10000 0 30 node1 xxsi814 1277102951.917812.48fb 0
dsadm 18699 18698 2 02:49 ? 00:00:00 /opt/ibm/InformationServer/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag xxsi814 10000 0 30 node1 xxsi814 1277102951.917812.48fb 0
dsadm 18700 18698 4 02:49 ? 00:00:00 /opt/ibm/InformationServer/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag xxsi814 10000 0 30 node1 xxsi814 1277102951.917812.48fb 0
dsadm 18701 18698 2 02:49 ? 00:00:00 /opt/ibm/InformationServer/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag xxsi814 10000 0 30 node1 xxsi814 1277102951.917812.48fb 0
root 18717 3962 0 02:49 ? 00:00:00 sh -c /files/db2/gtkinst1/sqllib/bin/db2fm -i gtkinst1 -m /opt/ibm/db2/V9.5/lib64/libdb2gcf.so.1 -S 2>&1
gtkinst1 18718 18717 0 02:49 ? 00:00:00 /files/db2/gtkinst1/sqllib/bin/db2fm -i gtkinst1 -m /opt/ibm/db2/V9.5/lib64/libdb2gcf.so.1 -S
gtkinst1 18719 18718 0 02:49 ? 00:00:00 sh -c /files/db2/gtkinst1/sqllib/adm/db2set -i gtkinst1 DB2AUTOSTART 2>&1
gtkinst1 18720 18719 0 02:49 ? 00:00:00 [db2set]

Could the node configuration be a contributing factor here? This mistake would make this particular site the only one where we are using more than 2 nodes.

Thanks,
DJ
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Normally if a 2-node job runs, so will a 3-node one (this is not necessarily true for a 1-node job going to a 2-node job). It would seem that your problem lies on the DB2 side of things, and thus the DataStage partitioning might not be relevant. You could see if switching to a 2-node configuration gets rid of the problem. I would also involve your DB2 DBA to trace active connections when the job "hangs"; perhaps the caching of the keys is causing issues with multiple processes. Is your DB2 table partitioned? Can you change the JUR_MAIN_TXN_STG table so that the key is generated from within DataStage, to see if the error persists?
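
To make those two suggestions concrete - both are untested sketches. First, a 2-node configuration file derived from the 3-node one posted above (same fastname and resource paths); point $APT_CONFIG_FILE at it for the suspect job:

{
node "node1"
{
fastname "mirdsa1"
pools ""
resource disk "/db2logs/esr_data/dsdata/ESReporting/datasets" {pools ""}
resource scratchdisk "/db2logs/esr_data/dsdata/ESReporting/scratch" {pools ""}
}
node "node2"
{
fastname "mirdsa1"
pools ""
resource disk "/db2logs/esr_data/dsdata/ESReporting/datasets" {pools ""}
resource scratchdisk "/db2logs/esr_data/dsdata/ESReporting/scratch" {pools ""}
}
}

Second, the staging table recreated without the identity, so that JUR_TXN_ID is supplied from within DataStage instead (for example by a Surrogate Key Generator stage); tablespace names as in the original DDL:

CREATE TABLE "DB2ADMIN"."JUR_MAIN_TXN_STG" (
"JUR_TXN_ID" BIGINT NOT NULL ,
"JUR_ID" BIGINT NOT NULL WITH DEFAULT 0 ,
"TXN_ID" BIGINT NOT NULL WITH DEFAULT 0 )
COMPRESS YES
IN "TS_MISC_T" INDEX IN "IS_MISC_T" ;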
dthomas949
Participant
Posts: 7
Joined: Tue Oct 27, 2009 3:38 am

Post by dthomas949 »

None of our staging tables that are populated during this stage of the load are partitioned.

The issue we are seeing with any design change, including one as small as the key generation, is that we cannot reproduce this error anywhere but production, so we cannot verify that a change is effective; any design change is therefore met with apprehension. That also brings up the point that there is certainly a correlation with the volume of data: not only do we experience this hang solely in production environments, we see it only at this one particular site, which happens to be our largest current site running on DataStage.

One other thing to note about this particular table: before the inserts into our staging tables occur, we have a process that shuts off the enforcement of foreign keys on all the staging tables. This table, however, has a foreign key that is always enforced, as it references a table with static data.
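
For illustration, the kind of statement that disable process issues (constraint and table names here are hypothetical):

ALTER TABLE DB2ADMIN.SOME_STG_TABLE ALTER FOREIGN KEY FK_SOME_NAME NOT ENFORCED ;

The foreign key on JUR_MAIN_TXN_STG is simply never included in that list.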

Now, with all of that said: while the hang seemingly occurs on this particular job 90% of the time, it has (as recently as 2 nights ago) hung on others. Those jobs have a very similar design - dataset -> DB2 stage with very little transformation/lookup in between, if any - and their tables have no generated keys, no dependent database actions on insert, and no foreign keys...

So to break it down in a much broader, general sense:
  • the load is broken up into 4 main phases
  • there is no identifiable timing pattern to when the hang occurs
  • no single job is always the culprit
  • the only consistency is that it always occurs in phase 1, which is when CPU usage peaks: the extraction process is parsing XML and routing it in many directions, creating the next datasets to be loaded in parallel with these inserts


Are there any system logs I should be investigating? Any DataStage logs outside of the normal job logs? Any DB2 logs outside of db2diag.log?

Thanks again - your input is appreciated.