Parallel Job Hang - Requesting delayed metadata
Posted: Wed Jun 16, 2010 10:14 am
This has become quite a large problem for us - any help would be appreciated...
Information Server 8.0.1 Fix Pack 1
System : Linux version 2.6.9-42.ELsmp (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-2)) #1 SMP Wed Jul 12 23:32:02 EDT 2006
Basically we have two main sequences driving the process. The first extracts data by calling a remote Java utility that reads a master journal file, creates XML from it using a defined batch size (100,000), and then uses XML Input stages (a very large number of them) to parse the data and route it to the appropriate data set.
The second sequence, upon receiving an indicator that the first has finished a batch, loads the data from the data sets into DB2 tables. This involves very minimal transformation, if any at all (in many cases it is just dataset -> db2). There are about 30 of these parallel jobs in total, but no more than 5 run in parallel at any given time; 6 of them write to a local database and the other 24 or so write to a remote database.
The issue we have been seeing repeatedly is that one of these parallel jobs (97% of the time it's the same one) hangs in a "Running" state and remains that way indefinitely (we've seen up to 2 full days before killing it). It cannot be stopped by issuing a stop command from Director or via dsjob. This particular job is just dataset -> table, the table is on the remote database, and usually the full 100,000 rows are being moved.
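When the job wedges like this and neither Director nor dsjob can stop it, one workaround we use as a sketch (not an official procedure; the job name below is taken from our log, and process layouts may differ per install) is to locate any leftover parallel-engine (osh) processes on the server and terminate them by hand before resetting the job:

```shell
#!/bin/sh
# List any parallel-engine (osh) conductor/player processes still
# attached to the hung job (job name from the log: rdstLoadJurMainTxn).
ps -ef | grep '[o]sh' | grep 'rdstLoadJurMainTxn' \
  || echo "no osh processes found for rdstLoadJurMainTxn"

# After noting the PIDs, they can be terminated manually, e.g.:
#   kill <conductor_pid>
# and the job reset afterwards from Director (or dsjob -run -mode RESET).
```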
The log on a hung run appears as follows:
0 STARTED Sun Jun 6 19:16:59 2010
Starting Job rdstLoadJurMainTxn. (...)
1 INFO Sun Jun 6 19:17:00 2010
Environment variable settings: (...)
2 INFO Sun Jun 6 19:17:00 2010
Parallel job initiated
3 INFO Sun Jun 6 19:17:01 2010
main_program: IBM WebSphere DataStage Enterprise Edition 8.0.1.4668 (...)
4 INFO Sun Jun 6 19:17:01 2010
main_program: orchgeneral: loaded (...)
5 INFO Sun Jun 6 19:17:01 2010
main_program: Requesting delayed metadata.
6 INFO Sun Jun 6 19:17:03 2010
main_program: APT configuration file: /opt/ibm/InformationServer/Server/Configurations/default.apt (...)
7 INFO Sun Jun 6 19:17:03 2010
jurMainTxnDB2,0: Logging delayed metadata.
8 INFO Sun Jun 6 19:17:04 2010
jurMainTxnDB2,0: Requesting delayed metadata.
Checking the table reveals the rows were inserted, but the "Transaction committed as part of link close processing." message never gets logged. We run this identical design at many different sites but very rarely see this issue; here it is happening very consistently.
We are using a 2-node configuration and we have tried the following solutions:
- auto-purging log files frequently (every 2 days)
- clearing ALL logs nightly
- decreasing our defined batch size
One issue worth noting: we are running DB2 V9.5, and the table belonging to this job is the only one of our 250+ tables that uses a generated (identity) key in its definition. I'm not sure if there is any correlation, but it is an odd coincidence...
CREATE TABLE "DB2ADMIN"."JUR_MAIN_TXN_STG" (
"JUR_TXN_ID" BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY (
START WITH +0
INCREMENT BY +1
MINVALUE +0
MAXVALUE +9223372036854775807
NO CYCLE
CACHE 20
NO ORDER ) ,
"JUR_ID" BIGINT NOT NULL WITH DEFAULT 0 ,
"TXN_ID" BIGINT NOT NULL WITH DEFAULT 0 )
COMPRESS YES
IN "TS_MISC_T" INDEX IN "IS_MISC_T" ;
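If that identity column does turn out to be implicated (for instance, sequence-cache contention on the remote database), one thing that may be worth trying purely as an experiment is enlarging the identity cache so DB2 reserves values in bigger chunks. The cache size of 1000 below is an arbitrary illustration, not a recommendation:

```sql
-- Hypothetical experiment: enlarge the identity cache so fewer
-- catalog trips are needed while generating JUR_TXN_ID values.
ALTER TABLE "DB2ADMIN"."JUR_MAIN_TXN_STG"
  ALTER COLUMN "JUR_TXN_ID" SET CACHE 1000;
```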
*Added After Original Post* The array size and transaction size are both set to 10,000 in the db2 stage - are those acceptable values?
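For what it's worth, the commit cadence those settings imply can be worked out directly. The figures below come straight from the numbers above; the arithmetic is only illustrative, not a claim about what DB2 does internally:

```shell
#!/bin/sh
# Figures from the post: up to 100,000 rows per batch,
# array size 10,000 (rows per insert round trip),
# transaction size 10,000 (rows per commit).
rows_per_batch=100000
array_size=10000
transaction_size=10000

echo "array inserts per batch: $(( rows_per_batch / array_size ))"      # prints 10
echo "commits per batch:       $(( rows_per_batch / transaction_size ))" # prints 10
```

So at most about 10 commits happen per batch, with the last one issued at link close, which is exactly the message that never appears in the hung runs.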
Thanks a lot