DS v9 -- DB2 Connection Closed error

rjhankey · Post by **rjhankey** » Tue Oct 28, 2014 7:21 am

I'm fairly convinced this isn't a DataStage issue, but ... the job that runs fine in v8 is having issues in v9.

There's a job with a fairly complex and heavy query that runs well in v8, but is encountering a "DB2 connection closed" error in v9. The text of the error is: CLI0106E Connection is closed. SQLSTATE=08003 SQL30081N A communication error has been detected. Communication function detecting the error: "recv". Protocol specific error codes: "78", "*", "*"

Doing some research -- the suggestions with a "78" are to look at modifying any of the following: DB2TCP_CLIENT_CONTIMEOUT, QueryTimeoutInterval in db2cli.ini, or the network may be slow and we need to adjust tcp_keepinit.
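For comparing those three settings between two servers, a minimal sketch follows. It assumes the path to db2cli.ini is passed in by the caller, since its location varies by install, and that `db2set` and the AIX `no` command are on the PATH (each is skipped if not). The function names `ini_value` and `show_timeout_settings` are made up for illustration.

```shell
# Print the value of KEY from an ini-style file (first "KEY=value" match,
# case-insensitive on the key, surrounding blanks trimmed).
ini_value() {
    file=$1
    key=$2
    awk -F= -v key="$key" '
        index($0, "=") > 0 {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (tolower(name) == tolower(key)) {
                value = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print value
                exit
            }
        }' "$file"
}

# Show the three settings mentioned above for the current server.
# $1 is the path to db2cli.ini (an assumption: supply your own path).
show_timeout_settings() {
    cli_ini=$1
    if command -v db2set >/dev/null 2>&1; then
        echo "DB2TCP_CLIENT_CONTIMEOUT=$(db2set DB2TCP_CLIENT_CONTIMEOUT)"
    fi
    echo "QueryTimeoutInterval=$(ini_value "$cli_ini" QueryTimeoutInterval)"
    if command -v no >/dev/null 2>&1; then
        no -o tcp_keepinit
    fi
}
```

Running `show_timeout_settings /path/to/db2cli.ini` on each server and diffing the output gives a quick side-by-side check.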

I verified that all of these settings are the same on our v8 & v9 servers, so we haven't missed a setup step on v9 that we did previously on v8. I have suggested to the developer that the job needs to be modified to use v9 DB2 stages, to see if that improves the situation. I suppose another answer might be to split the job up if possible, so that the SQL query isn't nearly as complex.

The communication taking place has also been shown to have network issues from our DataStage server in Rochester NY to Boulder CO. I actually noticed that pings from the DS server to the DB2 DB will spike sporadically, sometimes taking as much as 12k ms for 64 bytes, then dropping back down to 60 ms. So, the references I'm finding that seem to point to network issues appear to align.
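To catch those sporadic spikes without watching the terminal, here is a rough sketch that filters ping output for slow replies. It assumes each reply line contains `time=<n> ms`, which is the usual ping format; the function name `find_ping_spikes` and the 1000 ms threshold in the usage line are arbitrary choices for illustration.

```shell
# Read ping output on stdin and print only the replies whose round-trip
# time is greater than the threshold given in milliseconds.
find_ping_spikes() {
    threshold_ms=$1
    awk -v limit="$threshold_ms" '
        /time=/ {
            # Keep the text after "time="; awk reads its leading number.
            rtt = $0
            sub(/.*time=/, "", rtt)
            if (rtt + 0 > limit) print
        }'
}
```

Usage would look like `ping -c 100 <db2-host> | find_ping_spikes 1000`, where the host name is whatever your DB2 server is called.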

http://knowledgebase.progress.com/artic ... icle/20017

http://www-01.ibm.com/support/docview.w ... wg21164785

As I said, I'm not sure this can be solved here -- just curious if anyone else has noticed this behavior when migrating jobs from v8 to v9, and if you found a solution / workaround?

qt_ky · Post by **qt_ky** » Mon Nov 10, 2014 6:50 am

Try opening a new Support case, if you haven't already.

rjhankey · Post by **rjhankey** » Mon Nov 10, 2014 8:15 am

That's good advice -- and we actually have DB2 & AIX Support engaged right now. All signs are pointing to something involved with the network, so our next step is to get network teams involved from the Rochester & Boulder end of things.

Something is preventing the ACK responses from the Boulder end from making their way back in reply to the keepalive probes sent from Rochester. We'll likely need to seek answers from the network teams to get to the bottom of that behavior.

electajay · Post by **electajay** » Mon Nov 10, 2014 9:21 am

We found a similar issue in our environment, and we involved the IBM DataStage, DB2 and AIX engineering teams to track it down. In the end they asked us to apply an AIX patch on the AIX server. db2cli was just hanging, waiting on the DB2 side, and the DB2 side showed the same thing: it was also waiting for something.

The IBM AIX engineering team sent us a patch for TCP/IP at the OS level. After applying the patch, the jobs run fine without any delay, and we see better performance.

rjhankey · Post by **rjhankey** » Mon Nov 10, 2014 9:26 am

Do you recall which patch / OS level? We're running 7.1.0.0 right now.

electajay · Post by **electajay** » Mon Nov 10, 2014 2:34 pm

Here are the details that I got from our AIX team.

This is the patch that they applied: TL 9 SP3

$ oslevel -s
6100-09-03-1415

Thanks

rjhankey · Post by **rjhankey** » Mon Nov 10, 2014 2:52 pm

We're on a different version (v7 as opposed to v6) ... so the same approach may not work as well here. But, I will pass along the information that an AIX-level patch helped out with a similar scenario.

> oslevel -s
7100-02-04-1341
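For anyone comparing levels, the `oslevel -s` string splits on dashes into base release, technology level (TL), service pack (SP) and a build field, which as far as I know is a year/week stamp. A small sketch with a hypothetical helper name, `describe_oslevel`:

```shell
# Split an "oslevel -s" string such as 6100-09-03-1415 into its four
# dash-separated fields and print them with labels.
describe_oslevel() {
    level=$1
    old_ifs=$IFS
    IFS=-
    # Word-split the level on "-" into the function's positional parameters.
    set -- $level
    IFS=$old_ifs
    echo "base=$1 tl=$2 sp=$3 build=$4"
}
```

With the two levels quoted in this thread, `describe_oslevel 6100-09-03-1415` reports TL 09 SP 03 on the 6100 base, and `describe_oslevel 7100-02-04-1341` reports TL 02 SP 04 on the 7100 base.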

electajay · Post by **electajay** » Mon Nov 10, 2014 2:58 pm

We also upgraded, from 7.5 to 8.7, and are facing many issues. 90% of the jobs perform better in the new environment, but the other 10% are giving us hell.

We are resolving them one by one, and this issue is one of them. Discuss it with your AIX team and check with IBM as well. I can give you the PMR number if you want.

rjhankey · Post by **rjhankey** » Tue Dec 02, 2014 2:45 pm

We suspect this is now a resolved issue for us ... We pursued this for nearly two months, opening a DB2/AIX PMR (77162,122,000) and also investigated our network for packet loss.

Once we realized that the jobs ran well in DS v8 (on AIX 6 / DB2 9), and started noticing this issue in DS v9 (AIX 7 / DB2 10), we revisited the DB2 side of things. I had also modified our testing so I was able to get the error when just running command line (db2batch) queries, without involving DS.

There's a setting in DB2 10 that behaves differently than it did in DB2 9 ... DB2TCP_CLIENT_KEEPALIVE_TIMEOUT

In DB2 10, it sends keepalive probes every 5 seconds by default, whereas in DB2 9 it defaulted to the OS setting.

We have changed that to default again to the OS (db2set DB2TCP_CLIENT_KEEPALIVE_TIMEOUT=0) ... and all seems to be running smoothly again.
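A minimal sketch of applying that change from a script, for anyone who wants to repeat it across several servers. The `db2set` command and variable name are the ones quoted above; the wrapper name `set_keepalive_timeout` and the `DB2SET_CMD` override (handy for a dry run with `echo`) are hypothetical and only there for illustration.

```shell
# Set DB2TCP_CLIENT_KEEPALIVE_TIMEOUT to the given value via db2set.
# DB2SET_CMD can be overridden (for example with "echo") to preview the
# command instead of running it.
set_keepalive_timeout() {
    value=$1
    db2set_cmd=${DB2SET_CMD:-db2set}
    "$db2set_cmd" "DB2TCP_CLIENT_KEEPALIVE_TIMEOUT=$value"
}

# Dry run:   DB2SET_CMD=echo set_keepalive_timeout 0
# Real run:  set_keepalive_timeout 0
# Verify:    db2set -all | grep KEEPALIVE
```

Run it as the DB2 instance owner, and re-test with the same db2batch query that reproduced the error before changing anything else.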

chulett · Post by **chulett** » Tue Dec 02, 2014 3:19 pm

Nice catch - thanks for posting the resolution.