DS v9 -- DB2 Connection Closed error

rjhankey
Premium Member
Posts: 28
Joined: Mon Sep 19, 2011 2:04 pm
Location: Portage MI

DS v9 -- DB2 Connection Closed error

Post by rjhankey »

I'm fairly convinced this isn't a DataStage issue, but ... the job that runs fine in v8 is having issues in v9.

There's a job with a fairly complex and heavy query that runs well in v8, but is encountering a "DB2 connection closed" error in v9. The text of the error is: CLI0106E Connection is closed. SQLSTATE=08003 SQL30081N A communication error has been detected. Communication function detecting the error: "recv". Protocol specific error codes: "78", "*", "*"

Doing some research -- the suggestions for protocol-specific error code "78" are to look at adjusting any of the following: DB2TCP_CLIENT_CONTIMEOUT, QueryTimeoutInterval in db2cli.ini, or, if the network is slow, tcp_keepinit.

I verified that all of these settings are the same on our v8 & v9 servers, so we haven't missed a setup step on v9 that we did previously on v8. I have suggested to the developer that the job needs to be modified to use v9 DB2 stages, to see if that improves the situation. I suppose another answer might be to split the job up if possible, so that the SQL query isn't nearly as complex.

The link between our DataStage server in Rochester NY and the DB2 server in Boulder CO has also been shown to have network issues. I actually noticed that pings from the DS server to the DB2 DB will spike sporadically, sometimes taking as much as 12,000 ms for 64 bytes, then dropping back down to 60 ms. So, the references I'm finding that seem to point to network issues appear to align.
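
For anyone who wants to quantify spikes like these rather than eyeball them, here is a minimal sketch that filters ping output down to the slow replies. It assumes the replies contain a `time=<ms>` field, as AIX and Linux ping print it; the function name `flag_slow_pings`, the 1000 ms default threshold, and the host name `db2host` are illustrative and not from the original post.

```shell
# Print every ping reply whose round-trip time exceeds a threshold (in ms).
# Reads ordinary ping output on stdin; the threshold defaults to 1000 ms.
# Assumption: each reply line carries a field of the form "time=<number>".
flag_slow_pings() {
    threshold_ms="${1:-1000}"
    awk -v limit="$threshold_ms" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^time=/) {
                    rtt = substr($i, 6) + 0   # numeric part after "time="
                    if (rtt > limit) print
                }
            }
        }'
}

# Usage (db2host is a placeholder for the DB2 server host name):
#   ping -c 300 db2host | flag_slow_pings 1000
```

Running this from the DataStage server toward the DB2 host over a few minutes gives a list of the slow replies that can be handed to the network team.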

http://knowledgebase.progress.com/artic ... icle/20017

http://www-01.ibm.com/support/docview.w ... wg21164785

As I said, I'm not sure this can be solved here -- just curious if anyone else has noticed this behavior when migrating jobs from v8 to v9, and if you found a solution / workaround?
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Try opening a new Support case, if you haven't already.
Choose a job you love, and you will never have to work a day in your life. - Confucius
rjhankey
Premium Member
Posts: 28
Joined: Mon Sep 19, 2011 2:04 pm
Location: Portage MI

Post by rjhankey »

That's good advice -- and we actually have DB2 & AIX Support engaged right now. All signs are pointing to something involved with the network, so our next step is to get network teams involved from the Rochester & Boulder end of things.

Something is preventing the ACK responses from the Boulder end from making their way back in answer to the keepalive probes sent from Rochester. We'll likely need to seek answers from the network teams to get to the bottom of that behavior.
electajay
Participant
Posts: 36
Joined: Thu Apr 15, 2010 11:19 am

Post by electajay »

We found a similar issue in our environment, and we involved the IBM DataStage, DB2 and AIX engineering teams to find the cause. In the end they asked us to apply an AIX patch on the AIX server. db2cli was just hanging, waiting on the DB2 side, and the DB2 side showed the same thing: it was also waiting for something.

The IBM AIX engineering team sent us a patch for TCP/IP at the OS level. After applying the patch, the jobs run fine without any delay, and we see better performance.
A Kumar
rjhankey
Premium Member
Posts: 28
Joined: Mon Sep 19, 2011 2:04 pm
Location: Portage MI

Post by rjhankey »

Do you recall which patch / OS level? We're running 7.1.0.0 right now.
electajay
Participant
Posts: 36
Joined: Thu Apr 15, 2010 11:19 am

Post by electajay »

Here are the details that I got from our AIX team.

This is the patch that they applied: TL 9 SP3

$ oslevel -s
6100-09-03-1415

Thanks
A Kumar
rjhankey
Premium Member
Posts: 28
Joined: Mon Sep 19, 2011 2:04 pm
Location: Portage MI

Post by rjhankey »

We're on a different AIX version (7.1 as opposed to 6.1) ... so the same approach may not work as well here. But, I will pass along the information that an AIX-level patch helped out with a similar scenario.

> oslevel -s
7100-02-04-1341
electajay
Participant
Posts: 36
Joined: Thu Apr 15, 2010 11:19 am

Post by electajay »

We also upgraded, from 7.5 to 8.7, and are facing many issues. 90% of the jobs perform better in the new environment, but 10% of the jobs are giving us hell :) We are resolving them one by one, and this issue is one of them. Discuss it with your AIX team and check with IBM as well. I can give you the PMR number if you want.
A Kumar
rjhankey
Premium Member
Posts: 28
Joined: Mon Sep 19, 2011 2:04 pm
Location: Portage MI

Post by rjhankey »

We suspect this is now a resolved issue for us ... We pursued this for nearly two months, opening a DB2/AIX PMR (77162,122,000) and also investigating our network for packet loss.

Once we realized that the jobs ran well in DS v8 (on AIX 6 / DB2 9), and started noticing this issue in DS v9 (AIX 7 / DB2 10), we revisited the DB2 side of things. I had also modified our testing so I was able to get the error when just running command line (db2batch) queries, without involving DS.

There's a setting in DB2 10 that behaves differently than it did in DB2 9 ... DB2TCP_CLIENT_KEEPALIVE_TIMEOUT

In DB2 10, it defaults to sending keepalive probes every 5 seconds, whereas in DB2 9 it defaulted to the OS setting.

We have changed it to defer to the OS setting again (db2set DB2TCP_CLIENT_KEEPALIVE_TIMEOUT=0) ... and all seems to be running smoothly again.
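
Since identical registry settings on two servers can still hide a changed default, it can help to distinguish "explicitly set" from "unset" when comparing environments. The following is a minimal sketch, assuming `db2set -all` prints one `[level] NAME=value` line per variable; the function name `keepalive_setting` is hypothetical and not part of DB2.

```shell
# Report the explicit value of DB2TCP_CLIENT_KEEPALIVE_TIMEOUT, or "unset"
# when the variable is absent and the DB2 version's default applies.
# Reads `db2set -all` output on stdin.
# Assumption: lines look like "[i] DB2TCP_CLIENT_KEEPALIVE_TIMEOUT=0".
keepalive_setting() {
    awk -F= '
        /DB2TCP_CLIENT_KEEPALIVE_TIMEOUT=/ { print $2; found = 1 }
        END { if (!found) print "unset" }'
}

# Usage, run as the instance owner on each server being compared:
#   db2set -all | keepalive_setting
```

An answer of "unset" on both the old and the new server means each one is following its own DB2 version's default, which is where the behavior described above diverged.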
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Nice catch - thanks for posting the resolution.
-craig

"You can never have too many knives" -- Logan Nine Fingers