network hangup
前几天因工作需要,使用python的httplib库写了一个简单的测试服务脚本。但是出现个好玩的问题。
这里描述一下:
- client端 即运行脚本的服务器
- server端 即被脚本访问的服务器
如果server突然crash后, client端的脚本会出现一直hangup的情况(不是100%出现)
下面就是在现场找到的信息:
[root@lvs1_s ~]# strace -p 18235 (脚本的PID号)
Process 18235 attached - interrupt to quit
read(8, ^C <unfinished ...>
Process 18235 detached
一直hangup在read系统调用中, file descriptor 8是什么? 查看一下:
[root@lvs1_s ~]# lsof -p 18235|grep 8u
python 18235 root 8u IPv4 23796661 0t0 TCP 58.xxx.xx.116:36217->58.xxx.xx.85:https (ESTABLISHED)
python 18235 root 88u IPv4 23795354 0t0 TCP self:49292->10.176.22.42:https (ESTABLISHED)
可以看出, 是socket。抓包也不见有任何数据:
[root@lvs1_s ~]# tcpdump -i eth1 -n 'ip host 58.xxx.xx.85 and port 443'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
解决这个问题,可以有几种方法,比如:
- 使用thread
- 使用child process
- 不使用httplib, 直接使用socket,这样可以使用更多的IO特性
选1的话,把thread给hangup了也是个问题.
选2的话,看上去不够优雅
选3的话,又懒得去分析HTTP协议, 特别是https!.
经过测试,发现只有https才会有这种问题. 测试方法如下:
client端:
import os
import traceback
import sys
import threading
import httplib
import ssl
connection = httplib.HTTPSConnection(host="10.241.31.44", port=8082, timeout=7)
connection.request("GET", "/", headers={})
response = connection.getresponse()
print response.status, response.read()
server端:
import os
import sys
import traceback
import socket
import select
import time
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(('0.0.0.0', 8082))
sock.listen(10)
while True:
try:
s = sock.accept()[0]
fd = s.fileno()
fd_list = [fd]
write_list = select.select([], fd_list, [])[1]
d = 0
for fd in write_list:
time.sleep(50) # 为了方便重现问题这里sleep, 在这50秒里把网线拨了就出现hangup
os.close(fd)
fd_list.remove(fd)
except:
traceback.print_exc()
print 'error!'
time.sleep(5)
为了找个这个问题的原因, 查看httplib库的HTTPSConnection类,发现它的作用和下面的代码一致(当然也引发了这个hangup的问题):
import socket
import ssl
timeout = 10
socket.setdefaulttimeout(timeout)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("10.241.31.44", 8082))
ssl_sock= ssl.wrap_socket(sock, None, None)
很显示,问题出在ssl库的上. 再跟进ssl库,可以看到如下的代码:
# yes, create the SSL object
self._sslobj = _ssl.sslwrap(self._sock, server_side,
keyfile, certfile,
cert_reqs, ssl_version, ca_certs)
if do_handshake_on_connect:
timeout = self.gettimeout()
try:
self.settimeout(None)
self.do_handshake()
finally:
self.settimeout(timeout)
我靠! 因为do_handshake_on_connect参数, 让HTTPSConnection类实例化时timeout参数在do_handshake()之前被清了. fuck~~~~~~
看到只能把HTTPSConnection类的connect方法override一下了.
HTTPSConnection原代码是这样的:
class HTTPSConnection(HTTPConnection):
"This class allows communication via SSL."
default_port = HTTPS_PORT
def __init__(self, host, port=None, key_file=None, cert_file=None,
strict=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
HTTPConnection.__init__(self, host, port, strict, timeout)
self.key_file = key_file
self.cert_file = cert_file
def connect(self):
"Connect to a host on a given (SSL) port."
sock = socket.create_connection((self.host, self.port), self.timeout)
if self._tunnel_host:
self.sock = sock
self._tunnel()
self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
把connect方法override:
import socket
import os
import traceback
import sys
import threading
import httplib
import ssl
from _ssl import CERT_NONE, PROTOCOL_SSLv23
class MyHTTPSConnection(httplib.HTTPSConnection):
def connect(self):
"Connect to a host on a given (SSL) port."
sock = socket.create_connection((self.host, self.port), self.timeout)
if self._tunnel_host:
self.sock = sock
self._tunnel()
#self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
self.sock = ssl.wrap_socket(sock, keyfile=None, certfile=None,
server_side=False, cert_reqs=CERT_NONE,
ssl_version=PROTOCOL_SSLv23, ca_certs=None,
do_handshake_on_connect=False,
suppress_ragged_eofs=True)
# 在允许超时的情况下调用do_handshake()
self.sock.do_handshake()
connection = MyHTTPSConnection(host="10.241.31.44", port=8082, timeout=7)
connection.request("GET", "/", headers={})
response = connection.getresponse()
print response.status, response.read()
哈, override后的MyHTTPSConnection一切正常