Python urllib2.urlopen()
Fetching a web page with Python
import urllib2

import bs4

url = 'http://mimvp.com'
req = urllib2.Request(url)
content = urllib2.urlopen(req, timeout=600).read()
content = bs4.BeautifulSoup(content)
content = content.prettify()
Printing content shows unreadable binary data instead of HTML, because the response body is still compressed.
Warning message:
/usr/local/lib/python2.7/dist-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
"Some characters could not be decoded, and were "
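A quick way to confirm that the garbled bytes are a compressed body, and not a character-set problem, is to check for the gzip magic number: every gzip stream starts with the two bytes 0x1f 0x8b. The following is a minimal sketch, written for Python 3 and using only the standard library. The helper name looks_gzipped is hypothetical, and the sample bytes are generated locally instead of being fetched from the site.

```python
import gzip

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(body):
    """Return True if the raw response bytes start with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


# Locally generated sample data standing in for a real response body.
compressed = gzip.compress(b"<html>hello</html>")
print(looks_gzipped(compressed))             # True
print(looks_gzipped(b"<html>hello</html>"))  # False
```

If the check returns True for the bytes returned by read(), the body needs to be decompressed before it is passed to BeautifulSoup.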
Solution:
import gzip
import StringIO
import urllib2
import zlib

import bs4

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36',
    # Advertise only the encodings that are decoded below
    'Accept-Encoding': 'gzip, deflate',
}
url = 'http://mimvp.com'
req = urllib2.Request(url, headers=headers)
raw = urllib2.urlopen(req, timeout=600).read()
try:
    # Try gzip first
    content = gzip.GzipFile(fileobj=StringIO.StringIO(raw)).read()
except IOError:
    # Not a gzip stream: fall back to zlib, using the untouched raw bytes
    content = zlib.decompress(raw)
content = bs4.BeautifulSoup(content)
content = content.prettify()
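The try/except fallback above guesses the compression format. A possible alternative is to choose the decoder from the response's Content-Encoding header. The sketch below assumes Python 3 (where gzip.decompress is available) and uses only the standard library. The function name decode_body is hypothetical, and the sample bodies are generated locally so the block runs without network access.

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decompress raw bytes according to a Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Most servers send zlib-wrapped deflate data
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream with no zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No header or an unrecognized value: return the bytes unchanged
    return raw


# Locally generated sample bodies standing in for real responses.
print(decode_body(gzip.compress(b"hi"), "gzip"))     # b'hi'
print(decode_body(zlib.compress(b"hi"), "deflate"))  # b'hi'
print(decode_body(b"hi", None))                      # b'hi'
```

With urllib2 on Python 2, the header value would come from response.info().getheader('Content-Encoding'); on Python 3 with urllib.request, from response.headers.get('Content-Encoding').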
Recommended reading:
Url open encoding (stackoverflow)
Copyright: this article was written, reposted, excerpted, or revised and published by the Mimvp blog (米扑博客). Last updated 2016-01-31 08:39:46.