Crawling a web page with Python

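import urllib2
import bs4

# Fetch the page and pretty-print the parsed HTML (Python 2)
url = 'http://mimvp.com'
req = urllib2.Request(url)
content = urllib2.urlopen(req, timeout=600).read()
content = bs4.BeautifulSoup(content, 'html.parser')
content = content.prettify()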

The printed content is garbled binary data instead of readable HTML.

 

Warning message:

/usr/local/lib/python2.7/dist-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
  "Some characters could not be decoded, and were "

 

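Solution: send request headers that state which encodings are accepted, then decompress the response body before parsing it.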

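import gzip
import zlib
import StringIO
import urllib2
import bs4

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36',
    # Only advertise encodings that the code below can decode
    'Accept-Encoding': 'gzip, deflate',
}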

url = 'http://mimvp.com'
req = urllib2.Request(url, headers=headers)
content = urllib2.urlopen(req, timeout=600).read()

try:
    # Try gzip first
    content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
except IOError:
    try:
        # Not gzip: try a zlib (deflate) stream
        content = zlib.decompress(content)
    except zlib.error:
        # Not compressed at all: keep the body as received
        pass

content = bs4.BeautifulSoup(content, 'html.parser')
content = content.prettify()
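
Instead of trying each decompressor in turn, the body can be decoded according to the Content-Encoding response header. The following is only a minimal sketch under some assumptions: decompress_body is a hypothetical helper name that does not appear in the original post, it handles only gzip and deflate, and it uses io.BytesIO so that the same code also runs on Python 3.

```python
import gzip
import io
import zlib


def decompress_body(body, content_encoding=''):
    """Return the decompressed bytes of an HTTP response body.

    Hypothetical helper for illustration; handles gzip and deflate only.
    """
    encoding = (content_encoding or '').strip().lower()

    # gzip streams start with the magic bytes 0x1f 0x8b; check them as well,
    # in case the header is missing or wrong.
    if encoding == 'gzip' or body[:2] == b'\x1f\x8b':
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()

    if encoding == 'deflate':
        try:
            # "deflate" is normally a zlib-wrapped stream ...
            return zlib.decompress(body)
        except zlib.error:
            # ... but some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)

    # No known compression: return the body unchanged.
    return body
```

With urllib2, the header value would typically come from the response object, for example response.info().get('Content-Encoding'), and the result of decompress_body can then be passed to bs4.BeautifulSoup as before.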

 

 

Recommended reference:

Url open encoding (Stack Overflow)