For parsing and extracting web page content, the usual tools are re (regular expressions), BeautifulSoup, and lxml.

In this post, the Mimvp Blog compares the performance of all three.

 

Crawl target: https://mimvp.com

Fields extracted: user ID, joke text, number of laughs, number of comments

Parsing methods: re (regular expressions), BeautifulSoup, and lxml

Performance comparison: time how long each parser takes to process the pages

 

import requests
import re
from bs4 import BeautifulSoup
from lxml import etree
import time

## Parse with regular expressions
def re_info(r):
    ids = re.findall('<h2>(.*?)</h2>', r.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', r.text, re.S)
    laughs = re.findall('<span class="stats-vote">.*?<i class="number">(.*?)</i>', r.text, re.S)
    comments = re.findall('<span class="stats-comments">.*?<i class="number">(.*?)</i>', r.text, re.S)
    return [ids, contents, laughs, comments]
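The patterns above pass re.S so that `.` also matches newline characters; without that flag, the non-greedy `.*?` cannot cross line breaks between tags and the match silently fails. A minimal illustration (the HTML snippet is made up):

```python
import re

html = '<div class="content">\n<span>hello\nworld</span>\n</div>'

# Without re.S, '.' stops at newlines, so the pattern never matches:
print(re.findall('<div class="content">.*?<span>(.*?)</span>', html))        # []
# With re.S, '.' also matches '\n', so the capture can span lines:
print(re.findall('<div class="content">.*?<span>(.*?)</span>', html, re.S))  # ['hello\nworld']
```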

## Parse with BeautifulSoup
def bs4_info(r):
    soup = BeautifulSoup(r.text, 'lxml')
    results = []
    for info in soup.select('div.article'):
        id = info.select('h2')[0].text.strip()
        content = info.select('div.content')[0].text.strip()
        laugh = info.select('span.stats-vote i')[0].text
        comment = info.select('span.stats-comments i')[0].text
        results.append([id, content, laugh, comment])  # return inside the loop would stop after the first article
    return results
    
## Parse with lxml
def lxml_info(r):
    html = etree.HTML(r.text)
    results = []
    for info in html.xpath('//div[starts-with(@class,"article block untagged mb15")]'):
        id = info.xpath('div[1]//h2/text()')[0]
        content = info.xpath('a[1]/div/span/text()')[0].strip()  # a /span step must be added by hand to the XPath copied from the browser
        laugh = info.xpath('div[2]/span[1]/i/text()')[0]
        comment = info.xpath('div[2]/span[2]/a/i/text()')[0]
        results.append([id, content, laugh, comment])  # return inside the loop would stop after the first article
    return results

if __name__ == '__main__':
    url_list = ['https://www.qiushibaike.com/text/page/{}/'.format(i) for i in range(1, 14)]
    hds = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
    for name, get_info in [('re', re_info), ('bs4', bs4_info), ('lxml', lxml_info)]:
        start = time.time()
        for url in url_list:
            r = requests.get(url, headers=hds)
            get_info(r)
        stop = time.time()
        print(name, stop - start)
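One caveat about the loop above: time.time() brackets both the HTTP request and the parsing, so network latency dominates the measurement. A sketch that isolates pure parsing time, run against a synthetic page instead of the live site (the article markup, field values, and repetition counts below are made up for illustration):

```python
import re
import timeit

# One fake article block, repeated to simulate a full listing page.
ARTICLE = '''<div class="article block untagged mb15">
<h2>user_01</h2>
<div class="content"><span>some joke text</span></div>
<span class="stats-vote"><i class="number">123</i></span>
<span class="stats-comments"><i class="number">45</i></span>
</div>
'''
PAGE = ARTICLE * 25  # pretend the page holds 25 articles

# Pre-compiling avoids re-parsing the regex source on every call.
PATTERNS = [
    re.compile(r'<h2>(.*?)</h2>', re.S),
    re.compile(r'<div class="content">.*?<span>(.*?)</span>', re.S),
    re.compile(r'<span class="stats-vote">.*?<i class="number">(.*?)</i>', re.S),
    re.compile(r'<span class="stats-comments">.*?<i class="number">(.*?)</i>', re.S),
]

def re_parse(html):
    return [pat.findall(html) for pat in PATTERNS]

# Time parsing only -- no network involved.
elapsed = timeit.timeit(lambda: re_parse(PAGE), number=200)
print('re (parse only, 200 runs):', elapsed)
```

Fetching each page once, caching the text, and timing only the parse calls gives a fairer comparison of the three parsers themselves.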

 

Results: the regular-expression and lxml parsers are both fast, while BeautifulSoup (bs4) is noticeably slower. For larger volumes of data, lxml is therefore the recommended choice.

However, lxml's path handling seems less forgiving: expressions using "//" fail more often than expected, so it is safest to spell out the full relative path, e.g. div[2]/span[1]/i/text().
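In practice the "//" surprises usually come from scoping rather than invalid syntax: an XPath starting with // is evaluated from the document root even when called on a sub-element, whereas .// searches only beneath the current node. A small sketch with made-up markup:

```python
from lxml import etree

# Two sibling articles; etree.HTML wraps the fragment in <html><body>.
html = etree.HTML('''
<div class="article"><h2>alice</h2></div>
<div class="article"><h2>bob</h2></div>
''')

articles = html.xpath('//div[@class="article"]')
first = articles[0]

# '//h2' restarts the search at the document root, so it matches every <h2>:
print(len(first.xpath('//h2')))   # 2 -- not limited to this article
# './/h2' searches relative to the current element:
print(len(first.xpath('.//h2')))  # 1
```

This is why the explicit relative steps in lxml_info (div[2]/span[1]/i/text()) behave predictably, while a copied "//..." path can silently pull in data from other articles.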

re 2.6481516361236572
bs4 4.277244567871094
lxml 2.4631409645080566

 

 
