Performance comparison of re (regular expressions), BeautifulSoup, and lxml

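Parsing and extracting web page content usually relies on re (regular expressions), BeautifulSoup, or lxml.

This post on 米扑博客 benchmarks the three against each other.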

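Content scraped: user ID, the text of each posted joke, the number of "funny" votes, and the number of comments

Extraction methods: re (regular expressions), BeautifulSoup, and lxml

Performance metric: the time each method takes to parse the page content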

import requests
import re
from bs4 import BeautifulSoup
from lxml import etree
import time

## Regular expressions
def re_info(r):
    # re.S lets "." match newlines, so each pattern can span several lines of HTML
    ids = re.findall("<h2>(.*?)</h2>",r.text,re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>',r.text,re.S)
    laughs = re.findall('<span class="stats-vote">.*?<i class="number">(.*?)</i>',r.text,re.S)
    comments = re.findall('<span class="stats-comments">.*?<i class="number">(.*?)</i>',r.text,re.S)
    return [ids,contents,laughs,comments]

## BeautifulSoup
def bs4_info(r):
    soup = BeautifulSoup(r.text,"lxml")
    results = []
    for info in soup.select("div.article"):
        user_id = info.select("h2")[0].text.strip()
        content = info.select("div.content")[0].text.strip()
        laugh = info.select("span.stats-vote i")[0].text
        comment = info.select("span.stats-comments i")[0].text
        results.append([user_id,content,laugh,comment])
    # return after the loop; returning inside it would stop at the first entry
    return results
    
## lxml
def lxml_info(r):
    html = etree.HTML(r.text)
    infos = html.xpath('//div[starts-with(@class,"article block untagged mb15")]')
    results = []
    for info in infos:
        user_id = info.xpath('div[1]//h2/text()')[0]
        content = info.xpath('a[1]/div/span/text()')[0].strip()  # when copying the XPath from the browser, append the /span step
        laugh = info.xpath('div[2]/span[1]/i/text()')[0]
        comment = info.xpath('div[2]/span[2]/a/i/text()')[0]
        results.append([user_id,content,laugh,comment])
    # return after the loop; returning inside it would stop at the first entry
    return results

if __name__ == "__main__":
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1,14)]
    hds = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
    for name,get_info in [('re',re_info),('bs4',bs4_info),('lxml',lxml_info)]:
        start = time.time()
        for url in url_list:
            r = requests.get(url,headers = hds)
            get_info(r)
        stop = time.time()
        print(name,stop-start)
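
Because the loop above times requests.get together with the parsing call, network latency is part of every figure it prints. A minimal sketch of a parse-only measurement follows, assuming the pages have already been downloaded into a list of strings. SAMPLE_PAGE, re_parse, and time_parser are hypothetical names introduced here, and the sample markup is only a guess at the structure the patterns expect.

```python
import re
import time

# Hypothetical stand-in for one downloaded page; the real script would keep r.text here.
SAMPLE_PAGE = """
<div class="article block untagged mb15">
  <div class="author"><h2>user_a</h2></div>
  <a><div class="content"><span>first joke</span></div></a>
  <div class="stats">
    <span class="stats-vote"><i class="number">12</i></span>
    <span class="stats-comments"><a><i class="number">3</i></a></span>
  </div>
</div>
"""

def re_parse(text):
    """Same patterns as re_info, but taking the HTML string directly."""
    ids = re.findall(r"<h2>(.*?)</h2>", text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', text, re.S)
    laughs = re.findall(r'<span class="stats-vote">.*?<i class="number">(.*?)</i>', text, re.S)
    comments = re.findall(r'<span class="stats-comments">.*?<i class="number">(.*?)</i>', text, re.S)
    return [ids, contents, laughs, comments]

def time_parser(parse, pages, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` passes through all pages."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for page in pages:
            parse(page)
        best = min(best, time.perf_counter() - start)
    return best

# In the real test this list would hold the text of each downloaded page.
pages = [SAMPLE_PAGE] * 13
elapsed = time_parser(re_parse, pages)
print("re", elapsed)
```

The BeautifulSoup and lxml functions could be timed the same way once they accept a string instead of a response object, so that all three parsers see identical input and no download time is counted.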

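Results: regular expressions and lxml both run quickly, while BeautifulSoup is noticeably slower, so lxml is the recommended choice when the volume of data is large. The measured times, in seconds, cover 13 pages and include the time spent downloading each page, since requests.get runs inside the timed loop:

re 2.6481516361236572
bs4 4.277244567871094
lxml 2.4631409645080566

That said, lxml seems less forgiving about path expressions: using "//" is more likely to go wrong, so it is safer to spell out the full path, for example div[2]/span[1]/i/text(). One likely reason is that a path beginning with "//" is evaluated from the document root even when it is called on a sub-element, whereas ".//" stays relative to that element.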
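
The remark about "//" can be illustrated with a small lxml sketch. It assumes a hypothetical two-entry page (SAMPLE is invented for this example, not taken from the real site) and shows how a path starting with "//" differs from one starting with ".//" when called on a sub-element.

```python
from lxml import etree

# Hypothetical two-entry page, loosely shaped like the markup the script expects.
SAMPLE = """
<html><body>
  <div class="article"><div class="author"><h2>user_a</h2></div></div>
  <div class="article"><div class="author"><h2>user_b</h2></div></div>
</body></html>
"""

root = etree.HTML(SAMPLE)
articles = root.xpath('//div[@class="article"]')
second = articles[1]

# '//h2' restarts from the document root, so it matches every <h2> on the page.
absolute = second.xpath('//h2/text()')
# './/h2' searches only the descendants of the context element.
relative = second.xpath('.//h2/text()')

print(absolute)
print(relative)
```

Inside a per-entry loop, the ".//" form or a path without a leading slash (as in div[1]//h2/text()) keeps each lookup scoped to the current entry.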
