Performance comparison: re (regular expressions), BeautifulSoup, and lxml
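Extracting content from web pages usually comes down to one of three tools: re (regular expressions), BeautifulSoup, or lxml.
This post on the mimvp blog benchmarks the three against each other.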
Site crawled: https://mimvp.com (note that the script below actually requests https://www.qiushibaike.com/text/page/N/)
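Fields extracted: user ID, joke text, laugh count, comment count
Parsing methods: re (regular expressions), BeautifulSoup, lxml
Metric: wall-clock time spent fetching and parsing the pages with each method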
import re
import time

import requests
from bs4 import BeautifulSoup
from lxml import etree


# Regular expressions: one findall pass over the raw HTML per field.
def re_info(r):
    ids = re.findall("<h2>(.*?)</h2>", r.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', r.text, re.S)
    laughs = re.findall('<span class="stats-vote">.*?<i class="number">(.*?)</i>', r.text, re.S)
    comments = re.findall('<span class="stats-comments">.*?<i class="number">(.*?)</i>', r.text, re.S)
    return [ids, contents, laughs, comments]


# BeautifulSoup (using the lxml parser): CSS selectors inside each article block.
def bs4_info(r):
    soup = BeautifulSoup(r.text, "lxml")
    rows = []
    for info in soup.select("div.article"):
        user_id = info.select("h2")[0].text.strip()
        content = info.select("div.content")[0].text.strip()
        laugh = info.select("span.stats-vote i")[0].text
        comment = info.select("span.stats-comments i")[0].text
        rows.append([user_id, content, laugh, comment])
    return rows  # the original returned only the last article on the page


# lxml: XPath expressions relative to each article block.
def lxml_info(r):
    html = etree.HTML(r.text)
    rows = []
    for info in html.xpath('//div[starts-with(@class,"article block untagged mb15")]'):
        user_id = info.xpath('div[1]//h2/text()')[0]
        # When copying the XPath from the browser, /span has to be appended.
        content = info.xpath('a[1]/div/span/text()')[0].strip()
        laugh = info.xpath('div[2]/span[1]/i/text()')[0]
        comment = info.xpath('div[2]/span[2]/a/i/text()')[0]
        rows.append([user_id, content, laugh, comment])
    return rows  # the original returned only the last article on the page


if __name__ == "__main__":
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1, 14)]
    hds = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
    for name, get_info in [('re', re_info), ('bs4', bs4_info), ('lxml', lxml_info)]:
        # Note: the timed region includes the HTTP request, not just parsing.
        start = time.time()
        for url in url_list:
            r = requests.get(url, headers=hds)
            get_info(r)
        stop = time.time()
        print(name, stop - start)
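Result: regular expressions and lxml are both fast, while BeautifulSoup is slower, taking roughly 1.6 to 1.7 times as long in this run. For large volumes of data, lxml is therefore the recommended choice.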
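That said, lxml's XPath handling seems less forgiving: expressions that use "//" appear more likely to go wrong, so it is safer to spell out the full path, for example div[2]/span[1]/i/text().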
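One common cause of surprises with "//" in lxml, offered here as a guess rather than a diagnosis of the case above, is that a path starting with "//" is always evaluated from the document root, even when xpath() is called on a sub-element. Starting the path with ".//" keeps the search inside the current element. The sketch below uses a made-up two-article page (the markup and names are assumptions, not the real site) to show the difference:

```python
from lxml import etree

# Hypothetical two-article page, loosely modeled on the structure the script expects.
SAMPLE = """
<html><body>
  <div class="article"><h2>alice</h2><span class="stats-vote"><i class="number">10</i></span></div>
  <div class="article"><h2>bob</h2><span class="stats-vote"><i class="number">20</i></span></div>
</body></html>
"""

root = etree.HTML(SAMPLE)
articles = root.xpath('//div[@class="article"]')

# '//' is anchored at the document root, even when called on a sub-element,
# so every article picks up the first <h2> of the whole page.
absolute = [a.xpath('//h2/text()')[0] for a in articles]

# './/' is anchored at the current element, so each article finds its own <h2>.
relative = [a.xpath('.//h2/text()')[0] for a in articles]

print(absolute)  # ['alice', 'alice']
print(relative)  # ['alice', 'bob']
```

A path such as div[1]//h2/text(), as used in the script, is already relative to the current element, which would explain why it works there.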
Measured times in seconds (13 pages, download time included):
re    2.6481516361236572
bs4   4.277244567871094
lxml  2.4631409645080566
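Because the figures above include the time spent downloading each page, network jitter can blur the difference between parsers. A minimal sketch of a parse-only measurement follows, assuming a synthetic in-memory page instead of the real site. The markup, the function names (parse_re, parse_bs4, parse_lxml, time_parser), and the repeat count are all illustrative assumptions, and only the user ID field is extracted to keep it short:

```python
import re
import time

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup; a real test would use pages saved from the target site.
ITEM = '<div class="article"><h2>user{0}</h2><div class="content"><span>text {0}</span></div></div>'
HTML = "<html><body>" + "".join(ITEM.format(i) for i in range(200)) + "</body></html>"


def parse_re(html):
    return re.findall(r"<h2>(.*?)</h2>", html, re.S)


def parse_bs4(html):
    soup = BeautifulSoup(html, "lxml")
    return [h2.text for h2 in soup.select("div.article h2")]


def parse_lxml(html):
    return etree.HTML(html).xpath('//div[@class="article"]//h2/text()')


def time_parser(func, html, repeat=20):
    """Return the best wall-clock time over `repeat` runs, in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(html)
        best = min(best, time.perf_counter() - start)
    return best


PARSERS = [("re", parse_re), ("bs4", parse_bs4), ("lxml", parse_lxml)]

# Check that all three parsers agree before comparing their speed.
results = {name: func(HTML) for name, func in PARSERS}

for name, func in PARSERS:
    print(name, time_parser(func, HTML))
```

Taking the best of several runs reduces the influence of one-off pauses. The absolute numbers will depend on the machine and on how closely the synthetic page resembles real pages.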
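Recommended reading:
Common web page parsing tools for crawlers: lxml / xpath and bs4 / BeautifulSoup
Regular expressions, BeautifulSoup, and lxml: a performance comparison example (Jianshu)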
Copyright: This article was published by the mimvp blog as original, reposted, excerpted, or revised content. Last updated 2018-08-05 22:25:19