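PHP + Selenium + WebDriver: Scraping the Mimvp (米扑科技) Home Page
The previous Mimvp blog post, "PHP: getting a page's title, description, keywords and other meta information", summarized the common ways to extract page content. Regular expressions solve roughly 90% of scraping problems. The remaining 10% involve anti-crawler techniques that stop you from fetching the page content at all, so there is nothing for a regular expression to match. This post tackles half of that remaining 10%: scraping pages with PHP + Selenium + WebDriver.
Even with this post you can only cover about 95% of scraping cases. The last 5% is discussed in the summary at the end, where Mimvp Proxy (米扑代理) offers a solution.
Setting up the PHP + Selenium + WebDriver development environment
Execution order: PHP -> PHP-Webdriver -> Selenium -> Firefox / Chrome
PHP uses PHP-Webdriver to send commands to Selenium, and Selenium in turn drives the browser (Firefox or Chrome).
PHP-Webdriver is a Selenium client library maintained by Facebook. It lets PHP talk to Selenium and can be installed with composer.
PHP-Webdriver home page: https://github.com/facebook/php-webdriver (GitHub)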
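To make the chain above more concrete, here is a rough sketch of the kind of HTTP request a WebDriver client sends to Selenium to open a browser session. It assumes the legacy JSON wire protocol ("desiredCapabilities" posted to the hub's /session endpoint). The function name new_session_request is hypothetical, and the real PHP-Webdriver library sends more fields than this simplified payload.

```php
<?php
// Hypothetical helper for illustration only; PHP-Webdriver builds this request internally.
// Describes the "new session" call: POST <hub>/session with the desired browser name.
function new_session_request(string $hubUrl, string $browserName): array
{
    return [
        'method' => 'POST',
        'url'    => rtrim($hubUrl, '/') . '/session',
        'body'   => json_encode(['desiredCapabilities' => ['browserName' => $browserName]]),
    ];
}

// Print the request that would ask Selenium to start Firefox.
print_r(new_session_request('http://localhost:4444/wd/hub', 'firefox'));
```

Selenium answers such a request with a session id, and every later command (open a URL, read the title, take a screenshot) is another HTTP call under that session.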
0. System environment
Mac OS 10.13.2, Ubuntu 14.04, CentOS 7.2 (all configured successfully)
PHP 5.6.30 and PHP 7.2.0
selenium-server-standalone-3.4.0.jar
firefox geckodriver (github)
php-webdriver (facebook)
chromedriver_mac64.zip v2.30 (taobao)
chrome-mac_v65.0.3304.0.zip (Chromium, which is not the same as Google Chrome)
Chromium home page: http://www.chromium.org/chromium-os
Older Firefox releases: http://ftp.mozilla.org/pub/firefox/releases/ (recommended)
1) Mac OS environment
2) Ubuntu 14.04 environment
selenium-server-standalone-3.8.0.jar
$ sudo apt-get -y install firefox
$ sudo apt-get -y install google-chrome-stable
$ sudo apt-get -y install chromium-browser
$
$ firefox -v
Mozilla Firefox 57.0.3
$
$ geckodriver -v
1514872602447 geckodriver INFO geckodriver 0.19.0
1514872602448 webdriver::httpapi DEBUG Creating routes
1514872602459 geckodriver INFO Listening on 127.0.0.1:4444
$
$ chrome -version
Google Chrome 63.0.3239.108
$
$ chromedriver -v
ChromeDriver 2.34.522913 (36222509aa6e819815938cbf2709b4849735537c)
Starting Selenium on Ubuntu 14.04
Method that works now:
DISPLAY=:1 xvfb-run java -jar selenium-server-standalone-3.8.0.jar -port 8888
Older method that also worked:
/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &
export DISPLAY=:7
java -jar selenium-server-standalone-3.8.0.jar -port 8888
Test URL:
http://localhost:8888/wd/hub
chromedriver to Chrome version mapping (for Selenium)
chromedriver version | Supported Chrome versions |
---|---|
v2.34 | v61-63 |
v2.33 | v60-62 |
v2.32 | v59-61 |
v2.31 | v58-60 |
v2.30 | v58-60 |
v2.29 | v56-58 |
v2.28 | v55-57 |
v2.27 | v54-56 |
v2.26 | v53-55 |
v2.25 | v53-55 |
v2.24 | v52-54 |
v2.23 | v51-53 |
v2.22 | v49-52 |
v2.21 | v46-50 |
v2.20 | v43-48 |
v2.19 | v43-47 |
v2.18 | v43-46 |
v2.17 | v42-43 |
v2.16 | v42-45 |
v2.15 | v40-43 |
v2.14 | v39-42 |
v2.13 | v38-41 |
v2.12 | v36-40 |
v2.11 | v36-40 |
v2.10 | v33-36 |
v2.9 | v31-34 |
v2.8 | v30-33 |
v2.7 | v30-33 |
v2.6 | v29-32 |
v2.5 | v29-32 |
v2.4 | v29-32 |
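The table can also be consulted from code before starting a scrape. The sketch below copies only the first six rows of the table above and looks up which chromedriver versions cover a given Chrome major version. The names CHROMEDRIVER_SUPPORT, chrome_major and compatible_chromedrivers are hypothetical, and the code assumes PHP 7.1 or later.

```php
<?php
// First six rows of the mapping table: chromedriver version => [min, max] Chrome major version.
const CHROMEDRIVER_SUPPORT = [
    '2.34' => [61, 63],
    '2.33' => [60, 62],
    '2.32' => [59, 61],
    '2.31' => [58, 60],
    '2.30' => [58, 60],
    '2.29' => [56, 58],
];

// Extracts the major version from text such as "Google Chrome 63.0.3239.108".
function chrome_major(string $versionOutput): ?int
{
    return preg_match('/(\d+)\.\d+\.\d+\.\d+/', $versionOutput, $m) ? (int) $m[1] : null;
}

// Returns every chromedriver version in the table whose range includes $chromeMajor.
function compatible_chromedrivers(int $chromeMajor): array
{
    $matches = [];
    foreach (CHROMEDRIVER_SUPPORT as $driver => [$min, $max]) {
        if ($chromeMajor >= $min && $chromeMajor <= $max) {
            $matches[] = $driver;
        }
    }
    return $matches;
}

$major = chrome_major('Google Chrome 63.0.3239.108');
echo implode(', ', compatible_chromedrivers($major)), "\n";
```

An empty result means the installed Chrome falls outside the rows copied here, so either extend the array with more rows from the table or install a different chromedriver.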
Browsers supported by selenium + php-webdriver (see WebDriverBrowserType.php):
const FIREFOX = 'firefox';
const FIREFOX_PROXY = 'firefoxproxy';
const FIREFOX_CHROME = 'firefoxchrome';
const GOOGLECHROME = 'googlechrome';
const SAFARI = 'safari';
const SAFARI_PROXY = 'safariproxy';
const OPERA = 'opera';
const MICROSOFT_EDGE = 'MicrosoftEdge';
const IEXPLORE = 'iexplore';
const IEXPLORE_PROXY = 'iexploreproxy';
const CHROME = 'chrome';
const KONQUEROR = 'konqueror';
const MOCK = 'mock';
const IE_HTA = 'iehta';
const ANDROID = 'android';
const HTMLUNIT = 'htmlunit';
const IE = 'internet explorer';
const IPHONE = 'iphone';
const IPAD = 'iPad';
const PHANTOMJS = 'phantomjs';
1. Install composer and selenium
1) Create the install directory
sudo mkdir /opt/php-selenium
sudo chown -R homer:staff /opt/php-selenium
cd /opt/php-selenium/
2) Download composer.phar
curl -sS https://getcomposer.org/installer | php
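3) Create composer.json
composer.phar needs a composer.json file before it can install anything
vim composer.json
Add the following content:
{
    "require": {
        "facebook/webdriver": "dev-master",
        "phpunit/phpunit": "*"
    }
}
This installs webdriver and phpunit. The former connects to browsers such as Chrome and Firefox; the latter is a PHP testing tool.
If the install prints output like the following, it succeeded: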
$ composer.phar install
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 26 installs, 0 updates, 0 removals
  - Installing facebook/webdriver (dev-master 575600d): Cloning 575600dfcf from cache
  - Installing symfony/yaml (v3.4.2): Loading from cache
  - Installing sebastian/version (2.0.1): Loading from cache
  - Installing sebastian/resource-operations (1.0.0): Loading from cache
  - Installing sebastian/recursion-context (2.0.0): Loading from cache
  - Installing sebastian/object-enumerator (2.0.1): Loading from cache
  - Installing sebastian/global-state (1.1.1): Loading from cache
  - Installing sebastian/exporter (2.0.0): Loading from cache
  - Installing sebastian/environment (2.0.0): Loading from cache
  - Installing sebastian/diff (1.4.3): Loading from cache
  - Installing sebastian/comparator (1.2.4): Loading from cache
  - Installing doctrine/instantiator (1.0.5): Loading from cache
  - Installing phpunit/php-text-template (1.2.1): Loading from cache
  - Installing phpunit/phpunit-mock-objects (3.4.4): Loading from cache
  - Installing phpunit/php-timer (1.0.9): Loading from cache
  - Installing phpunit/php-file-iterator (1.4.5): Loading from cache
  - Installing sebastian/code-unit-reverse-lookup (1.0.1): Loading from cache
  - Installing phpunit/php-token-stream (1.4.12): Loading from cache
  - Installing phpunit/php-code-coverage (4.0.8): Loading from cache
  - Installing webmozart/assert (1.2.0): Loading from cache
  - Installing phpdocumentor/reflection-common (1.0.1): Loading from cache
  - Installing phpdocumentor/type-resolver (0.4.0): Loading from cache
  - Installing phpdocumentor/reflection-docblock (3.3.2): Loading from cache
  - Installing phpspec/prophecy (1.7.3): Loading from cache
  - Installing myclabs/deep-copy (1.7.0): Loading from cache
  - Installing phpunit/phpunit (5.7.26): Loading from cache
symfony/yaml suggests installing symfony/console (For validating YAML files using the lint command)
sebastian/global-state suggests installing ext-uopz (*)
phpunit/phpunit suggests installing phpunit/php-invoker (~1.1)
Writing lock file
Generating autoload files
4) Inspect the files generated by the install
$ ll /opt/php-selenium
-rw-r--r--   1 homer staff     102 12 24 18:31 composer.json
-rw-r--r--   1 homer staff   48580 12 24 18:33 composer.lock
-rwxr-xr-x   1 homer staff 1855013 12 24 18:26 composer.phar
drwxr-xr-x  14 homer staff     476 12 24 18:33 vendor
composer creates a vendor directory automatically. Its contents are:
$ ll /opt/php-selenium/vendor/
-rw-r--r--   1 homer staff 178 12 24 18:33 autoload.php
drwxr-xr-x   3 homer staff 102 12 24 18:33 bin
drwxr-xr-x  11 homer staff 374 12 24 18:33 composer
drwxr-xr-x   3 homer staff 102 12 24 18:33 doctrine
drwxr-xr-x   3 homer staff 102 12 24 18:33 facebook
drwxr-xr-x   3 homer staff 102 12 24 18:33 myclabs
drwxr-xr-x   5 homer staff 170 12 24 18:33 phpdocumentor
drwxr-xr-x   3 homer staff 102 12 24 18:33 phpspec
drwxr-xr-x   9 homer staff 306 12 24 18:33 phpunit
drwxr-xr-x  12 homer staff 408 12 24 18:33 sebastian
drwxr-xr-x   3 homer staff 102 12 24 18:33 symfony
drwxr-xr-x   3 homer staff 102 12 24 18:33 webmozart
Note the generated PHP file vendor/autoload.php.
This file is essential: every PHP script below that uses webdriver must include it.
2. Download selenium-server
selenium-server download page: http://selenium-release.storage.googleapis.com/index.html
Download a relatively old version. Mimvp uses selenium-server-standalone-3.4.0.jar.
Start the selenium-server service with:
java -jar selenium-server-standalone-3.4.0.jar
It listens on port 4444 by default. Open the following URL to check it:
http://localhost:4444/wd/hub
To use a custom port, run:
java -jar selenium-server-standalone-3.4.0.jar -port 8888
The output looks like this:
$ java -jar selenium-server-standalone-3.4.0.jar -port 8888
19:08:20.176 INFO - Selenium build info: version: '3.4.0', revision: 'unknown'
19:08:20.177 INFO - Launching a standalone Selenium Server
2017-12-24 19:08:20.205:INFO::main: Logging initialized @273ms to org.seleniumhq.jetty9.util.log.StdErrLog
19:08:20.262 INFO - Driver provider org.openqa.selenium.ie.InternetExplorerDriver registration is skipped: registration capabilities Capabilities [{ensureCleanSession=true, browserName=internet explorer, version=, platform=WINDOWS}] does not match the current platform MAC
19:08:20.263 INFO - Driver provider org.openqa.selenium.edge.EdgeDriver registration is skipped: registration capabilities Capabilities [{browserName=MicrosoftEdge, version=, platform=WINDOWS}] does not match the current platform MAC
19:08:20.263 INFO - Driver class not found: com.opera.core.systems.OperaDriver
19:08:20.263 INFO - Driver provider com.opera.core.systems.OperaDriver registration is skipped: Unable to create new instances on this machine.
19:08:20.263 INFO - Driver class not found: com.opera.core.systems.OperaDriver
19:08:20.263 INFO - Driver provider com.opera.core.systems.OperaDriver is not registered
2017-12-24 19:08:20.314:INFO:osjs.Server:main: jetty-9.4.3.v20170317
2017-12-24 19:08:20.358:INFO:osjsh.ContextHandler:main: Started o.s.j.s.ServletContextHandler@d8355a8{/,null,AVAILABLE}
2017-12-24 19:08:20.404:INFO:osjs.AbstractConnector:main: Started ServerConnector@146ba0ac{HTTP/1.1,[http/1.1]}{0.0.0.0:8888}
2017-12-24 19:08:20.405:INFO:osjs.Server:main: Started @473ms
19:08:20.405 INFO - Selenium Server is up and running
For more options, run:
java -jar selenium-server-standalone-3.4.0.jar --help
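Before running a scraper it can help to confirm from PHP that the server is actually listening. The sketch below requests the hub's /status endpoint and treats any JSON object with a "value" key as a sign of life. The exact fields in the response differ between Selenium versions, so nothing more specific is checked. The function names are hypothetical, and the port 8888 matches the custom-port example above.

```php
<?php
// Returns true when $body decodes to a JSON object that has a "value" key,
// which is the common shape of a WebDriver status response.
function looks_like_selenium_status(string $body): bool
{
    $data = json_decode($body, true);
    return is_array($data) && array_key_exists('value', $data);
}

// Requests <hub>/status; returns false if the server cannot be reached in time.
// Needs allow_url_fopen to be enabled in php.ini.
function selenium_is_up(string $hubUrl, int $timeoutSeconds = 3): bool
{
    $context = stream_context_create(['http' => ['timeout' => $timeoutSeconds]]);
    $body = @file_get_contents(rtrim($hubUrl, '/') . '/status', false, $context);
    return $body !== false && looks_like_selenium_status($body);
}

var_dump(selenium_is_up('http://localhost:8888/wd/hub'));
```

If this prints false, check that the java process is still running and that the port in the URL matches the -port option.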
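3. Install firefox and chrome
selenium supports firefox, chrome and several other browsers. This post covers firefox and chrome; pick one of the two.
1) Install firefox
Mimvp uses firefox-45.0.2.tar.bz2 together with selenium-server-standalone-3.4.0.jar
Unpack firefox-45.0.2.tar.bz2, then symlink the binary into a bin directory: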
ln -s /Users/homer/Downloads/myApp/firefox /usr/local/bin/firefox
On Mac OS the default firefox install path is:
$ ps -ef | grep firefox
/Applications/Firefox.app/Contents/MacOS/firefox
firefox and selenium versions have to match. Do not upgrade firefox casually, or selenium will report that it cannot find firefox.
2) Install chrome (recommended)
a) chromedriver download page: http://npm.taobao.org/mirrors/chromedriver/ (Taobao mirror)
Download chromedriver_mac64.zip v2.30
b) After unpacking, the file is named chromedriver
c) Copy it into a bin directory:
cp /opt/php-selenium/chromedriver /usr/local/bin/chromedriver
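d) Allow it to run (important; this took a long time to track down)
Mac OS does not trust binaries downloaded from outside the App Store by default, so chromedriver has to be allowed explicitly
Mac OS System Preferences -> Security & Privacy -> General -> enter your password and allow access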
4. PHP + Selenium + Firefox scraping example
1) First, start selenium-server
To avoid exposing the well-known default port 4444, specify a different port such as 8888
java -jar selenium-server-standalone-3.4.0.jar -port 8888
2) Include the PHP file generated in step 1, sub-step 4)
require_once('/opt/php-selenium/vendor/autoload.php');
3) Scrape the Mimvp home page
<?php
require_once('/opt/php-selenium/vendor/autoload.php');

// php-webdriver classes are namespaced, so import the ones used below
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

header("content-type:text/html; charset=utf-8");    // use the target page's charset if it is not utf-8

$url = "https://mimvp.com";     // utf-8 by default
$res_array = array();
try {
    $driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', DesiredCapabilities::firefox());
    $driver->get($url);

    $curr_url = $driver->getCurrentURL();           // current URL
    $page_source = $driver->getPageSource();        // page content
    $title = $driver->getTitle();                   // page title
    $cookie = $driver->manage()->getCookies();      // cookies
    $driver->takeScreenshot('../results/screenshot-'.date('Y-m-d__H:i:s').'.png');  // take a screenshot and save it to this path

    $res_array['curr_url'] = $curr_url;
    $res_array['page_source'] = $page_source;
    $res_array['title'] = $title;
    $res_array['cookie'] = $cookie;
    $res_array['page_info'] = get_page_info($page_source);  // helper from the previous post, not defined here

    $driver->quit();
} catch (Exception $e) {
    echo "error msg : " . $e->getMessage();
}
var_dump($res_array);
Result of the run (partial):
array (size=5)
  'curr_url' => string 'https://mimvp.com/' (length=18)
  'page_source' => string '<html lang="zh-CN"><head> <meta charset="utf-8"> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> <meta http-equiv="Cache-Control" content="no-transform"> <meta http-equiv="Cache-Control" content="no-siteapp"> <meta name="viewport" content="width=device-width,initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no"> <meta name="applicable-device" content="pc,mobile"> <meta name="format-detection" content="telep'... (length=21647)
  'title' => string '米扑科技 - 简单可信赖' (length=30)
  'cookie' =>
    array (size=4)
      0 =>
        array (size=8)
          'path' => string '/' (length=1)
          'domain' => string '.mimvp.com' (length=10)
          'name' => string 'Hm_lvt_2470f08b0a4e8514a3d12a641ddcb46d' (length=39)
          'httpOnly' => boolean false
          'hCode' => int -2055227810
          'secure' => boolean false
          'value' => string '1514115610' (length=10)
          'class' => string 'org.openqa.selenium.Cookie' (length=26)
      ......
  'page_info' =>
    array (size=7)
      'site_title' => string '米扑科技 - 简单可信赖' (length=30)
      'site_description' => string '米扑科技,小而美、简而信,工匠艺术的互联网服务。' (length=72)
      'site_keywords' => string '米扑科技,米扑,mimvp.com,mimvp,米扑代理,米扑域名,米扑财富,米扑支付,米扑活动,米扑学堂,米扑博客,米扑论坛,小而美,简而信,简单可信赖' (length=175)
      'friend_link_status' => int 0
      'site_claim_status' => int 1
      'site_home_size' => int 21647
      'meta_array' =>
        array (size=24)
          'content-type' => string 'text/html; charset=utf-8' (length=24)
          'description' => string '米扑科技,小而美、简而信,工匠艺术的互联网服务。' (length=72)
          'keywords' => string '米扑科技,米扑,mimvp.com,mimvp,米扑代理,米扑域名,米扑财富,米扑支付,米扑活动,米扑学堂,米扑博客,米扑论坛,小而美,简而信,简单可信赖' (length=175)
          'author' => string '米扑科技' (length=12)
          'version' => string 'mimvp-home-1.2' (length=14)
          'copyright' => string '2009-2017 by mimvp.com' (length=22)
          'baidu_union_verify' => string '20be1643a0f542b29d54e5137bea4225' (length=32)
          'mimvp-site-verification' => string 'I-love-mimvp.com-from-20160912' (length=30)
          'baidu-site-verification' => string 'pzH9C12mmf' (length=10)
          'sogou_site_verification' => string 'QCi6brPm84' (length=10)
          '360-site-verification' => string 'd42818ef57d4f110b6c1fdf268c8cb07' (length=32)
          'shenma-site-verification' => string 'f85fa0493059ca7e6b73ad5ae44751ec_1498128383' (length=43)
          'google-site-verification' => string 'DSE-4k0kg0zlz8aGyKmZImOoTkpiIreULTsgMwNqJYE' (length=43)
          'msvalidate.01' => string '7B03EDC84171290ABCCF8E6F2DA645B1' (length=32)
          'baiduspider' => string 'index,follow' (length=12)
          'googlebot' => string 'index,follow' (length=12)
          'bingbot' => string 'index,follow' (length=12)
          'robots' => string 'index,follow' (length=12)
Screenshot saved by the run:
screenshot-2017-12-24__23:12:07__mimvp.com.png
Mimvp home page: https://mimvp.com
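The get_page_info() helper called in the example comes from the previous post and is not shown here. For readers who do not have it, the following is a minimal stand-in, not the original implementation. It assumes the page source is UTF-8 and that PHP's DOM extension is enabled, and it fills only some of the keys seen in the output above. The name sketch_page_info is hypothetical.

```php
<?php
// Hypothetical stand-in for get_page_info(): extracts the title and meta tags with DOMDocument.
function sketch_page_info(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);       // real-world HTML is rarely valid, so collect errors silently
    // The XML declaration hints libxml to read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8" ?>' . $html);
    libxml_clear_errors();

    $titles = $doc->getElementsByTagName('title');
    $title  = $titles->length > 0 ? trim($titles->item(0)->textContent) : '';

    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $node) {
        // Key each tag by name="", falling back to http-equiv="".
        $key = $node->getAttribute('name') ?: $node->getAttribute('http-equiv');
        if ($key !== '' && $node->hasAttribute('content')) {
            $meta[strtolower($key)] = $node->getAttribute('content');
        }
    }

    return [
        'site_title'       => $title,
        'site_description' => $meta['description'] ?? '',
        'site_keywords'    => $meta['keywords'] ?? '',
        'site_home_size'   => strlen($html),
        'meta_array'       => $meta,
    ];
}

$info = sketch_page_info(
    '<html><head><title>Demo</title>'
    . '<meta name="Description" content="A test page">'
    . '<meta content="a,b" name="keywords"></head><body></body></html>'
);
echo $info['site_title'], "\n";
```

Because the DOM parser reads attributes by name, it does not matter whether content comes before or after name inside a meta tag.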
PHP + Selenium + Chrome: opening and fetching a page
<?php
require_once('/opt/php-selenium/vendor/autoload.php');

// php-webdriver classes are namespaced, so import the ones used below
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

header("content-type:text/html; charset=utf-8");    // use the target page's charset if it is not utf-8

// Each assignment overrides the previous one; keep only the URL you want to test
$url = "https://mimvp.com";         // utf-8 by default
$url = "http://www.qq.com";         // gb2312 by default; needs a matching header, otherwise the output is garbled
$url = "https://www.dajie.com";     // content comes before name, so the regex mismatches (it runs from the first content to the last name)
$url = "http://dytt8.net";          // page encoding cannot be detected; use json_encode( $output ) == '' to detect garbled text
$url = 'http://esf.sh.fang.com';    // curl cannot fetch it; firefox can
$url = 'http://www.shfq.com';       // curl cannot fetch it; firefox can
$url = 'https://www.guazi.com';     // curl cannot fetch it, firefox cannot render it, only chrome can

$res_array = array();
try {
    // $driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', DesiredCapabilities::firefox());
    // putenv(ChromeDriverService::CHROME_DRIVER_EXE_PROPERTY . '=/usr/local/bin/chromedriver');
    $driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', DesiredCapabilities::chrome());
    $driver->get($url);

    $curr_url = $driver->getCurrentURL();           // current URL
    $page_source = $driver->getPageSource();        // page content
    $title = $driver->getTitle();                   // page title
    $cookie = $driver->manage()->getCookies();      // cookies

    // Save the page source
    $txt_result_filename = sprintf("../results/result_txt-%s__%s.txt", date('Y-m-d__H:i:s'), explode("/", $url)[2]);
    file_put_contents($txt_result_filename, $page_source);

    // Save a screenshot
    $save_path = sprintf("../results/screenshot-%s__%s.png", date('Y-m-d__H:i:s'), explode("/", $url)[2]);
    $driver->takeScreenshot($save_path);            // take a screenshot and save it to this path

    $res_array['curr_url'] = $curr_url;
    $res_array['page_source'] = $page_source;
    $res_array['title'] = $title;
    $res_array['page_info'] = get_page_info3($page_source);  // helper from the previous post, not defined here

    $driver->quit();
} catch (Exception $e) {
    echo "error msg : " . $e->getMessage();
}
var_dump($res_array);
?>
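The code above launches a real browser to fetch the page content (Firefox in the first example, Chrome in the second). In detail:
a) PHP + Selenium + Firefox opens the page in the Firefox browser and reads its content, which gets around the anti-crawler problem
b) RemoteWebDriver::create and the other classes are only available after require_once('/opt/php-selenium/vendor/autoload.php')
c) selenium-server listens on port 8888, so the driver connects to the matching URL 'http://localhost:8888/wd/hub'
d) $driver->getPageSource() returns the page content, $driver->getTitle() returns the page title, and so on; see WebDriver.php
e) webdriver can take a screenshot of the page with $driver->takeScreenshot($save_as = 'xxx.png');
f) When webdriver has finished scraping, close the browser with $driver->quit(); otherwise it stays open
webdriver has many more functions, for example for selecting and filtering nodes and elements; see WebDriver.php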
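As an illustration of element selection, the sketch below reads the description meta tag and collects link targets through findElements() and WebDriverBy instead of parsing the page source. It assumes the namespaced facebook/webdriver classes installed in step 1, a selenium-server on port 8888, and Chrome. The two helper function names are hypothetical. The live part only runs when the autoload file from step 1 exists.

```php
<?php
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;

// Reads <meta name="description" content="..."> from the page currently loaded in $driver.
function read_meta_description(RemoteWebDriver $driver): string
{
    $nodes = $driver->findElements(WebDriverBy::cssSelector('meta[name="description"]'));
    return count($nodes) > 0 ? (string) $nodes[0]->getAttribute('content') : '';
}

// Collects the href of every <a> element on the current page.
function read_link_targets(RemoteWebDriver $driver): array
{
    $hrefs = [];
    foreach ($driver->findElements(WebDriverBy::tagName('a')) as $link) {
        $href = $link->getAttribute('href');
        if ($href !== null && $href !== '') {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

$autoload = '/opt/php-selenium/vendor/autoload.php';   // path from step 1
if (is_file($autoload)) {
    require_once $autoload;
    $driver = null;
    try {
        $driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', DesiredCapabilities::chrome());
        $driver->get('https://mimvp.com');
        echo read_meta_description($driver), "\n";
        echo count(read_link_targets($driver)), " links\n";
    } catch (Exception $e) {
        echo "error msg : " . $e->getMessage(), "\n";
    } finally {
        if ($driver !== null) {
            $driver->quit();    // always close the browser, even after an error
        }
    }
}
```

findElements() returns an empty array when nothing matches, whereas findElement() throws an exception, so the plural form is the safer choice when a tag may be missing.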
Problems and solutions
Problem 1:
Command: java -jar selenium-server-standalone-3.8.0.jar -port 8888
Error message: Error: GDK_BACKEND does not match available displays
or
Command:
/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &
DISPLAY=:1 java -jar selenium-server-standalone-3.8.0.jar -port 8888
Error message: Error: cannot open display: :1
Cause:
There is no display for the browser window to open on. In the second case Xvfb was started on display :7 while DISPLAY pointed at :1. See "How do I run Selenium in Xvfb".
Fix:
DISPLAY=:1 xvfb-run java -jar selenium-server-standalone-3.8.0.jar -port 8888
or
/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &
export DISPLAY=:7
java -jar selenium-server-standalone-3.8.0.jar -port 8888
With this in place, both firefox (Mozilla Firefox 57.0.3) and chrome (Google Chrome 63.0.3239.108) open successfully on the virtual display.
The complete startup script:
sudo vim /etc/init.d/xvfb
#! /bin/bash
if [ -z "$1" ]; then
    echo "`basename $0` {start|stop}"
    exit
fi
case "$1" in
start)
    #/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &
    DISPLAY=:1 /usr/bin/xvfb-run java -jar selenium-server-standalone-3.8.0.jar -port 8888
    ;;
stop)
    killall Xvfb
    ;;
esac
Enable it at boot:
sudo chkconfig xvfb on
Problem 2:
Ubuntu 14.04 has no chkconfig, so the service cannot be enabled at boot this way
Solution:
Install sysv-rc-conf
# sudo apt-get install sysv-rc-conf
Make it available as chkconfig
# sudo cp /usr/sbin/sysv-rc-conf /usr/sbin/chkconfig
Test chkconfig
# chkconfig --list | grep xvfb
xvfb 2:on 3:on 4:on 5:on
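Summary
Fetching pages in PHP: curl plus webdriver + selenium can fetch roughly 95% of pages
Parsing pages in PHP: regular expressions plus HTML DOM parsing can extract essentially any content from a page (see the Mimvp blog)
Fixing garbled text in PHP: the Mimvp blog has described four approaches that handle roughly 99% of encoding problems (see the Mimvp blog)
That leaves about 5% of pages that cannot be scraped. The cause is mainly blocking and anti-crawler measures added by backend engineers, not that the pages cannot be opened
Getting past that last 5% is where proxies come in; Mimvp Proxy (米扑代理) is the one recommended here
Mimvp scrapes a large number of pages every day, has built up a lot of experience with anti-crawler measures, relies on many proxies itself, and offers those proxies for public use
See Mimvp Proxy: https://proxy.mimvp.com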
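For reference, a proxy is normally handed to the browser through the WebDriver "proxy" capability. The sketch below only builds that capability array; the proxy address 127.0.0.1:3128 is a placeholder rather than a real Mimvp endpoint, and the function name is hypothetical. To my knowledge RemoteWebDriver::create() accepts a plain array of capabilities as well as a DesiredCapabilities object, but treat that as an assumption to verify against the installed version.

```php
<?php
// Builds a capabilities array that routes HTTP and HTTPS traffic through one manual proxy.
function capabilities_with_proxy(string $browserName, string $proxyHostPort): array
{
    return [
        'browserName' => $browserName,
        'proxy' => [
            'proxyType' => 'manual',
            'httpProxy' => $proxyHostPort,
            'sslProxy'  => $proxyHostPort,
        ],
    ];
}

$caps = capabilities_with_proxy('chrome', '127.0.0.1:3128');   // placeholder proxy address
// With php-webdriver loaded, the array would be passed on like this:
// $driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', $caps);
echo json_encode($caps), "\n";
```

Rotating through several proxy addresses then amounts to calling the function with a different host:port for each new browser session.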
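Recommended reading:
PHP: getting a page's title, description, keywords and other meta information
Python + Selenium2 + Chrome: scraping web pages
Scraping Taobao pages with selenium + php-webdriver
PHP paths explained: dirname, realpath, __FILE__, getcwd
PHP file inclusion: differences between require, require_once, include, include_once
Copyright: this post was written, reposted, excerpted, or revised and published by the Mimvp blog. Last updated 2019-12-07 18:21:09
Infringement: this is a personal, non-profit blog. If it infringes your rights, please contact the author to have the content removed. No malicious claims or demands for money, please. Thank you!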