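In the previous post, "PHP: Getting a Web Page's Title, Description, Keywords and Other Meta Information", the Mimvp blog summarized common ways to extract web page content. Regular expressions handle roughly 90% of page-scraping problems. The remaining 10% includes pages protected by anti-crawler techniques, where the page content cannot be fetched at all and so there is nothing to run a regular expression against. This post addresses half of that remaining 10% (5 percentage points) with PHP + Selenium + WebDriver.

Even after reading this post, you can only solve about 95% of page-scraping problems. The last 5% is covered in the summary at the end of this post, where Mimvp Proxy offers a solution.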

 

PHP + Selenium + WebDriver: Setting Up the Development Environment

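As shown in the figure above, the execution order is: PHP -> PHP-Webdriver -> Selenium -> Firefox / Chrome

PHP uses PHP-Webdriver to send commands to Selenium, and Selenium in turn drives the browser (Firefox / Chrome).

PHP-Webdriver is a Selenium client library maintained by Facebook. It lets PHP communicate with Selenium and can be installed with composer.

PHP-Webdriver home page: https://github.com/facebook/php-webdriver (github)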
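
To make the chain above a little more concrete: with Selenium 3, the client talks to Selenium over plain HTTP with JSON bodies. The block below is a minimal sketch of the request that opens a browser session, not php-webdriver's actual code. The function name build_new_session_request is hypothetical, and the hub URL assumes Selenium's default port 4444.

```php
<?php
// Hypothetical helper (not part of php-webdriver): describes the HTTP request
// a WebDriver client sends to Selenium to open a new browser session.
function build_new_session_request(string $hubUrl, string $browserName): array
{
    return [
        'method' => 'POST',
        'url'    => rtrim($hubUrl, '/') . '/session',
        // JSON Wire Protocol body: the browser is requested via desiredCapabilities.
        'body'   => json_encode(
            ['desiredCapabilities' => ['browserName' => $browserName]],
            JSON_UNESCAPED_SLASHES
        ),
    ];
}

// Assumed hub URL; adjust host and port to your own Selenium server.
$request = build_new_session_request('http://localhost:4444/wd/hub', 'firefox');
echo $request['method'], ' ', $request['url'], "\n", $request['body'], "\n";
```

PHP-Webdriver builds and sends requests of this kind for you, so a script normally only needs the hub URL and the browser name.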

 

0. System environment

Mac OS 10.13.2, Ubuntu 14.04, CentOS 7.2 (all configured successfully)

PHP 5.6.30 and PHP 7.2.0

selenium-server-standalone-3.4.0.jar

firefox geckodriver (github)

php-webdriver (facebook)

chromedriver_mac64.zip v2.30 (taobao mirror)

chrome-mac_v65.0.3304.0.zip (chromium, which is not the same as Google Chrome)

chromium home page: http://www.chromium.org/chromium-os

Older firefox releases: http://ftp.mozilla.org/pub/firefox/releases/ (recommended)

 

1) Mac OS system environment

 

 

2) Ubuntu 14.04 system environment

selenium-server-standalone-3.8.0.jar

$ sudo apt-get -y install firefox
$ sudo apt-get -y install google-chrome-stable
$ sudo apt-get -y install chromium-browser
$
$ firefox -v
Mozilla Firefox 57.0.3
$
$ geckodriver -v
1514872602447   geckodriver     INFO    geckodriver 0.19.0
1514872602448   webdriver::httpapi      DEBUG   Creating routes
1514872602459   geckodriver     INFO    Listening on 127.0.0.1:4444
$
$ chrome -version
Google Chrome 63.0.3239.108 
$
$ chromedriver -v
ChromeDriver 2.34.522913 (36222509aa6e819815938cbf2709b4849735537c)

Starting selenium on Ubuntu 14.04

Current working approach (xvfb-run):
DISPLAY=:1 xvfb-run java -jar selenium-server-standalone-3.8.0.jar -port 8888

Older working approach (start Xvfb manually, then point DISPLAY at it):
/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &
export DISPLAY=:7
java -jar selenium-server-standalone-3.8.0.jar -port 8888

Test URL:
http://localhost:8888/wd/hub

 

Mapping between chromedriver versions and Chrome versions for selenium

chromedriver version    Supported Chrome versions
v2.34 v61-63
v2.33 v60-62
v2.32 v59-61
v2.31 v58-60
v2.30 v58-60
v2.29 v56-58
v2.28 v55-57
v2.27 v54-56
v2.26 v53-55
v2.25 v53-55
v2.24 v52-54
v2.23 v51-53
v2.22 v49-52
v2.21 v46-50
v2.20 v43-48
v2.19 v43-47
v2.18 v43-46
v2.17 v42-43
v2.16 v42-45
v2.15 v40-43
v2.14 v39-42
v2.13 v38-41
v2.12 v36-40
v2.11 v36-40
v2.10 v33-36
v2.9 v31-34
v2.8 v30-33
v2.7 v30-33
v2.6 v29-32
v2.5 v29-32
v2.4 v29-32
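
When several machines run different Chrome versions, the table can be turned into a small lookup. The sketch below copies only the top six rows of the table above; the constant name CHROMEDRIVER_SUPPORT and the function name compatible_chromedrivers are made up for illustration, and the code assumes PHP 7 or later.

```php
<?php
// Subset of the mapping table above:
// chromedriver version => [lowest, highest] supported Chrome major version.
const CHROMEDRIVER_SUPPORT = [
    '2.34' => [61, 63],
    '2.33' => [60, 62],
    '2.32' => [59, 61],
    '2.31' => [58, 60],
    '2.30' => [58, 60],
    '2.29' => [56, 58],
];

// Returns every chromedriver version in the table whose range covers $chromeMajor,
// in table order (newest first). Returns an empty array when nothing matches.
function compatible_chromedrivers(int $chromeMajor): array
{
    $matches = [];
    foreach (CHROMEDRIVER_SUPPORT as $driver => $range) {
        if ($chromeMajor >= $range[0] && $chromeMajor <= $range[1]) {
            $matches[] = $driver;
        }
    }
    return $matches;
}

// Chrome 63 is the version installed on Ubuntu earlier in this post.
print_r(compatible_chromedrivers(63));
```

An empty result means the table has no matching chromedriver, in which case the browser or the driver needs to be changed to a pair that the table lists.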

 

Browsers supported by selenium + php-webdriver (see WebDriverBrowserType.php):

    const FIREFOX = 'firefox';
    const FIREFOX_PROXY = 'firefoxproxy';
    const FIREFOX_CHROME = 'firefoxchrome';
    const GOOGLECHROME = 'googlechrome';
    const SAFARI = 'safari';
    const SAFARI_PROXY = 'safariproxy';
    const OPERA = 'opera';
    const MICROSOFT_EDGE = 'MicrosoftEdge';
    const IEXPLORE = 'iexplore';
    const IEXPLORE_PROXY = 'iexploreproxy';
    const CHROME = 'chrome';
    const KONQUEROR = 'konqueror';
    const MOCK = 'mock';
    const IE_HTA = 'iehta';
    const ANDROID = 'android';
    const HTMLUNIT = 'htmlunit';
    const IE = 'internet explorer';
    const IPHONE = 'iphone';
    const IPAD = 'iPad';
    const PHANTOMJS = 'phantomjs';

 

1. Install composer and selenium

1) Create the installation directory

sudo mkdir /opt/php-selenium
sudo chown -R homer:staff /opt/php-selenium
cd /opt/php-selenium/

 

2) Download composer.phar

curl -sS https://getcomposer.org/installer | php

 

3) Create composer.json

composer.phar needs a composer.json file before it can install anything

vim composer.json

Add the following content

{
  "require": {
    "facebook/webdriver": "dev-master",
    "phpunit/phpunit": "*"
  }
}

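This installs webdriver and phpunit: the former connects to browsers such as chrome and firefox, the latter is a PHP testing tool

If the installation prints output like the following, it succeeded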

$ composer.phar install
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 26 installs, 0 updates, 0 removals
  - Installing facebook/webdriver (dev-master 575600d): Cloning 575600dfcf from cache
  - Installing symfony/yaml (v3.4.2): Loading from cache
  - Installing sebastian/version (2.0.1): Loading from cache
  - Installing sebastian/resource-operations (1.0.0): Loading from cache
  - Installing sebastian/recursion-context (2.0.0): Loading from cache
  - Installing sebastian/object-enumerator (2.0.1): Loading from cache
  - Installing sebastian/global-state (1.1.1): Loading from cache
  - Installing sebastian/exporter (2.0.0): Loading from cache
  - Installing sebastian/environment (2.0.0): Loading from cache
  - Installing sebastian/diff (1.4.3): Loading from cache
  - Installing sebastian/comparator (1.2.4): Loading from cache
  - Installing doctrine/instantiator (1.0.5): Loading from cache
  - Installing phpunit/php-text-template (1.2.1): Loading from cache
  - Installing phpunit/phpunit-mock-objects (3.4.4): Loading from cache
  - Installing phpunit/php-timer (1.0.9): Loading from cache
  - Installing phpunit/php-file-iterator (1.4.5): Loading from cache
  - Installing sebastian/code-unit-reverse-lookup (1.0.1): Loading from cache
  - Installing phpunit/php-token-stream (1.4.12): Loading from cache
  - Installing phpunit/php-code-coverage (4.0.8): Loading from cache
  - Installing webmozart/assert (1.2.0): Loading from cache
  - Installing phpdocumentor/reflection-common (1.0.1): Loading from cache
  - Installing phpdocumentor/type-resolver (0.4.0): Loading from cache
  - Installing phpdocumentor/reflection-docblock (3.3.2): Loading from cache
  - Installing phpspec/prophecy (1.7.3): Loading from cache
  - Installing myclabs/deep-copy (1.7.0): Loading from cache
  - Installing phpunit/phpunit (5.7.26): Loading from cache
symfony/yaml suggests installing symfony/console (For validating YAML files using the lint command)
sebastian/global-state suggests installing ext-uopz (*)
phpunit/phpunit suggests installing phpunit/php-invoker (~1.1)
Writing lock file
Generating autoload files

 

4) Inspect the files generated by the installation

$ ll /opt/php-selenium
-rw-r--r--   1 homer  staff      102 12 24 18:31 composer.json
-rw-r--r--   1 homer  staff    48580 12 24 18:33 composer.lock
-rwxr-xr-x   1 homer  staff  1855013 12 24 18:26 composer.phar
drwxr-xr-x  14 homer  staff      476 12 24 18:33 vendor

As shown, composer automatically creates a vendor directory with the following contents:

$ ll /opt/php-selenium/vendor/
-rw-r--r--   1 homer  staff  178 12 24 18:33 autoload.php
drwxr-xr-x   3 homer  staff  102 12 24 18:33 bin
drwxr-xr-x  11 homer  staff  374 12 24 18:33 composer
drwxr-xr-x   3 homer  staff  102 12 24 18:33 doctrine
drwxr-xr-x   3 homer  staff  102 12 24 18:33 facebook
drwxr-xr-x   3 homer  staff  102 12 24 18:33 myclabs
drwxr-xr-x   5 homer  staff  170 12 24 18:33 phpdocumentor
drwxr-xr-x   3 homer  staff  102 12 24 18:33 phpspec
drwxr-xr-x   9 homer  staff  306 12 24 18:33 phpunit
drwxr-xr-x  12 homer  staff  408 12 24 18:33 sebastian
drwxr-xr-x   3 homer  staff  102 12 24 18:33 symfony
drwxr-xr-x   3 homer  staff  102 12 24 18:33 webmozart

Note the generated PHP file vendor/autoload.php

This file is essential: the PHP scripts below must include it in order to use webdriver

 

 

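2. Download selenium-server

selenium-server download page: http://selenium-release.storage.googleapis.com/index.html

Download a relatively old version; Mimvp used selenium-server-standalone-3.4.0.jar

Start the selenium-server service with the command below (it runs in the foreground; append & to run it in the background):

java -jar selenium-server-standalone-3.4.0.jar 

It listens on port 4444 by default; open the following URL to check it:

http://localhost:4444/wd/hub

To use a custom port, run:

java -jar selenium-server-standalone-3.4.0.jar -port 8888

The output looks like this: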

$ java -jar selenium-server-standalone-3.4.0.jar -port 8888
19:08:20.176 INFO - Selenium build info: version: '3.4.0', revision: 'unknown'
19:08:20.177 INFO - Launching a standalone Selenium Server
2017-12-24 19:08:20.205:INFO::main: Logging initialized @273ms to org.seleniumhq.jetty9.util.log.StdErrLog
19:08:20.262 INFO - Driver provider org.openqa.selenium.ie.InternetExplorerDriver registration is skipped:
 registration capabilities Capabilities [{ensureCleanSession=true, browserName=internet explorer, version=, platform=WINDOWS}] does not match the current platform MAC
19:08:20.263 INFO - Driver provider org.openqa.selenium.edge.EdgeDriver registration is skipped:
 registration capabilities Capabilities [{browserName=MicrosoftEdge, version=, platform=WINDOWS}] does not match the current platform MAC
19:08:20.263 INFO - Driver class not found: com.opera.core.systems.OperaDriver
19:08:20.263 INFO - Driver provider com.opera.core.systems.OperaDriver registration is skipped:
Unable to create new instances on this machine.
19:08:20.263 INFO - Driver class not found: com.opera.core.systems.OperaDriver
19:08:20.263 INFO - Driver provider com.opera.core.systems.OperaDriver is not registered
2017-12-24 19:08:20.314:INFO:osjs.Server:main: jetty-9.4.3.v20170317
2017-12-24 19:08:20.358:INFO:osjsh.ContextHandler:main: Started o.s.j.s.ServletContextHandler@d8355a8{/,null,AVAILABLE}
2017-12-24 19:08:20.404:INFO:osjs.AbstractConnector:main: Started ServerConnector@146ba0ac{HTTP/1.1,[http/1.1]}{0.0.0.0:8888}
2017-12-24 19:08:20.405:INFO:osjs.Server:main: Started @473ms
19:08:20.405 INFO - Selenium Server is up and running

For more options, run:

java -jar selenium-server-standalone-3.4.0.jar --help
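
Before a scraping script creates a driver, it can be useful to check that the server is actually listening. The following is a rough sketch that requests the status endpoint under the hub URL and treats any JSON reply as "up". The function name selenium_hub_is_up is hypothetical, port 8888 is the custom port used in this post, and the exact fields in the status reply differ between Selenium versions, so the sketch does not inspect them.

```php
<?php
// Hypothetical helper: returns true when <hub>/status answers with a JSON document,
// false when the connection fails or the reply is not JSON.
function selenium_hub_is_up(string $hubUrl, float $timeoutSeconds = 2.0): bool
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds, 'ignore_errors' => true],
    ]);
    // @ hides the warning PHP raises when the connection is refused.
    $body = @file_get_contents(rtrim($hubUrl, '/') . '/status', false, $context);
    if ($body === false) {
        return false;
    }
    return is_array(json_decode($body, true));
}

// Assumed hub URL; prints bool(true) only if a server is listening there.
var_dump(selenium_hub_is_up('http://localhost:8888/wd/hub'));
```

A script could call this once at startup and exit with a clear message instead of failing later inside RemoteWebDriver::create.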

 

 

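3. Install firefox and chrome

selenium supports firefox, chrome and several other browsers. This post covers firefox and chrome; pick one of the two

1) Install firefox

Mimvp used firefox-45.0.2.tar.bz2 together with selenium-server-standalone-3.4.0.jar

Extract firefox-45.0.2.tar.bz2, then symlink it into a bin directory

ln -s /Users/homer/Downloads/myApp/firefox /usr/local/bin/firefox

On Mac OS the default firefox installation path is:

$ ps -ef | grep firefox
 /Applications/Firefox.app/Contents/MacOS/firefox

firefox versions are tied to selenium versions, so avoid upgrading firefox casually; otherwise selenium may report that it cannot find firefox

 

2) Install chrome (recommended)

a) chromedriver download page: http://npm.taobao.org/mirrors/chromedriver/ (taobao mirror)

Download chromedriver_mac64.zip v2.30

b) After extracting, the file is named chromedriver

c) Copy it into a bin directory:

cp /opt/php-selenium/chromedriver /usr/local/bin/chromedriver

d) Grant it permission to run (important: it took a long time to track this problem down)

Mac OS does not trust software that was not downloaded from the App Store, so chromedriver has to be authorized explicitly

Mac OS System Preferences -> Security & Privacy -> General -> enter your password and allow access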

 

 

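4. PHP + Selenium + Firefox scraping example

1) First, start selenium-server

To avoid exposing the well-known default port 4444, you can specify another port such as 8888

java -jar selenium-server-standalone-3.4.0.jar -port 8888

 

2) Include the PHP file generated in step 1.4)

require_once('/opt/php-selenium/vendor/autoload.php');

 

3) Scrape the Mimvp home page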

<?php
require_once('/opt/php-selenium/vendor/autoload.php');

// php-webdriver classes live in the Facebook\WebDriver namespace
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

header("content-type:text/html; charset=utf-8");

$url = "https://mimvp.com";		// utf-8 by default

$res_array = array();
try {
	$driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', DesiredCapabilities::firefox());
	$driver->get($url);
	
	$curr_url = $driver->getCurrentURL();			// current URL
	$page_source = $driver->getPageSource();			// page content
	$title = $driver->getTitle();					// page title
	$cookie = $driver->manage()->getCookies();		// cookies
	$driver->takeScreenshot('../results/screenshot-'.date('Y-m-d__H:i:s').'.png');		// take a screenshot and save it to this path (../results must already exist)
	
	$res_array['curr_url'] = $curr_url;
	$res_array['page_source'] = $page_source;
	$res_array['title'] = $title;
	$res_array['cookie'] = $cookie;
	// get_page_info() is not defined in this post; presumably it is the meta-parsing helper from the previous post
	$res_array['page_info'] = get_page_info($page_source);
	
	$driver->quit();
} catch (Exception $e) {
	echo "error msg : " . $e->getMessage();
}
var_dump($res_array);

Output (partial):

array (size=5)
  'curr_url' => string 'https://mimvp.com/' (length=18)
  'page_source' => string '<html lang="zh-CN"><head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta http-equiv="Cache-Control" content="no-transform"> 
<meta http-equiv="Cache-Control" content="no-siteapp">
<meta name="viewport" content="width=device-width,initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no">
<meta name="applicable-device" content="pc,mobile">
<meta name="format-detection" content="telep'... (length=21647)
  'title' => string '米扑科技 - 简单可信赖' (length=30)
  'cookie' => 
    array (size=4)
      0 => 
        array (size=8)
          'path' => string '/' (length=1)
          'domain' => string '.mimvp.com' (length=10)
          'name' => string 'Hm_lvt_2470f08b0a4e8514a3d12a641ddcb46d' (length=39)
          'httpOnly' => boolean false
          'hCode' => int -2055227810
          'secure' => boolean false
          'value' => string '1514115610' (length=10)
          'class' => string 'org.openqa.selenium.Cookie' (length=26)
      ......
  'page_info' => 
    array (size=7)
      'site_title' => string '米扑科技 - 简单可信赖' (length=30)
      'site_description' => string '米扑科技,小而美、简而信,工匠艺术的互联网服务。' (length=72)
      'site_keywords' => string '米扑科技,米扑,mimvp.com,mimvp,米扑代理,米扑域名,米扑财富,米扑支付,米扑活动,米扑学堂,米扑博客,米扑论坛,小而美,简而信,简单可信赖' (length=175)
      'friend_link_status' => int 0
      'site_claim_status' => int 1
      'site_home_size' => int 21647
      'meta_array' => 
        array (size=24)
          'content-type' => string 'text/html; charset=utf-8' (length=24)
          'description' => string '米扑科技,小而美、简而信,工匠艺术的互联网服务。' (length=72)
          'keywords' => string '米扑科技,米扑,mimvp.com,mimvp,米扑代理,米扑域名,米扑财富,米扑支付,米扑活动,米扑学堂,米扑博客,米扑论坛,小而美,简而信,简单可信赖' (length=175)
          'author' => string '米扑科技' (length=12)
          'version' => string 'mimvp-home-1.2' (length=14)
          'copyright' => string '2009-2017 by mimvp.com' (length=22)
          'baidu_union_verify' => string '20be1643a0f542b29d54e5137bea4225' (length=32)
          'mimvp-site-verification' => string 'I-love-mimvp.com-from-20160912' (length=30)
          'baidu-site-verification' => string 'pzH9C12mmf' (length=10)
          'sogou_site_verification' => string 'QCi6brPm84' (length=10)
          '360-site-verification' => string 'd42818ef57d4f110b6c1fdf268c8cb07' (length=32)
          'shenma-site-verification' => string 'f85fa0493059ca7e6b73ad5ae44751ec_1498128383' (length=43)
          'google-site-verification' => string 'DSE-4k0kg0zlz8aGyKmZImOoTkpiIreULTsgMwNqJYE' (length=43)
          'msvalidate.01' => string '7B03EDC84171290ABCCF8E6F2DA645B1' (length=32)
          'baiduspider' => string 'index,follow' (length=12)
          'googlebot' => string 'index,follow' (length=12)
          'bingbot' => string 'index,follow' (length=12)
          'robots' => string 'index,follow' (length=12)

 

Screenshot saved by the run:

screenshot-2017-12-24__23:12:07__mimvp.com.png

Mimvp home page: https://mimvp.com

 

php + selenium + chrome: opening and fetching a page

<?php
require_once('/opt/php-selenium/vendor/autoload.php');

// php-webdriver classes live in the Facebook\WebDriver namespace
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

header("content-type:text/html; charset=utf-8");

// Test URLs. Each assignment overwrites the previous one, so only the last one is used.
$url = "https://mimvp.com";			// utf-8 by default
$url = "http://www.qq.com";			// gb2312 by default; needs a matching header, otherwise the text is garbled
$url = "https://www.dajie.com";		// content comes before name in its meta tags, so the regex mismatches (it spans from the first content to the last name)
$url = "http://dytt8.net";			// page encoding cannot be detected, so garbled text is detected with json_encode( $output ) == ''
$url = 'http://esf.sh.fang.com';		// curl cannot fetch it; firefox can
$url = 'http://www.shfq.com';		// curl cannot fetch it; firefox can
$url = 'https://www.guazi.com';		// curl cannot fetch it and firefox cannot render it; only chrome works


$res_array = array();
try {
// 	$driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', DesiredCapabilities::firefox());
	
// 	putenv(ChromeDriverService::CHROME_DRIVER_EXE_PROPERTY . '=/usr/local/bin/chromedriver');
	$driver = RemoteWebDriver::create('http://localhost:8888/wd/hub', DesiredCapabilities::chrome());

	$driver->get($url);
	
	$curr_url = $driver->getCurrentURL();			// current URL
	$page_source = $driver->getPageSource();			// page content
	$title = $driver->getTitle();					// page title
	$cookie = $driver->manage()->getCookies();		// cookies
	
	// Save the page source (../results must already exist)
	$txt_result_filename = sprintf("../results/result_txt-%s__%s.txt", date('Y-m-d__H:i:s'), explode("/", $url)[2]);
	file_put_contents($txt_result_filename, $page_source);
	
	// Save a screenshot
	$save_path = sprintf("../results/screenshot-%s__%s.png", date('Y-m-d__H:i:s'), explode("/", $url)[2]);
	$driver->takeScreenshot($save_path);		// take a screenshot and save it to this path
	
	$res_array['curr_url'] = $curr_url;
	$res_array['page_source'] = $page_source;
	$res_array['title'] = $title;
	// get_page_info3() is not defined in this post; presumably it is a meta-parsing helper from the previous post
	$res_array['page_info'] = get_page_info3($page_source);
	
	$driver->quit();
} catch (Exception $e) {
	echo "error msg : " . $e->getMessage();
}
var_dump($res_array);
?>

 

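The code above launches a browser (Firefox in the first script, Chrome in the second) to fetch page content. In detail:

a) PHP + Selenium + Firefox opens the page in a real Firefox browser and grabs its content, which gets around anti-crawler checks

b) RemoteWebDriver::create and the other classes are only available after require_once('/opt/php-selenium/vendor/autoload.php')

c) selenium-server was started on port 8888, so the driver also connects to 'http://localhost:8888/wd/hub'

d) $driver->getPageSource() returns the page content, $driver->getTitle() returns the page title, and so on; see WebDriver.php

e) webdriver can take a screenshot of the page with $driver->takeScreenshot($save_as = 'xxx.png');

f) once webdriver has finished fetching the page it must close the browser with $driver->quit(); otherwise the browser stays open

webdriver has many more functions, such as selecting and filtering nodes and elements; see WebDriver.php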
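
The scripts above build their output file names with explode("/", $url)[2], which breaks when the URL has no scheme, and they put colons in the timestamp, which some file systems reject. One possible alternative is sketched below. The function name result_filename is hypothetical, and using parse_url, gmdate (UTC) and hyphens in the time part are choices made for this sketch, not what the original scripts do.

```php
<?php
// Hypothetical helper: builds "<prefix>-<UTC timestamp>__<host>.<ext>" for a scraped URL.
function result_filename(string $url, string $prefix, string $ext, ?int $timestamp = null): string
{
    // parse_url returns null for the host when the URL has no scheme (e.g. "mimvp.com").
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        $host = 'unknown-host';
    }
    // Replace anything that is not a letter, digit, dot or hyphen.
    $host = preg_replace('/[^A-Za-z0-9.\-]/', '_', $host);
    // Hyphens instead of colons in the time part; UTC so the name does not depend on the server time zone.
    $stamp = gmdate('Y-m-d__H-i-s', $timestamp ?? time());
    return sprintf('%s-%s__%s.%s', $prefix, $stamp, $host, $ext);
}

echo result_filename('https://mimvp.com/', 'screenshot', 'png'), "\n";
```

The result can be appended to a directory such as ../results/ and passed to $driver->takeScreenshot() or file_put_contents().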

 

 

Problems and solutions

Problem 1:

Command: java -jar selenium-server-standalone-3.8.0.jar -port 8888

Error: Error: GDK_BACKEND does not match available displays

Command:

/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &
DISPLAY=:1 java -jar selenium-server-standalone-3.8.0.jar -port 8888

Error: Error: cannot open display: :1

Cause:

There is no display for the browser to draw on. In the second attempt, Xvfb was started on display :7 while DISPLAY pointed at :1. See "How do I run Selenium in Xvfb"

Fix (either of the following):

DISPLAY=:1 xvfb-run java -jar selenium-server-standalone-3.8.0.jar -port 8888

/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &
export  DISPLAY=:7
java -jar selenium-server-standalone-3.8.0.jar -port 8888

With this fix, firefox (Mozilla Firefox 57.0.3) and chrome (Google Chrome 63.0.3239.108) both open successfully in the virtual display

The complete startup script is as follows:

sudo vim /etc/init.d/xvfb

#! /bin/bash  
if [ -z "$1" ]; then   
    echo "`basename $0` {start|stop}"  
    exit  
fi  

case "$1" in  
start)  
#/usr/bin/Xvfb :7 -ac -screen 0 1024x768x8 &  
DISPLAY=:1 /usr/bin/xvfb-run java -jar selenium-server-standalone-3.8.0.jar -port 8888
;;  
stop)  
killall Xvfb  
;;  
esac  

Enable start at boot:

sudo chkconfig xvfb on

 

Problem 2:

Ubuntu 14.04 has no chkconfig, so start at boot cannot be enabled that way

Solution:

Install sysv-rc-conf
# sudo apt-get install sysv-rc-conf

Make it available under the name chkconfig
# sudo cp /usr/sbin/sysv-rc-conf /usr/sbin/chkconfig

Test chkconfig 
# chkconfig --list | grep xvfb
xvfb         2:on       3:on    4:on    5:on

 

 

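Summary

Fetching pages in PHP: curl plus webdriver + selenium can fetch roughly 95% of web pages

Parsing pages in PHP: regular expressions and HTML DOM parsing can extract practically any content from a page (see the Mimvp blog)

Fixing garbled text in PHP: the Mimvp blog has already published four solutions, which handle about 99% of encoding problems (see the Mimvp blog)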
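
As a reminder of what such an encoding fix can look like, here is one common approach. It is not necessarily one of the four solutions from the Mimvp blog. The function names detect_declared_charset and html_to_utf8 are hypothetical, the sketch assumes the iconv extension is available, and it treats gb2312 as gbk because gbk is a superset. It mainly matters for HTML fetched with curl; page source returned by Selenium has typically already been decoded by the browser.

```php
<?php
// Hypothetical helper: returns the charset declared in a <meta> tag, lower-cased, or null.
function detect_declared_charset(string $html): ?string
{
    // Matches <meta charset="gbk"> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Hypothetical helper: converts HTML to UTF-8 based on its declared charset.
// Returns the input unchanged when no charset is declared, when it is already UTF-8,
// or when iconv cannot perform the conversion.
function html_to_utf8(string $html): string
{
    $charset = detect_declared_charset($html);
    if ($charset === null || $charset === 'utf-8' || $charset === 'utf8') {
        return $html;
    }
    if ($charset === 'gb2312') {
        $charset = 'gbk'; // assumption: pages declaring gb2312 often use gbk characters
    }
    // @ hides the warning iconv raises for unknown charsets or invalid bytes.
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

// Two Chinese characters encoded as GB2312 bytes, inside a page that declares gb2312.
$gbPage = "<meta charset=\"gb2312\"><title>\xC4\xE3\xBA\xC3</title>";
echo html_to_utf8($gbPage), "\n";
```

After the conversion, the header("content-type:text/html; charset=utf-8") used in the scripts above matches the actual bytes being printed.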

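Finally, about 5% of pages still cannot be scraped. This is mainly due to blocking and anti-crawler measures put in place by backend engineers, not because the pages cannot be opened

To get past the anti-crawler measures behind that last 5%, proxies are needed, and Mimvp Proxy is recommended here

Mimvp Technology scrapes a large number of pages every day, has built up extensive experience with anti-crawler measures, relies on a large pool of proxies, and has opened those proxies to the public

See Mimvp Proxy: https://proxy.mimvp.com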

 

 

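Recommended references

PHP: Getting a Web Page's Title, Description, Keywords and Other Meta Information

Python + Selenium2: Setting Up an Automated Testing Environment

Running selenium headless on Linux with Xvfb

Python + Selenium2 + Chrome: Scraping Web Pages

Scraping Taobao pages with selenium + php-webdriver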