In 米扑科技's crawler projects we recently noticed that, when using Selenium, the driver is quite slow to open a browser; we tried both Firefox and Chrome and both felt slow. Opening a browser by hand and visiting 米扑代理, the page is fully rendered in under 1 second, while opening it through the driver usually takes 5-10 seconds, which is rather inefficient. An alternative is HtmlUnitDriver: it is based on HtmlUnit, simulates a browser in pure Java, and is very fast.
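
To make the speed claim easy to check in your own environment, here is a minimal timing sketch (the class name LoadTimeDemo is ours; the URL is the same one used in the examples below):

import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class LoadTimeDemo {
	public static void main(String[] args) {
		HtmlUnitDriver driver = new HtmlUnitDriver(true);	// enable javascript
		long start = System.currentTimeMillis();
		driver.get("http://mimvp.com");						// headless page load
		long elapsed = System.currentTimeMillis() - start;
		System.out.println("Loaded '" + driver.getTitle() + "' in " + elapsed + " ms");
		driver.quit();										// release the underlying WebClient
	}
}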

As the project wiki puts it: "This is currently the fastest and most lightweight implementation of WebDriver. As the name suggests, this is based on HtmlUnit."

Advantages:

HtmlUnitDriver does not actually open a browser, so it runs very fast. It can be used to avoid the slow execution you get when driving Firefox or other real browsers for automated testing.

Disadvantages:

Its JavaScript support is not very good; on pages with complex JavaScript it often fails to locate page elements (one way to soften this is sketched right below).
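
A sketch of one mitigation (our addition, not from the original post): enable JavaScript and give the page time to settle with an implicit or explicit wait before locating elements. The By.tagName("body") locator is only a placeholder; substitute the element you actually need.

import java.util.concurrent.TimeUnit;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitDemo {
	public static void main(String[] args) {
		HtmlUnitDriver driver = new HtmlUnitDriver(true);					// enable javascript
		driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);		// poll up to 5s in findElement
		driver.get("http://mimvp.com");

		// Explicit wait: block until the element is present instead of failing
		// immediately on content that JavaScript has not finished rendering yet.
		WebElement body = new WebDriverWait(driver, 10)
				.until(ExpectedConditions.presenceOfElementLocated(By.tagName("body")));
		System.out.println(body.getTagName());
		driver.quit();
	}
}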

 

HtmlUnitDriver project page: https://github.com/SeleniumHQ/selenium/wiki/HtmlUnitDriver

Selenium official site: http://www.seleniumhq.org/download/

As the links above suggest, HtmlUnitDriver is a subproject of the Selenium project, and its API classes are shipped inside the selenium-xxx jar packages.

 

Setting up HtmlUnitDriver development in Eclipse

1) Download the Selenium jar package

Many articles online say you can download the Selenium client package and unzip it to get selenium-java-xxx.jar, but in 米扑博客's own tests this does not work with Selenium 3.4.0.

After downloading the latest selenium-java-3.4.0.zip and unzipping it, there is only a lib directory, and the selenium-java-xxx.jar that those articles mention is nowhere to be found. After some fiddling, it turned out that downloading selenium-server-standalone-3.4.0.jar, copying it into the project's lib directory in Eclipse, and referencing it from there works fine.

Download from the Selenium official site: selenium-server-standalone-3.4.0.jar

Copy it into the lib directory of the Eclipse project MimvpProxy; the directory structure is shown below:

[Screenshot: lib directory structure of the MimvpProxy project]

Right-click the project "MimvpProxy" -> Properties -> Java Build Path -> Libraries -> Add JARs... -> select the project "MimvpProxy" -> lib directory -> selenium-server-standalone-3.4.0.jar

[Screenshot: Java Build Path dialog with selenium-server-standalone-3.4.0.jar added under Libraries]

As shown above, the selenium and htmlunitdriver jar libraries are now imported.
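
If you want to double-check that the jar really is on the classpath (a quick sanity check we added; the class name ClasspathCheck is ours), you can print where the HtmlUnitDriver class was loaded from:

public class ClasspathCheck {
	public static void main(String[] args) {
		// With the build path above, this should print the location of
		// selenium-server-standalone-3.4.0.jar inside the project's lib directory.
		System.out.println(org.openqa.selenium.htmlunit.HtmlUnitDriver.class
				.getProtectionDomain().getCodeSource().getLocation());
	}
}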

 

HtmlUnitDriver examples

1) Crawl a web page

package com.mimvp;

import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class MimvpProxy {
	final static String mimvpUrl = "http://mimvp.com";	// URL to crawl

	public static void main(String[] args) {
		getNoProxy();
	}

	// Crawl the page without a proxy
	public static void getNoProxy() {
		HtmlUnitDriver driver = new HtmlUnitDriver(true);	// enable javascript
//		driver.setJavascriptEnabled(true);
		driver.get(mimvpUrl);
		String title = driver.getTitle();
		System.out.println(title);			// prints the page title: 米扑代理 - 全球免费高品质HTTP代理IP实时更新
	}
}
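
A small follow-up (our addition): HtmlUnitDriver keeps an HtmlUnit WebClient alive under the hood, so in longer-running crawlers it is worth releasing it when you are done. A sketch of the same crawl with the driver quit in a finally block, reusing the mimvpUrl constant above:

	// Crawl without a proxy, releasing the driver afterwards (sketch)
	public static void getNoProxySafely() {
		HtmlUnitDriver driver = new HtmlUnitDriver(true);	// enable javascript
		try {
			driver.get(mimvpUrl);
			System.out.println(driver.getTitle());
		} finally {
			driver.quit();		// close the underlying HtmlUnit WebClient
		}
	}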

 

2) Crawl a web page through an HTTP proxy

package com.mimvp;

import org.openqa.selenium.Proxy;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class MimvpProxy {
	final static String proxyUri = "183.222.102.98:8080";				// proxy server (HTTP)
	final static String mimvpUrl = "http://proxy.mimvp.com/exist.php";	// URL to crawl

	public static void main(String[] args) {
		getHttpProxy();
	}

	// Crawl the page through an HTTP proxy
	public static void getHttpProxy() {
		HtmlUnitDriver driver = new HtmlUnitDriver(true);	// enable javascript
		
		// The three options below are alternative ways to set the same HTTP proxy;
		// in practice pick one of them (a later call overrides the earlier setting).

		// Option 1
		driver.setProxy(proxyUri.split(":")[0], Integer.parseInt(proxyUri.split(":")[1]));		// proxyUri = "183.222.102.98:8080"

		// Option 2
		driver.setHTTPProxy(proxyUri.split(":")[0], Integer.parseInt(proxyUri.split(":")[1]), null);	// proxyUri = "183.222.102.98:8080"

		// Option 3
		Proxy proxy = new Proxy();
		proxy.setHttpProxy(proxyUri);		// set the proxy server address, proxyUri = "183.222.102.98:8080"
		driver.setProxySettings(proxy);
		
		driver.get(mimvpUrl);
		
		String html = driver.getPageSource();
		System.out.println(html);
		String title = driver.getTitle();
		System.out.println(title);			// prints the page title: 米扑代理 - 全球免费高品质HTTP代理IP实时更新
	}
}

 

3) Crawl a web page through a SOCKS5 proxy

package com.mimvp;

import org.openqa.selenium.Proxy;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class MimvpProxy {
	final static String proxySocks = "103.14.27.174:1080";				// proxy server (SOCKS5)
	final static String mimvpUrl = "http://proxy.mimvp.com/exist.php";	// URL to crawl

	public static void main(String[] args) {
		getSocksProxy();
	}

	// Crawl the page through a SOCKS proxy
	public static void getSocksProxy() {
		HtmlUnitDriver driver = new HtmlUnitDriver(true);	// enable javascript
		
		// The two options below are alternative ways to set the same SOCKS proxy;
		// in practice pick one of them.

		// Option 1
		driver.setSocksProxy(proxySocks.split(":")[0], Integer.parseInt(proxySocks.split(":")[1]));			// proxySocks = "103.14.27.174:1080"

		// Option 2
		driver.setSocksProxy(proxySocks.split(":")[0], Integer.parseInt(proxySocks.split(":")[1]), null);	// proxySocks = "103.14.27.174:1080"
		
		driver.get(mimvpUrl);
		
		String html = driver.getPageSource();
		System.out.println(html);
		String title = driver.getTitle();
		System.out.println(title);			// prints the page title: 米扑代理 - 全球免费高品质HTTP代理IP实时更新
	}
}

 

4) Crawl a web page through a proxy that requires user authentication

package com.mimvp;

import org.openqa.selenium.Platform;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.WebClient;

public class MimvpProxy {
	final static String proxyUri = "183.222.102.98:8080";				// proxy server (HTTP)
	final static String proxySocks = "103.14.27.174:1080";				// proxy server (SOCKS5)
	final static String mimvpUrl = "http://proxy.mimvp.com/exist.php";	// URL to crawl

	public static void main(String[] args) {
		getAuthProxy();
	}

	// Proxy that requires a username and password
	public static void getAuthProxy() {
		HtmlUnitDriver driver = null;
		
		final String proxyUser = "mimvp-user";
		final String proxyPass = "mimvp-pwd";

		Proxy proxy = new Proxy();
		proxy.setHttpProxy(proxyUri);		// set the proxy server address

		// Set the proxy username and password
		DesiredCapabilities capabilities = DesiredCapabilities.htmlUnit();
		capabilities.setCapability(CapabilityType.PROXY, proxy);
		capabilities.setJavascriptEnabled(true);
		capabilities.setPlatform(Platform.WIN8_1);
		driver = new HtmlUnitDriver(capabilities) {
			@Override
			protected WebClient modifyWebClient(WebClient client) {
				DefaultCredentialsProvider creds = new DefaultCredentialsProvider();
				creds.addCredentials(proxyUser, proxyPass);
				client.setCredentialsProvider(creds);
				return client;
			}
		};
		driver.setJavascriptEnabled(true);	// enable javascript
		driver.get(mimvpUrl);
		String title = driver.getTitle();
		System.out.println(title);			// prints the page title: 米扑代理 - 全球免费高品质HTTP代理IP实时更新
	}
}
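
DefaultCredentialsProvider also offers an overload that scopes the login to a specific host, port and realm. A hedged variation of the anonymous subclass above (passing null for the realm, i.e. any realm), in case you want the credentials tied to the proxy only rather than registered globally:

		driver = new HtmlUnitDriver(capabilities) {
			@Override
			protected WebClient modifyWebClient(WebClient client) {
				DefaultCredentialsProvider creds = new DefaultCredentialsProvider();
				// Restrict the username/password to the proxy's host and port
				creds.addCredentials(proxyUser, proxyPass,
						proxyUri.split(":")[0], Integer.parseInt(proxyUri.split(":")[1]), null);
				client.setCredentialsProvider(creds);
				return client;
			}
		};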

 

5) Automate a Baidu search

package com.mimvp;

import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.interactions.Actions;

public class MimvpProxy {
	public static void main(String[] args) {
		getBaiduSearch("米扑科技");
	}

	// Run a Baidu search
	public static void getBaiduSearch(String keyword) {
		final String url = "http://www.baidu.com";
		WebDriver driver = new HtmlUnitDriver(false);		// JavaScript disabled
		driver.get(url);
		driver.findElement(By.id("kw")).sendKeys(keyword);	// type the keyword into Baidu's search box (id="kw")
		Actions action = new Actions(driver);
		action.sendKeys(Keys.ENTER).perform();				// press ENTER to submit the search
		String html = driver.getPageSource();
		System.out.println(html);
	}
}

Run result:

[Screenshot: run output of the Baidu search example]
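
Instead of dumping the whole page source, you can also pull out just the result titles and links. A sketch (the CSS selector "h3.t a" is our assumption about Baidu's result markup and may need adjusting; it also needs an extra import of org.openqa.selenium.WebElement):

	// Print the search result titles and links from the current page
	public static void printBaiduResults(WebDriver driver) {
		for (WebElement link : driver.findElements(By.cssSelector("h3.t a"))) {
			System.out.println(link.getText() + "\t" + link.getAttribute("href"));
		}
	}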

 

 
