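When using WebMagic, we often need to extract the absolute URLs of the links on a page. This article summarizes several ways to do that; the complete example code is at the end.
Taking a page from Hexun (和讯网) as an example (http://industry.hexun.com/c193_59.shtml, the URL requested in the example code):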
Using XPath
log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").links().all());
log.info("{}", page.getHtml().xpath("//div[@id='cyldata']//a//@abs:href").all());
Using XPath plus a CSS selector
log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").css("a", "abs:href").all());
Using CSS selectors
log.info("{}", page.getHtml().css("div[id='cyldata']").css("a", "abs:href").all());
log.info("{}", page.getHtml().css("div[id='cyldata']").links().all());
log.info("{}", page.getHtml().css("div[id='cyldata'] a").links().all());
log.info("{}", page.getHtml().css("div[id='cyldata'] a", "abs:href").all());
Using jsoup
for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
    log.info("{}", element.attr("abs:href"));
    log.info("{}", element.absUrl("href"));
}
Using jsoup's StringUtil utility class
for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
    log.info("{}", StringUtil.resolve(page.getRequest().getUrl(), element.attr("href")));
}
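All of the approaches above come down to the same idea: resolving each href against the URL of the page it was found on. As a rough illustration of that step, here is a minimal sketch that uses only the JDK's java.net.URI. The class name HrefResolver and its methods are hypothetical names made up for this example; this is not how WebMagic or jsoup implement it internally, and edge cases may be handled differently there.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of WebMagic or jsoup.
final class HrefResolver {

    private HrefResolver() {
    }

    /**
     * Resolves an href against the URL of the page it was found on.
     * Returns an empty string when the href is null or either value
     * cannot be parsed as a URI.
     */
    static String resolve(String pageUrl, String href) {
        if (href == null) {
            return "";
        }
        try {
            return new URI(pageUrl).resolve(new URI(href.trim())).toString();
        } catch (URISyntaxException e) {
            return "";
        }
    }

    /** Resolves a list of hrefs, skipping blank and unparsable entries. */
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue;
            }
            String absolute = resolve(pageUrl, href);
            if (!absolute.isEmpty()) {
                result.add(absolute);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String pageUrl = "http://industry.hexun.com/c193_59.shtml";
        // The hrefs below are made-up inputs, not links taken from the real page.
        // Path-relative: http://industry.hexun.com/c193_60.shtml
        System.out.println(resolve(pageUrl, "c193_60.shtml"));
        // Root-relative: http://industry.hexun.com/news/index.html
        System.out.println(resolve(pageUrl, "/news/index.html"));
        // Protocol-relative: http://stock.hexun.com/a.html
        System.out.println(resolve(pageUrl, "//stock.hexun.com/a.html"));
        // Already absolute: returned unchanged
        System.out.println(resolve(pageUrl, "http://www.hexun.com/"));
    }
}
```

In a real crawler the WebMagic and jsoup calls shown earlier are the more convenient choice, because they work directly on the parsed page; a helper like this is mainly useful when the href strings have already been extracted by other means.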
Example code
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.4.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.ady01</groupId>
    <artifactId>java-pachong</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>java-pachong</name>
    <description>Java crawler project</description>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <!-- webmagic start -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <!-- each exclusion listed once; the original repeated some entries -->
            <exclusions>
                <exclusion>
                    <artifactId>fastjson</artifactId>
                    <groupId>com.alibaba</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>commons-io</artifactId>
                    <groupId>commons-io</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-selenium</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>net.minidev</groupId>
            <artifactId>json-smart</artifactId>
            <version>2.2.1</version>
        </dependency>
        <!-- webmagic end -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.49</version>
        </dependency>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>commons-codec</groupId>
            <artifactId>commons-codec</artifactId>
            <version>1.11</version>
        </dependency>
        <dependency>
            <groupId>commons-collections</groupId>
            <artifactId>commons-collections</artifactId>
            <version>3.2.2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
package com.ady01.demo3;

import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Element;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * <b>description</b>: getting absolute link URLs in WebMagic <br>
 * <b>time</b>: 2019/4/22 10:42 <br>
 * <b>author</b>: WeChat official account 路人甲Java, focused on sharing Java technology
 */
@Slf4j
public class AbsHrefPageProcessor implements PageProcessor {

    private final Site site = Site.me().setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Ways to get the absolute URLs of hyperlinks

        // XPath
        log.info("----------------------XPath------------------------");
        log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").links().all());
        log.info("{}", page.getHtml().xpath("//div[@id='cyldata']//a//@abs:href").all());

        // XPath plus a CSS selector
        log.info("----------------------XPath + CSS selector------------------------");
        log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").css("a", "abs:href").all());

        // CSS selectors
        log.info("----------------------CSS selector------------------------");
        log.info("{}", page.getHtml().css("div[id='cyldata']").css("a", "abs:href").all());
        log.info("{}", page.getHtml().css("div[id='cyldata']").links().all());
        log.info("{}", page.getHtml().css("div[id='cyldata'] a").links().all());
        log.info("{}", page.getHtml().css("div[id='cyldata'] a", "abs:href").all());

        // jsoup: parse the raw HTML with the request URL as the base URI
        log.info("----------------------jsoup------------------------");
        for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
            log.info("{}", element.attr("abs:href"));
            log.info("{}", element.absUrl("href"));
        }

        // jsoup's StringUtil utility class: resolve each raw href against the request URL
        log.info("----------------------jsoup StringUtil------------------------");
        for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
            log.info("{}", StringUtil.resolve(page.getRequest().getUrl(), element.attr("href")));
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Request request = new Request("http://industry.hexun.com/c193_59.shtml");
        Spider.create(new AbsHrefPageProcessor()).addRequest(request).run();
    }
}