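When working with WebMagic, we often need the absolute URLs of the links on a page. This post summarizes several ways to get them; the complete sample code is at the end.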
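To make "absolute URL" concrete: a relative href is resolved against the URL of the page it was found on. The following is a minimal sketch using only the JDK's java.net.URI, not WebMagic itself. The class name ResolveDemo, the helper toAbsolute, and the sample href values are hypothetical; the base URL is the page requested in the sample code of this post.

```java
import java.net.URI;

public class ResolveDemo {

    // Resolve an href against the URL of the page it appeared on.
    public static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "http://industry.hexun.com/c193_59.shtml";

        // Path relative to the current page
        System.out.println(toAbsolute(pageUrl, "a.shtml"));               // http://industry.hexun.com/a.shtml
        // Path relative to the site root
        System.out.println(toAbsolute(pageUrl, "/news/b.html"));          // http://industry.hexun.com/news/b.html
        // Protocol-relative link: only the scheme is taken from the page
        System.out.println(toAbsolute(pageUrl, "//stock.hexun.com/"));    // http://stock.hexun.com/
        // Already absolute: returned unchanged
        System.out.println(toAbsolute(pageUrl, "http://www.hexun.com/")); // http://www.hexun.com/
    }
}
```

The approaches in the rest of this post perform this kind of resolution for you, using the page URL as the base.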
Taking a page from Hexun (和讯网) as an example:
Using XPath
log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").links().all());
log.info("{}", page.getHtml().xpath("//div[@id='cyldata']//a//@abs:href").all());
Using XPath plus a CSS selector
log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").css("a", "abs:href").all());
Using CSS selectors
log.info("{}", page.getHtml().css("div[id='cyldata']").css("a", "abs:href").all());
log.info("{}", page.getHtml().css("div[id='cyldata']").links().all());
log.info("{}", page.getHtml().css("div[id='cyldata'] a").links().all());
log.info("{}", page.getHtml().css("div[id='cyldata'] a", "abs:href").all());
Using jsoup
for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
    log.info("{}", element.attr("abs:href"));
    log.info("{}", element.absUrl("href"));
}
Using jsoup's StringUtil utility class
for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
    log.info("{}", StringUtil.resolve(page.getRequest().getUrl(), element.attr("href")));
}
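One caveat about this last approach: org.jsoup.helper.StringUtil is an internal helper rather than a documented extension point, and in newer jsoup releases the class appears to live in org.jsoup.internal instead, so the import may need adjusting after an upgrade. If that is a concern, the resolution step can be done with the JDK alone. The sketch below is not jsoup's implementation. It only mirrors the convention documented for absUrl of returning an empty string when no absolute URL can be made. The class and method names are hypothetical, and dropping anything that is not http/https is an added assumption about what a crawler wants to follow.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SafeLinkResolver {

    // Returns the absolute form of href, or "" when it cannot be parsed or
    // resolved, or when the result is not an http/https URL (assumption:
    // javascript:, mailto: and similar links are of no use to the crawler).
    public static String resolveOrEmpty(String baseUrl, String href) {
        if (baseUrl == null || href == null) {
            return "";
        }
        try {
            URI resolved = new URI(baseUrl).resolve(new URI(href.trim()));
            String scheme = resolved.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return isHttp ? resolved.toString() : "";
        } catch (URISyntaxException e) {
            // For example, an href containing an unescaped space
            return "";
        }
    }

    // Resolves every href and keeps only the ones that produced a usable URL.
    public static List<String> resolveAll(String baseUrl, List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            String absolute = resolveOrEmpty(baseUrl, href);
            if (!absolute.isEmpty()) {
                result.add(absolute);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String pageUrl = "http://industry.hexun.com/c193_59.shtml";
        List<String> hrefs = Arrays.asList("a.shtml", "javascript:void(0)", "https://www.hexun.com/");
        // Only the first and last entries survive the filter
        System.out.println(resolveAll(pageUrl, hrefs));
    }
}
```

Inside a PageProcessor, the raw href values could be collected with element.attr("href") as in the loop above and passed to resolveAll together with page.getRequest().getUrl().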
Sample code
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.4.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.ady01</groupId>
    <artifactId>java-pachong</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>java-pachong</name>
    <description>Java crawler project</description>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <!-- webmagic start -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <artifactId>fastjson</artifactId>
                    <groupId>com.alibaba</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>commons-io</artifactId>
                    <groupId>commons-io</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-selenium</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>net.minidev</groupId>
            <artifactId>json-smart</artifactId>
            <version>2.2.1</version>
        </dependency>
        <!-- webmagic end -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.49</version>
        </dependency>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>commons-codec</groupId>
            <artifactId>commons-codec</artifactId>
            <version>1.11</version>
        </dependency>
        <dependency>
            <groupId>commons-collections</groupId>
            <artifactId>commons-collections</artifactId>
            <version>3.2.2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
package com.ady01.demo3;

import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Element;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * <b>description</b>: getting absolute URLs in WebMagic <br>
 * <b>time</b>: 2019/4/22 10:42 <br>
 * <b>author</b>: WeChat official account 路人甲Java
 */
@Slf4j
public class AbsHrefPageProcessor implements PageProcessor {

    private final Site site = Site.me().setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Ways to get the absolute URLs of hyperlinks

        // XPath
        log.info("---------------------- XPath ------------------------");
        log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").links().all());
        log.info("{}", page.getHtml().xpath("//div[@id='cyldata']//a//@abs:href").all());

        // XPath plus a CSS selector
        log.info("---------------------- XPath + CSS selector ------------------------");
        log.info("{}", page.getHtml().xpath("//div[@id='cyldata']").css("a", "abs:href").all());

        // CSS selectors
        log.info("---------------------- CSS selector ------------------------");
        log.info("{}", page.getHtml().css("div[id='cyldata']").css("a", "abs:href").all());
        log.info("{}", page.getHtml().css("div[id='cyldata']").links().all());
        log.info("{}", page.getHtml().css("div[id='cyldata'] a").links().all());
        log.info("{}", page.getHtml().css("div[id='cyldata'] a", "abs:href").all());

        // jsoup: parse the raw HTML with the request URL as the base URI
        log.info("---------------------- jsoup ------------------------");
        for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
            log.info("{}", element.attr("abs:href"));
            log.info("{}", element.absUrl("href"));
        }

        // jsoup's StringUtil utility class: resolve each raw href against the request URL
        log.info("---------------------- jsoup StringUtil ------------------------");
        for (Element element : Jsoup.parse(page.getRawText(), page.getRequest().getUrl()).select("#cyldata a")) {
            log.info("{}", StringUtil.resolve(page.getRequest().getUrl(), element.attr("href")));
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Request request = new Request("http://industry.hexun.com/c193_59.shtml");
        Spider.create(new AbsHrefPageProcessor()).addRequest(request).run();
    }
}
Execution result: [output screenshot not preserved]