Scrapy beginner tutorial

Category: Programming tools · Published: 6 years ago

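This tutorial uses http://quotes.toscrape.com/ as the example site to crawl.

The official Scrapy tutorial uses this site as its example. It has no anti-scraping measures, which makes it a good target for learning.

An example

Crawl the author and text of each quote on the site.

First open the browser's developer tools and inspect the author and text elements to find CSS selectors that match them.

Each quote sits inside a div tag whose class attribute is quote.

A page contains several div.quote elements, so the selector returns a list of matches.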

The code is as follows:

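# Import the package
import scrapy


class QuotesSpider(scrapy.Spider):
    # Spider name, used to run the spider
    name = "quotes"
    # The list of URLs to crawl
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # Scrapy downloads each URL in start_urls and passes the result
    # to this method as a response object.
    # self is the spider instance.
    def parse(self, response):
        # Select every div.quote element in the response
        for quote in response.css('div.quote'):
            pass  # placeholder: the loop body is filled in below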

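The spider is not finished yet: the other tags inside div.quote still have to be selected.

The quote text is the text content of a span tag with the class text.

So add a yield statement that emits one dict per quote: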

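# Import the package
import scrapy


class QuotesSpider(scrapy.Spider):
    # Spider name, used to run the spider
    name = "quotes"
    # The list of URLs to crawl
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # Scrapy downloads each URL in start_urls and passes the result
    # to this method as a response object.
    # self is the spider instance.
    def parse(self, response):
        # Select every div.quote element in the response
        for quote in response.css('div.quote'):
            # Yield one dict per quote
            yield {
                # CSS selector: the text content of the span with class "text"
                'text': quote.css('span.text::text').get(),
                # Placeholder: the author is selected with XPath in the next step
                'author': None,
            }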

Next, use XPath to get the author.

The author name is in a small tag.

So select it with an XPath expression.

The complete code is as follows:

# Import the package
import scrapy


class QuotesSpider(scrapy.Spider):
    # Spider name, used to run the spider
    name = "quotes"
    # The list of URLs to crawl
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # Scrapy downloads each URL in start_urls and passes the result
    # to this method as a response object.
    # self is the spider instance.
    def parse(self, response):
        # Select every div.quote element in the response
        for quote in response.css('div.quote'):
            # Yield one dict per quote
            yield {
                # CSS selector: the text content of the span with class "text"
                'text': quote.css('span.text::text').get(),
                # XPath: the text of the small tag nested inside a span
                'author': quote.xpath('span/small/text()').get(),
            }

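What just happened

Scrapy runs the quotes spider and requests each URL in the start_urls list. When a response arrives, it calls the parse method with two arguments: self, the spider instance, and response, an object that wraps the result of the HTTP request.

From then on, you only need to run selectors on the response object.

Requests are scheduled and processed asynchronously.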

View the generated JSON:

[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"}
]
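
The \u201c and \u201d sequences in the output are JSON escapes for the curly quotation marks around each quote. A minimal sketch with the standard json module shows that they decode back to the original characters; the raw string holds one record copied from the output above.

```python
import json

# One record copied from the output above (raw string keeps the \u escapes)
raw = r'[{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"}]'

records = json.loads(raw)
first = records[0]

print(first['author'])
# The escapes decode to the left and right curly quotation marks
print(first['text'][0] == '\u201c')   # True
print(first['text'][-1] == '\u201d')  # True
```

If unescaped characters are preferred in the file itself, Scrapy's FEED_EXPORT_ENCODING setting can be set to 'utf-8'.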

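As the output shows, the crawl has completed.

Crawling a list of URLs

Several pages can be crawled by listing their URLs.

For example, to crawl the first page and the second page:

http://quotes.toscrape.com/page/1/

http://quotes.toscrape.com/page/2/

Change the code as follows and run it: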

# Import the package
import scrapy


class QuotesSpider(scrapy.Spider):
    # Spider name, used to run the spider
    name = "quotes"

    # Generate the initial requests instead of using start_urls
    def start_requests(self):
        # The list of URLs to crawl
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'
        ]
        # Yield one request per URL; parse handles each response
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Scrapy passes each downloaded page to this method as a response object.
    # self is the spider instance.
    def parse(self, response):
        # Select every div.quote element in the response
        for quote in response.css('div.quote'):
            # Yield one dict per quote
            yield {
                # CSS selector: the text content of the span with class "text"
                'text': quote.css('span.text::text').get(),
                # XPath: the text of the small tag nested inside a span
                'author': quote.xpath('span/small/text()').get(),
            }
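
Since the page URLs differ only in the page number, the list could also be generated instead of typed out. A minimal sketch in plain Python follows; the helper name page_urls and the BASE_URL constant are illustrative, not part of Scrapy.

```python
# URL pattern taken from the two page URLs listed above
BASE_URL = 'http://quotes.toscrape.com/page/{}/'


def page_urls(first, last):
    """Return the page URLs from first to last, inclusive."""
    return [BASE_URL.format(n) for n in range(first, last + 1)]


urls = page_urls(1, 2)
print(urls)
```

The resulting list can be used in place of the hand-written urls list inside start_requests.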

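Data dictionaries

Scraping means extracting structured data from unstructured sources such as web pages. The spider above returns each extracted record as a Python dict.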
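
One way to make that structure explicit is to describe a record with a small class. The sketch below uses a standard-library dataclass; the Quote class is hypothetical and only mirrors the two fields the spider extracts.

```python
from dataclasses import dataclass, asdict


@dataclass
class Quote:
    # The two fields extracted by the spider above
    text: str
    author: str


# Sample values for illustration only
record = Quote(text='A sample quote.', author='Sample Author')

# asdict() gives back the same dict shape the spider yields
print(asdict(record))
```

Naming the fields in one place makes a misspelled key fail early, whereas a plain dict would silently accept it.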
