Scrapy: Spiders, Selectors, and Storing URL Data

Summary: The previous post covered the basics of Scrapy. This one introduces Scrapy spiders: the class every spider must inherit from, the parent of all spiders, which provides a default start_requests() implementation.

Overview

The previous post covered the basics of Scrapy; this one introduces Scrapy spiders.

The spider lifecycle

  1. Generate the initial requests, specifying the pages to download, and set a callback to be invoked with each response.
  2. In the callback, parse the downloaded page and extract structured data from it.
  3. Also in the callback, page content can be extracted with selectors.
  4. Persist the extracted data (see the feed-export example after the spider code below).

scrapy.Spider

Every spider must inherit from this class; it is the parent class of all spiders.

It provides a default implementation of start_requests().
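
Roughly, that default implementation just walks the spider's start_urls list and yields a Request for each entry. A simplified sketch, not Scrapy's exact code:

# Simplified sketch of scrapy.Spider's default start_requests()
def start_requests(self):
    for url in self.start_urls:
        # dont_filter=True lets start URLs bypass the duplicate filter
        yield scrapy.Request(url, dont_filter=True)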

To define a spider, subclass scrapy.Spider and override its attributes and methods as needed.

An example spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    # A unique name that identifies this spider
    name = "quotes"

    # Generate the initial requests to crawl
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    # Callback that parses each downloaded response
    def parse(self, response):
        for quote in response.css('div.quote'):
            # Yield one dict per quote block on the page
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
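
To run this spider and persist the scraped items (step 4 of the lifecycle above), Scrapy's built-in feed exports can write them straight to a file. From the project directory (the file name quotes.json here is just an example):

(venv) ➜  tutorial scrapy crawl quotes -o quotes.json

The -o flag appends the extracted items to quotes.json, and Scrapy picks the serialization format from the file extension.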

Selectors

First, open the Scrapy shell:

(venv) ➜  tutorial scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Linux-4.18.0-17-generic-x86_64-with-Ubuntu-18.10-cosmic
2019-04-03 22:49:56 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet Password: 7f8ee92d3d9ad6cc
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-03 22:49:56 [scrapy.core.engine] INFO: Spider opened
2019-04-03 22:49:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2019-04-03 22:49:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fda92a1f908>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x7fda915606d8>
[s]   spider     <DefaultSpider 'default' at 0x7fda91269080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

Once the shell has loaded, the response is available as the variable response.

A selector is also attached through the response.selector attribute.

>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>>
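
response.xpath() and response.css() are shortcuts for response.selector.xpath() and response.selector.css(), so the query above can also be written as:

>>> response.selector.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>>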

To actually extract the text data, use get():

>>> response.xpath('//title/text()').get()
'Example website'
>>>

A CSS selector works just as well:

>>> response.css('title::text').get()
'Example website'
>>>
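
get() returns only the first match; getall() returns every match as a list. For example, collecting the href of each link on this sample page (the exact output depends on the page, but here it should look like this):

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>>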

That's about it. What's next? Start crawling; no need to go too deep for now.

