Summary: The previous post covered the basics of Scrapy; this post introduces Scrapy spiders. Every spider must inherit from scrapy.Spider, the parent class of all spiders, which provides a default implementation of start_requests().
Overview
The previous post covered the basics of Scrapy; this post introduces Scrapy spiders.
The spider lifecycle
- Generate the initial requests, download the pages those requests point to, and trigger a callback function for each response.
- In the callback function, parse the downloaded page and extract structured data from it.
- Within the callback, page content is typically extracted with selectors (covered in the Selectors section below).
- Persist the extracted data.
scrapy.Spider
Every spider must inherit from this class; it is the parent class of all spiders.
It provides a default implementation of start_requests().
To define a spider, create a subclass of this class; scrapy.Spider is not used directly.
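Because the default start_requests() yields a request for each URL in the spider's start_urls attribute, a minimal spider does not need to override it at all. A minimal sketch (the spider name and URL here are just placeholders):

import scrapy

class MinimalSpider(scrapy.Spider):
    name = "minimal"  # placeholder spider name
    # the default start_requests() yields a Request for each URL listed here
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # parse() is the default callback for those requests
        self.log('Visited %s' % response.url)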
A fuller example spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    # identifies the spider
    name = "quotes"

    # define the requests to start crawling from
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # the callback that parses each downloaded page
    def parse(self, response):
        # iterate over the quote blocks on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
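To run the spider and persist the items it yields, Scrapy's built-in feed exports can write them straight to a file; from inside the project directory:

scrapy crawl quotes -o quotes.json

The -o option serializes every yielded dict into quotes.json, which covers the persistence step of the lifecycle above.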
Selectors
First, open the shell:
(venv) ➜  tutorial scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.18.0-17-generic-x86_64-with-Ubuntu-18.10-cosmic
2019-04-03 22:49:56 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet Password: 7f8ee92d3d9ad6cc
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-03 22:49:56 [scrapy.core.engine] INFO: Spider opened
2019-04-03 22:49:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2019-04-03 22:49:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fda92a1f908>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x7fda915606d8>
[s]   spider     <DefaultSpider 'default' at 0x7fda91269080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
>>>
Once the shell has loaded, the response is available in the response variable,
with a selector attached to it through the response.selector attribute; response.xpath() and response.css() are shortcuts to that selector.
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>>
To actually extract the text data, use get():
>>> response.xpath('//title/text()').get()
'Example website'
>>>
A CSS selector works as well:
>>> response.css('title::text').get()
'Example website'
>>>
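Two more patterns worth knowing: .getall() returns every match instead of only the first, and ::attr(name) (or @name in XPath) selects attribute values. A quick sketch against the same sample page, whose links point at image1.html through image5.html:

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.xpath('//a/@href').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>>

When nothing matches, get() returns None (or the value passed as its default argument), so it is safe to call on selectors that may be empty.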
That's about all there is to it. So what's next? Time to start crawling; crawling around casually will do for now... no need to go that deep...