Overview
The previous post covered the basics of Scrapy; this one moves on to Scrapy spiders.
The lifecycle of a spider
- Generate the initial requests, download the pages they point to, and trigger the callback functions.
- In each callback, parse the downloaded page and extract structured data from it.
- Inside the callback, page content is extracted with selectors (CSS or XPath expressions).
- Persist the extracted data.
scrapy.Spider
Every spider must inherit from this class; it is the base class of all spiders.
It provides a default implementation of start_requests().
To define a spider, create a class that subclasses scrapy.Spider.
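Because start_requests() already has a default implementation, the simplest possible spider only needs a name and a start_urls list. The sketch below (the spider name and URL are just placeholders) relies on that default, which turns every entry of start_urls into a request whose callback is parse():

```python
import scrapy


class TitleSpider(scrapy.Spider):
    # Minimal spider relying on the default start_requests():
    # each URL in start_urls becomes a Request with callback=self.parse.
    name = "titles"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Extract the page title as a single item
        yield {"title": response.css("title::text").get()}
```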
A fuller example spider, this time overriding start_requests() explicitly:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # identifies the spider
    name = "quotes"

    # define the requests to crawl
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # the callback method
    def parse(self, response):
        for quote in response.css('div.quote'):
            # yield one item per quote block
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
```
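Assuming this spider is saved inside a Scrapy project (for example the tutorial project used for the shell session below), it can be run with the crawl command; the -o option stores the yielded items in a file, which covers the "persist" step of the lifecycle:

```
# run from the project root; the scraped items are written to quotes.json
scrapy crawl quotes -o quotes.json
```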
Selectors
First, open the Scrapy shell:
```
(venv) ➜ tutorial scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.18.0-17-generic-x86_64-with-Ubuntu-18.10-cosmic
2019-04-03 22:49:56 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet Password: 7f8ee92d3d9ad6cc
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-03 22:49:56 [scrapy.core.engine] INFO: Spider opened
2019-04-03 22:49:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2019-04-03 22:49:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fda92a1f908>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x7fda915606d8>
[s]   spider     <DefaultSpider 'default' at 0x7fda91269080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>
```
Once the shell has loaded, the response is available as the response variable, and a selector for it is attached as response.selector.
```
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>>
```
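response.xpath() and response.css() are shortcuts for response.selector.xpath() and response.selector.css(), so the same query can also be run through the selector directly:

```
>>> response.selector.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
```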
To actually extract the text data, use get():
```
>>> response.xpath('//title/text()').get()
'Example website'
>>>
```
CSS selectors work just as well:
```
>>> response.css('title::text').get()
'Example website'
>>>
```
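get() only returns the first match. To collect every match, for example all of the link URLs on a page so they can be stored or crawled later, use getall() instead; the snippet below shows just the query, since the exact output depends on whatever links the sample page contains:

```
>>> # getall() returns a list with one entry per matching node
>>> response.css('a::attr(href)').getall()
```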
That about covers it. Next step: just start crawling something simple; no need to go much deeper than this for now.