Summary: The previous post covered the basics of Scrapy. This post introduces Scrapy spiders: the scrapy.Spider base class that every spider inherits from (and its default start_requests() implementation), plus selectors.
Overview
The previous post gave a rough introduction to the basics of Scrapy; this post introduces Scrapy spiders.
The spider lifecycle
- Generate the initial requests, download the pages for those requests, and trigger a callback for each response.
- In the callback, parse the downloaded page and extract structured data from it.
- Within the callback, page content is typically extracted with selectors.
- Finally, persist the extracted data (a minimal pipeline sketch follows this list).
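For the persistence step, one common approach in Scrapy is an item pipeline. Below is a minimal, illustrative sketch; the class name JsonWriterPipeline and the output filename quotes.jl are assumptions, and a real project would still need to enable the pipeline through ITEM_PIPELINES in settings.py.

import json

# Illustrative pipeline (hypothetical name): writes each scraped item as one JSON line
class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('quotes.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # Assumes items are plain dicts, like those yielded by the example spider below
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item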
scrapy.Spider
This is the class that every spider must inherit from; it is the parent class of all spiders.
It provides a default implementation of start_requests().
To define a spider, you subclass scrapy.Spider.
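Because of that default implementation, a spider that only defines start_urls can skip overriding start_requests(): the built-in version generates a Request for each listed URL and sends each response to parse(). A minimal sketch under that assumption (the spider and class names are just for illustration):

import scrapy

class QuotesShortcutSpider(scrapy.Spider):
    name = "quotes_shortcut"  # hypothetical name, for illustration only
    # The default start_requests() builds a Request for each of these URLs
    # and uses parse() as the callback.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Extract just the quote text as a quick demonstration
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}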
An example spider
import scrapy

class QuotesSpider(scrapy.Spider):
    # A unique name that identifies the spider
    name = "quotes"

    # Define the initial requests to crawl
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Callback invoked with each downloaded response
    def parse(self, response):
        # Iterate over the quote blocks and yield structured items
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
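To run the spider and write the extracted items to a file, Scrapy's feed exports can be used from the command line; for example, from inside the project directory:

(venv) ➜ tutorial scrapy crawl quotes -o quotes.json

This crawls both pages, collects every dict yielded by parse(), and serializes the result to quotes.json.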
Selectors
First, open the Scrapy shell:
(venv) ➜ tutorial scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-03 22:49:56 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.18.0-17-generic-x86_64-with-Ubuntu-18.10-cosmic
2019-04-03 22:49:56 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet Password: 7f8ee92d3d9ad6cc
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-03 22:49:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-03 22:49:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-03 22:49:56 [scrapy.core.engine] INFO: Spider opened
2019-04-03 22:49:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2019-04-03 22:49:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fda92a1f908>
[s] item {}
[s] request <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] response <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] settings <scrapy.settings.Settings object at 0x7fda915606d8>
[s] spider <DefaultSpider 'default' at 0x7fda91269080>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
Once the shell has loaded, the downloaded page is available as the response variable,
and a selector is attached to it as the response.selector attribute.
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>>
To actually extract the text, use get():
>>> response.xpath('//title/text()').get()
'Example website'
>>>
A CSS selector works just as well:
>>> response.css('title::text').get()
'Example website'
>>>
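Attributes can be extracted in the same way with ::attr() in CSS (or @attr in XPath), and getall() returns every match instead of only the first. Assuming the five image links that this sample page contains, the output should look roughly like:

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>>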
That's about it for selectors. So what's next? Start crawling; just crawl something casually for practice, no need to go that deep...