Crawler Basics: Simulated Login with Scrapy (Part 14)


Summary: any site that just needs POSTed data can be logged into this way; in the example below, the POSTed data is the account and password. The article also covers cookie handling when sending requests, and a fallback: if nothing else works, you can simulate the login with a saved cookie, which is a bit more work but has a 100% success rate.

Note: when simulating a login, COOKIES_ENABLED (the cookies middleware) must be enabled in settings.py:

COOKIES_ENABLED = True
# COOKIES_ENABLED = False
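While getting a login working, it also helps to watch the cookies on the wire; Scrapy's COOKIES_DEBUG setting logs every Cookie and Set-Cookie header. A settings.py fragment:

# settings.py
COOKIES_ENABLED = True  # on by default; required for simulated logins
COOKIES_DEBUG = True    # log cookies sent and received while debugging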

Strategy 1: POST the data directly (e.g. the account credentials needed to log in)

Whenever all the site needs is POSTed data, this method works. In the example below, the POSTed data is the account and password:

  • You can send a POST request with the yield scrapy.FormRequest(url, formdata, callback) method.
  • If you want the program to send a POST request right at the start, override the Spider class's start_requests(self) method; the URLs in start_urls will then no longer be requested.
import scrapy

class mySpider(scrapy.Spider):
    name = "myspider"
    # start_urls = ["http://www.example.com/"]

    def start_requests(self):
        url = 'http://www.renren.com/PLogin.do'  # the form's action URL, taken from the page source

        # FormRequest is how Scrapy sends POST requests
        yield scrapy.FormRequest(
            url = url,
            formdata = {"email" : "mr_mao_hacker@163.com", "password" : "axxxxxxxe"},
            callback = self.parse_page
        )

    def parse_page(self, response):
        # do something with the logged-in response here
        pass

Strategy 2: The standard simulated-login steps

The canonical login method:

FormRequest.from_response()
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that the login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...

Simulating a browser login

start_requests() returns the spider's initial requests; they stand in for start_urls — the requests returned by start_requests() replace the ones that would be generated from start_urls.

Request() sends a GET request; you can set the url, cookies, and a callback.

FormRequest.from_response() submits a form via POST. Its first, required argument is the response object from the previous request (whose cookies it picks up); the other arguments include cookies, url, the form fields, and so on. The sketch below ties the three pieces together.
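A minimal sketch of the full flow; the spider name, URLs, and form field names are all hypothetical:

import scrapy

class ProtectedSiteSpider(scrapy.Spider):
    name = 'protected_site'

    def start_requests(self):
        # replaces start_urls: GET the login page first
        yield scrapy.Request('http://www.example.com/login',
                             callback=self.login)

    def login(self, response):
        # POST the credentials; hidden form fields are copied from `response`
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # the authenticated session carries over to later requests
        yield scrapy.Request('http://www.example.com/members',
                             callback=self.parse_members)

    def parse_members(self, response):
        self.logger.info("members page: %s", response.url)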

The canonical login method

import scrapy

# The canonical login method:
# first send a GET request for the login page and pick up any parameters
# the login requires (Zhihu's _xsrf token, for example),
# then POST them to the server together with the account and password.

# The second standard approach
class RenrenLoginSpider(scrapy.Spider):  # a containing spider class (name assumed)
    name = 'renren_login'
    start_urls = ['http://www.renren.com/']  # the login page
    headers = {}  # a real User-Agent would go here; referenced below

    def parse(self, response):
        print(response.body.decode('utf-8'), "@@" * 40)
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "email": "18588403840",
                "origURL": "http://www.renren.com/422167102/profile",
                "domain": "renren.com",
                "key_id": "1",
                "captcha_type": "web_login",
                "password": "97bfc03b0eec4df7c76eaec10cd08ea57b01eefd0c0ffd4c0e5061ebd66460d9",
                "rkey": "26615a8e93fee56fc1fb3d679afa3cc4",
                "f": ""
            },
            dont_filter=True,
            headers=self.headers,
            callback=self.get_page)

    def get_page(self, response):
        print("===================", response.url)
        print(response.body.decode('utf-8'))
        url = "http://www.renren.com/353111356/profile"
        yield scrapy.Request(url, callback=self.get_info)

    def get_info(self, response):
        print('*******' * 30)
        print(response.body.decode('utf-8'))

yield Request() hands a new request back to the engine for the spider to execute.

Cookie handling when sending requests:

  • meta={'cookiejar': 1} turns on cookie recording; it goes in the first Request().
  • meta={'cookiejar': response.meta['cookiejar']} reuses the cookies from the previous response; it goes in the FormRequest.from_response() that POSTs the login.
  • meta={'cookiejar': True} sends the authorized cookies when visiting pages that require a login.

import scrapy
from scrapy.http import Request, HtmlResponse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyrenSpider(CrawlSpider):
    name = 'myren'
    allowed_domains = ['renren.com']
    start_urls = ["http://www.renren.com/353111356/profile"]

    rules = [Rule(LinkExtractor(allow=(r'(\d+)/profile',)), callback='get_info', follow=True)]
    headers = {
        "Accept": "*/*",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
    }

    def start_requests(self):
        yield scrapy.Request(url="http://www.renren.com/", meta={'cookiejar': 1}, callback=self.post_login)

    # The second standard approach
    def post_login(self, response):
        yield scrapy.FormRequest.from_response(response,
                                               url="http://www.renren.com/PLogin.do",
                                               meta={'cookiejar': response.meta['cookiejar']},
                                               # requires meta={'cookiejar': 1} on the earlier request
                                               headers=self.headers,
                                               formdata={
                                                   "email": "18588403840",
                                                   "password": "Changeme_123"
                                               },
                                               dont_filter=True,
                                               callback=self.after_login)

    def after_login(self, response):
        for url in self.start_urls:
            # yield self.make_requests_from_url(url)
            yield scrapy.Request(url, meta={'cookiejar': response.meta['cookiejar']})

    def get_info(self, response):
        print('*******' * 30)
        print(response.body.decode('utf-8'))
        
    def _requests_to_follow(self, response):
        """Overridden so the cookiejar is propagated to rule-followed requests."""
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                # this line is the override: carry the cookiejar forward on every followed link
                r.meta.update(rule=n, link_text=link.text, cookiejar=response.meta['cookiejar'])
                yield rule.process_request(r)
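Without this override, the requests that CrawlSpider builds from its rules would not carry the cookiejar meta key, so every rule-followed link would be fetched outside the authenticated session; copying response.meta['cookiejar'] onto each generated request keeps the whole crawl logged in.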

Strategy 3: Log in directly with a Cookie that already carries the login state

If nothing else works, you can simulate the login this way. It is a little more tedious, but the success rate is 100%.

transCookie: parse the browser cookie string into dict form

class transCookie:

    def __init__(self, cookie):
        self.cookie = cookie

    def stringToDict(self):
        '''
        Convert a cookie string copied from the browser into a dict Scrapy can use.
        :return: dict mapping cookie names to values
        '''
        itemDict = {}
        items = self.cookie.split(';')
        for item in items:
            # split on the first '=' only, in case the value itself contains '='
            key, value = item.split('=', 1)
            itemDict[key.strip()] = value
        return itemDict

if __name__ == "__main__":
    cookie = "your cookie string here"
    trans = transCookie(cookie)
    print(trans.stringToDict())

Put the parsed cookies into the request

# -*- coding: utf-8 -*-
import scrapy

class RenrenSpider(scrapy.Spider):
    name = "renren"
    allowed_domains = ["renren.com"]
    start_urls = [
        'http://www.renren.com/111111',
        'http://www.renren.com/222222',
        'http://www.renren.com/333333',
        ]  # list of start URLs
    cookies = {
        "anonymid": "ixrna3fysufnwv",
        "_r01_": "1",
        "ap": "327550029",
        "JSESSIONID": "abciwg61A_RvtaRS3GjOv",
        "depovince": "GW",
        "springskin": "set",
        "jebe_key": "f6fb270b-d06d-42e6-8b53-e67c3156aa7e%7Cc13c37f53bca9e1e7132d4b58ce00fa3%7C1484060607478%7C1%7C1486198628950",
        "t": "691808127750a83d33704a565d8340ae9",
        "societyguester": "691808127750a83d33704a565d8340ae9",
        "id": "327550029",
        "xnsid": "f42b25cf",
        "loginfrom": "syshome"
    }

    # Override the Spider class's start_requests method to attach the Cookie values
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=self.cookies, callback=self.parse)

    # handle the response
    def parse(self, response):
        print("===========" + response.url)
        with open("deng.html", "wb") as filename:
            filename.write(response.body)

Strategy 4: Use the Selenium plugin (works for everything)

1 spider.browser.page_source gets the rendered page source

2 session.get(request.url).text gets the response body

3 requests manages cookies with a Session (see the sketch after this list)

4 urllib manages cookies with a CookieJar (also in the sketch below)
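A runnable sketch of points 3 and 4; the example.com URL is a placeholder:

import requests
from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

# requests: a Session stores cookies across requests automatically
session = requests.session()
session.get("http://www.example.com/login")  # any Set-Cookie headers are captured
print(session.cookies.get_dict())            # inspect what was stored

# urllib: a CookieJar attached to an opener plays the same role
jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))
opener.open("http://www.example.com/login")
for cookie in jar:
    print(cookie.name, cookie.value)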

Simulating a Taobao login

class TaobaoSpider(scrapy.Spider):
    name = 'mytaobao'
    allowed_domains = ['taobao.com']
    start_urls = ['https://login.m.taobao.com/login.htm',
                  "http://h5.m.taobao.com/mlapp/olist.html?spm=a2141.7756461.2.6"]

    def __init__(self):  # initialization
        self.browser = None
        self.cookies = None
        super(TaobaoSpider, self).__init__()  # hand off to the parent class

    def parse(self, response):
        # print the URL and the page source
        print(response.url)
        print(response.body.decode("utf-8", "ignore"))
Downloader middleware: a custom LoginMiddleware performs the login

from scrapy import signals
from selenium import webdriver
from scrapy.http import HtmlResponse  # to return a page response directly
import requests
import time

class LoginMiddleware(object):
    '''
    Find the username/password inputs and send_keys to them,
    click login and grab the cookies with spider.browser.get_cookies(),
    then return the page as an HtmlResponse.
    '''
    def process_request(self, request, spider):
        if spider.name == "mytaobao":  # only handle the spider with this name
            if request.url.find("login") != -1:  # is this the login page?
                mobilesetting = {"deviceName": "iPhone 6 Plus"}
                options = webdriver.ChromeOptions()  # browser options
                options.add_experimental_option("mobileEmulation", mobilesetting)  # emulate a phone
                spider.browser = webdriver.Chrome(chrome_options=options)  # create a browser instance
                spider.browser.set_window_size(400, 800)  # phone-sized window

                spider.browser.get(request.url)  # open the link in the browser
                time.sleep(3)  # must sleep: typing the username and password takes time
                print("login page:", request.url)
                username = spider.browser.find_element_by_id("username")
                password = spider.browser.find_element_by_id("password")
                time.sleep(1)
                username.send_keys("2403239393@qq.com")  # account
                time.sleep(2)
                password.send_keys("bama100")  # password
                time.sleep(2)
                spider.browser.find_element_by_id("btn-submit").click()

                time.sleep(4)
                spider.cookies = spider.browser.get_cookies()  # grab all the cookies
                # spider.browser.close()
                return HtmlResponse(url=spider.browser.current_url,  # current URL
                                    body=spider.browser.page_source,  # page source
                                    encoding="utf-8")  # return the page
            else:  # already logged in
                '''
                1 keep the cookies in a requests.session
                2 set each cookie with session.cookies.set(name, value)
                3 clear the headers with session.headers.clear()
                4 issue a GET request with session.get(url)
                '''
                print("request via session")
                session = requests.session()  # session
                for cookie in spider.cookies:
                    session.cookies.set(cookie['name'], cookie["value"])
                session.headers.clear()  # clear the headers
                newpage = session.get(request.url)
                print("---------------------")
                print(request.url)
                print("---------------------")
                print(newpage.text)
                print("---------------------")
                # give the page a moment
                time.sleep(3)
                return HtmlResponse(url=request.url,  # current URL
                                    body=newpage.text,  # page source
                                    encoding="utf-8")  # return the page
