Crawler-to-Database in Practice: Beating Anti-Crawler Defenses


Because a crawler has to fetch pages in bulk, its IP is liable to get banned, so the usual countermeasures are spoofing the User-Agent, routing requests through proxies, and forging the "real" client IP with headers such as X-Forwarded-For. For X-Forwarded-For, Client-IP and REMOTE_ADDR checks you can also use Burp's Intruder module and simply generate the four payloads at random. This article focuses on testing with User-Agent rotation and proxy IPs.
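For context, forging those headers from Python instead of Burp looks roughly like the sketch below. This is my own illustration rather than part of the original script, and which of these headers a given server actually trusts varies:

import random

def fake_ip_headers():
    # Build one random IPv4 address and reuse it for the common
    # "real client IP" headers that some backends read blindly.
    ip = '.'.join(str(random.randint(1, 254)) for _ in range(4))
    return {'X-Forwarded-For': ip, 'Client-IP': ip, 'X-Real-IP': ip}

# usage: requests.get(url, headers=dict(headers, **fake_ip_headers()))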

0x00 Setting the Encodings

First, set the default encoding at each layer: the Python source, the database, and the connection.

File encoding (Python 2 only; Python 3 already defaults to UTF-8):
reload(sys)
sys.setdefaultencoding('utf-8')

Database encoding:
CREATE DATABASE `proxy` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

Connection charset:
charset='utf8mb4'
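As a quick sanity check (my addition; it assumes the cur cursor from section 0x02 is already open), you can ask MySQL which character set the session actually negotiated:

# With charset='utf8mb4' in the connect() call, this should report 'utf8mb4'.
cur.execute("SHOW VARIABLES LIKE 'character_set_connection'")
print cur.fetchone()   # ('character_set_connection', 'utf8mb4')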

0x01 Setting the User-Agent

config={
        'NUM':10,
        'timeout':5,
        'USER_AGENTS':[
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",]}
headers = {'User-Agent': random.choice(config['USER_AGENTS'])}

A dictionary of UA strings works fine; for simplicity I just pasted in a few at random.
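Note that headers is built once at import time, so every request reuses the same User-Agent. If you want a fresh one per request, a minimal variation (my sketch, not in the original code) is:

def random_headers():
    # Draw a new User-Agent from the pool for each request so that
    # successive requests don't share a single fingerprint.
    return {'User-Agent': random.choice(config['USER_AGENTS'])}

# usage: requests.get(url, headers=random_headers(), timeout=config['timeout'])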


0x02 Setting Up the Database Connection

Data is written to MySQL through Python's PyMySQL module, which supports both Python 2 and Python 3.

conn = pymysql.connect(
 host='127.0.0.1',
 port=3306,
 user='root',
 passwd='123456',
 db='proxy',
 charset='utf8mb4',
 )
cur = conn.cursor()
sql = "INSERT INTO proxylist(title,price) VALUES ('test1','100')"
cur.execute(sql)
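Two caveats here. The proxylist table is never created anywhere in the post, so the schema below is my assumption (the column names match the later INSERTs, the types are a guess). Also, PyMySQL does not autocommit by default, so nothing is persisted until conn.commit() is called:

# Assumed schema for the table used throughout this article.
cur.execute("""
    CREATE TABLE IF NOT EXISTS proxylist (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        price VARCHAR(32) NOT NULL
    ) DEFAULT CHARACTER SET utf8mb4
""")

# Without a commit, the INSERT above stays in an open transaction and is lost.
conn.commit()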

0x03 Fetching Proxies

Proxies come from the free 西刺 (Xici) proxy list -- no budget for paid ones -- with requests and BeautifulSoup handling the parsing.

r = requests.get(url=url_xichi,headers=headers)
soup = bs(r.content,'lxml')
datas = soup.find_all(name='tr',attrs={'class':re.compile('(odd)|()')})
for data in datas:
    proxys = data.find_all(name='td')
    ip = str(proxys[1].string)
    port = str(proxys[2].string)
    type = str(proxys[5].string).lower()
    avail_proxy = proxy_check(ip,port,type)
    if avail_proxy != None:
        return avail_proxy
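As written, proxy_spider() stops at the first proxy that passes the check. If you would rather build a small pool to rotate through, a sketch along the same lines (my addition; it assumes url_xichi, headers and proxy_check from the surrounding code are in scope) is:

def proxy_pool(limit=5):
    # Collect up to `limit` verified proxies instead of returning the first one.
    pool = []
    r = requests.get(url=url_xichi, headers=headers)
    soup = bs(r.content, 'lxml')
    for data in soup.find_all(name='tr', attrs={'class': re.compile('(odd)|()')}):
        proxys = data.find_all(name='td')
        ip = str(proxys[1].string)
        port = str(proxys[2].string)
        type = str(proxys[5].string).lower()
        avail = proxy_check(ip, port, type)
        if avail is not None:
            pool.append(avail)
        if len(pool) >= limit:
            break
    return pool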

0x04 Verifying Proxies

Proxy liveness is verified with the IP-lookup endpoint from Chinaz (站长工具).

try:
    r = requests.get(url=url_check,proxies=proxylist,timeout=5)
    find_ip = re.findall(r'\'(.*?)\'',r.text)[0]
    if ip == find_ip:
        return proxylist
except Exception as e:
    pass
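The Chinaz endpoint returns the caller's IP wrapped in single quotes, which is why the regex grabs the first quoted string. If that endpoint is unreachable, an alternative check against httpbin.org/ip works the same way (a different service from the one used in the original, so treat it as a substitute):

def proxy_check_httpbin(ip, port, type):
    # Same idea as proxy_check, but httpbin returns JSON: {"origin": "1.2.3.4"}.
    proxylist = {type: '%s:%s' % (ip, port)}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxylist, timeout=5)
        if ip in r.json().get('origin', ''):
            return proxylist
    except Exception:
        pass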

0x05 Writing the Data to the Database

As a demo target, the script crawls the course catalogue of 谷安网校 (edu.aqniu.com), grabbing each course name and its price.

The result: each course title and its price stored as a row in the proxylist table (the original post includes a screenshot of the table contents).

Appendix: full source code

#!/usr/bin/python
# -*- coding: UTF-8 -*-
'''
@Author:W2n1ck
@Index:http://www.w2n1ck.com/
'''
import random
import re
import requests
import pymysql
import sys
import time

reload(sys)
sys.setdefaultencoding('utf-8')

from bs4 import BeautifulSoup as bs

config={
        'NUM':10,
        'timeout':5,
        'USER_AGENTS':[
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",]}
headers = {'User-Agent': random.choice(config['USER_AGENTS'])}
proxy = []

conn = pymysql.connect(
 host='127.0.0.1',
 port=3306,
 user='root',
 passwd='123456',
 db='proxy',
 charset='utf8mb4',
 )
cur = conn.cursor()
#sql = "INSERT INTO proxylist(title,price) VALUES ('test2','200')"
#cur.execute(sql)

def proxy_spider():
    # Scrape the Xici free-proxy list and return the first proxy that passes proxy_check().
    url_xichi = 'http://www.xicidaili.com/nn/'
    r = requests.get(url=url_xichi,headers=headers)
    soup = bs(r.content,'lxml')
    datas = soup.find_all(name='tr',attrs={'class':re.compile('(odd)|()')})
    #print datas
    for data in datas:
        proxys = data.find_all(name='td')
        ip = str(proxys[1].string)
        #ip = 'http://'+ip
        port = str(proxys[2].string)
        type = str(proxys[5].string).lower()
        avail_proxy = proxy_check(ip,port,type)
        if avail_proxy != None:
            return avail_proxy
def proxy_check(ip,port,type):
    # Request an IP-echo service through the proxy and keep the proxy only
    # if the reported IP matches the proxy's own address.
    url_check = 'http://ip.chinaz.com/getip.aspx'
    proxylist = {}
    proxylist[type] = '%s:%s' % (ip,port)
    #print proxylist
    try:
        r = requests.get(url=url_check,proxies=proxylist,timeout=5)
        find_ip = re.findall(r'\'(.*?)\'',r.text)[0]
        #print find_ip
        if ip == find_ip:
            return proxylist
            #proxy.append(find_ip)
        #print proxy
    except Exception as e:
        pass

def decode_str(s):
    # Strip spaces, tabs and newlines, then return a UTF-8 encoded byte string.
    return s.replace(' ','').replace('\t','').replace('\n','').encode('utf-8')

def get_title_price(url):
    # Fetch one catalogue page through a verified proxy and write each
    # course title/price pair into the proxylist table.
    proxy_url = proxy_spider()
    r = requests.get(url=url,proxies=proxy_url,timeout=20)
    content = r.content
    soup = bs(content,'lxml')
    content_titles = soup.find_all(name='a',attrs={'class':'link-dark'})
    content_prices = soup.find_all(name='span',attrs={'class':'price'})
    for title,price in zip(content_titles,content_prices):
        tmp_title = decode_str(str(title.string))
        tmp_price = decode_str(str(price.string))
        print tmp_title,tmp_price
        # Parameterised query avoids breaking on quotes in course titles.
        sql = "INSERT INTO proxylist(title,price) VALUES (%s,%s)"
        cur.execute(sql,(tmp_title,tmp_price))
    conn.commit()   # PyMySQL does not autocommit by default
if __name__=='__main__':
    #proxy_spider()
    url_spider = 'http://edu.aqniu.com/course/explore?page='
    for i in range(12):
        spider_url = url_spider+str(i)
        get_title_price(spider_url)
        time.sleep(random.random()*1)
    cur.close()
    conn.close()

That's all for this article; I hope it is of some help for your study or work.
