Crawler Data Ingestion in Practice: Defeating Anti-Crawler Measures
Because a crawler has to fetch a large number of pages, its IP address may get banned. The usual countermeasures are rotating the User-Agent, going through proxies, and spoofing the "real" client IP with headers such as X-Forwarded-For. Spoofing via X-Forwarded-For, Client-IP and REMOTE_ADDR-style headers can be tested with Burp's Intruder module, simply generating four random payloads (one per octet). This article focuses on the User-Agent and proxy-IP techniques.
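The header-spoofing idea is also easy to reproduce outside Burp. Below is a minimal sketch, assuming the target application actually trusts client-supplied headers; the fake_ip() helper and the example URL are mine, not part of the original script:

import random
import requests

def fake_ip():
    # Four random octets, like four random Intruder payload positions in Burp.
    return '.'.join(str(random.randint(1, 254)) for _ in range(4))

ip = fake_ip()
headers = {
    'X-Forwarded-For': ip,
    'Client-IP': ip,
}
r = requests.get('http://example.com/', headers=headers, timeout=5)
print(r.status_code)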
0x00 Setting the Encodings
First, set the default encodings at three levels: the script itself, the database, and the database connection.
File encoding (Python 2):

reload(sys)
sys.setdefaultencoding('utf-8')

Database encoding:

CREATE DATABASE `proxy` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

Connection (data-transfer) encoding:

charset='utf8mb4'
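Note the slight mismatch above: the database is created as utf8 (3-byte) while the connection uses utf8mb4; for full 4-byte characters (emoji) the database and tables would need to be utf8mb4 as well. As a quick sanity check that the layers agree, here is a minimal round-trip sketch; it assumes the credentials from section 0x02 and creates a throw-away enc_check table of my own naming:

# -*- coding: utf-8 -*-
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       passwd='123456', db='proxy', charset='utf8mb4')
cur = conn.cursor()
# Throw-away table used only for this check.
cur.execute("CREATE TABLE IF NOT EXISTS enc_check (val VARCHAR(64)) "
            "DEFAULT CHARACTER SET utf8mb4")
cur.execute("INSERT INTO enc_check (val) VALUES (%s)", (u'编码测试',))
conn.commit()
cur.execute("SELECT val FROM enc_check")
print(cur.fetchone()[0])  # should print the same string back, un-mangled
cur.close()
conn.close()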
0x01 Setting the User-Agent
config = {
    'NUM': 10,
    'timeout': 5,
    'USER_AGENTS': [
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]
}
headers = {'User-Agent': random.choice(config['USER_AGENTS'])}
These could be loaded from a full User-Agent dictionary file; for brevity I just pasted in a few strings at random. Note that random.choice above runs only once, at import time, so every request ends up reusing the same User-Agent; a per-request rotation helper is sketched below.
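A minimal sketch of per-request rotation, reusing the config dict above (the random_headers() helper name is mine):

import random
import requests

def random_headers():
    # Pick a fresh User-Agent for each request instead of once at module load.
    return {'User-Agent': random.choice(config['USER_AGENTS'])}

r = requests.get('http://example.com/', headers=random_headers(),
                 timeout=config['timeout'])
print(r.status_code)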
0x02 Setting Up the Database Interaction
For database access I use Python's PyMySQL module, which works on both Python 2 and Python 3.
conn = pymysql.connect(
    host='127.0.0.1',
    port=3306,
    user='root',
    passwd='123456',
    db='proxy',
    charset='utf8mb4',
)
cur = conn.cursor()
sql = "INSERT INTO proxylist(title,price) VALUES ('test1','100')"
cur.execute(sql)
conn.commit()  # PyMySQL does not autocommit by default
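String-formatted SQL like the INSERT in the full script below breaks as soon as a title contains a quote character, and it invites SQL injection. A safer variant, sketched here against the same conn and cur, is a parameterized query:

sql = "INSERT INTO proxylist(title, price) VALUES (%s, %s)"
cur.execute(sql, ('test2', '200'))  # the driver handles quoting and escaping
conn.commit()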
0x03 Fetching Proxies
For proxies I use the free Xici proxy list at xicidaili.com (no budget for paid ones), and parse the listing pages with requests and BeautifulSoup.
r = requests.get(url=url_xichi, headers=headers)
soup = bs(r.content, 'lxml')
datas = soup.find_all(name='tr', attrs={'class': re.compile('(odd)|()')})
for data in datas:
    proxys = data.find_all(name='td')
    ip = str(proxys[1].string)
    port = str(proxys[2].string)
    type = str(proxys[5].string).lower()
    avail_proxy = proxy_check(ip, port, type)
    if avail_proxy is not None:
        return avail_proxy
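The loop above returns as soon as one proxy validates. If you would rather build a pool to rotate through, a minimal variant could keep every proxy that passes the check. This sketch reuses headers from section 0x01 and the proxy_check() helper from section 0x04 below; the proxy_pool() name and the page count are mine:

def proxy_pool(max_pages=3):
    # Walk the first few listing pages and keep every proxy that validates.
    pool = []
    for page in range(1, max_pages + 1):
        r = requests.get(url='http://www.xicidaili.com/nn/%d' % page, headers=headers)
        soup = bs(r.content, 'lxml')
        for row in soup.find_all(name='tr', attrs={'class': re.compile('(odd)|()')}):
            cells = row.find_all(name='td')
            ip, port = str(cells[1].string), str(cells[2].string)
            proto = str(cells[5].string).lower()
            checked = proxy_check(ip, port, proto)
            if checked is not None:
                pool.append(checked)
    return pool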
0x04 Validating Proxies
To check that a proxy is alive, I use the IP-lookup API from the ChinaZ webmaster tools (站长工具): fetch it through the proxy and verify that the IP it reports back is the proxy's own IP.
try:
    r = requests.get(url=url_check, proxies=proxylist, timeout=5)
    find_ip = re.findall(r'\'(.*?)\'', r.text)[0]
    if ip == find_ip:
        return proxylist
except Exception:
    # Dead or too-slow proxy: skip it and try the next one.
    pass
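For reference, requests expects the proxies argument keyed by scheme, so proxylist ends up looking like {'http': '1.2.3.4:8080'}. A self-contained sketch of the same idea, using httpbin.org/ip as a stand-in echo endpoint instead of the ChinaZ API (my substitution, not the article's):

import requests

def is_alive(ip, port, proto='http'):
    # Ask an IP-echo service which address it sees; if it matches the proxy's
    # IP, the proxy is actually forwarding our traffic.
    proxies = {proto: '%s:%s' % (ip, port)}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
        return ip in r.json().get('origin', '')
    except requests.RequestException:
        return False

print(is_alive('1.2.3.4', '8080'))  # hypothetical proxy, prints False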
0x05 Writing Data to the Database
As the crawl target I use the course catalog of Gooann's online academy (edu.aqniu.com), grabbing each course title and its price.
(Figure: the crawled course titles and prices; screenshot omitted.)
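The script below inserts into a proxylist table whose schema the article never shows; a schema matching the INSERT statements would look roughly like this (my assumption), created once through the same cursor:

cur.execute("""
    CREATE TABLE IF NOT EXISTS proxylist (
        id    INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        price VARCHAR(32)
    ) DEFAULT CHARACTER SET utf8mb4
""")
conn.commit()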
Appendix: full source code
#!/usr/bin/python
# -*- coding: UTF-8 -*-
'''
@Author:W2n1ck
@Index:http://www.w2n1ck.com/
'''
import random
import re
import requests
import pymysql
import sys
import time

reload(sys)
sys.setdefaultencoding('utf-8')

from bs4 import BeautifulSoup as bs

config = {
    'NUM': 10,
    'timeout': 5,
    'USER_AGENTS': [
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]
}
headers = {'User-Agent': random.choice(config['USER_AGENTS'])}

conn = pymysql.connect(
    host='127.0.0.1',
    port=3306,
    user='root',
    passwd='123456',
    db='proxy',
    charset='utf8mb4',
)
cur = conn.cursor()


def proxy_spider():
    # Scrape the Xici free-proxy list and return the first proxy that validates.
    url_xichi = 'http://www.xicidaili.com/nn/'
    r = requests.get(url=url_xichi, headers=headers)
    soup = bs(r.content, 'lxml')
    datas = soup.find_all(name='tr', attrs={'class': re.compile('(odd)|()')})
    for data in datas:
        proxys = data.find_all(name='td')
        ip = str(proxys[1].string)
        port = str(proxys[2].string)
        type = str(proxys[5].string).lower()
        avail_proxy = proxy_check(ip, port, type)
        if avail_proxy is not None:
            return avail_proxy


def proxy_check(ip, port, type):
    # Fetch the ChinaZ IP-lookup page through the proxy; if the IP it reports
    # matches the proxy's IP, the proxy is usable.
    url_check = 'http://ip.chinaz.com/getip.aspx'
    proxylist = {}
    proxylist[type] = '%s:%s' % (ip, port)
    try:
        r = requests.get(url=url_check, proxies=proxylist, timeout=5)
        find_ip = re.findall(r'\'(.*?)\'', r.text)[0]
        if ip == find_ip:
            return proxylist
    except Exception:
        # Dead or too-slow proxy: ignore it and move on.
        pass


def decode_str(s):
    # Strip the whitespace the page template leaves around titles and prices.
    return s.replace(' ', '').replace('\t', '').replace('\n', '').encode('utf-8')


def get_title_price(url):
    # Crawl one catalog page through a validated proxy and store title/price pairs.
    proxy_url = proxy_spider()
    r = requests.get(url=url, proxies=proxy_url, timeout=20)
    soup = bs(r.content, 'lxml')
    content_titles = soup.find_all(name='a', attrs={'class': 'link-dark'})
    content_prices = soup.find_all(name='span', attrs={'class': 'price'})
    for title, price in zip(content_titles, content_prices):
        tmp_title = decode_str(str(title.string))
        tmp_price = decode_str(str(price.string))
        print tmp_title, tmp_price
        # Note: string-formatted SQL will break on titles containing quotes;
        # see the parameterized variant in section 0x02.
        sql = "INSERT INTO proxylist(title,price) VALUES ('%s','%s')" % (tmp_title, tmp_price)
        cur.execute(sql)
    conn.commit()  # PyMySQL does not autocommit by default


if __name__ == '__main__':
    url_spider = 'http://edu.aqniu.com/course/explore?page='
    for i in range(12):
        spider_url = url_spider + str(i)
        get_title_price(spider_url)
        time.sleep(random.random() * 1)  # small random delay between pages
    cur.close()
    conn.close()
That is all for this article; I hope it is of some help for your study or work.