Python Beginner Practice: How to Scrape Ziroom Data

Category: Python · Published: 5 years ago

First, here is the code repository:

https://github.com/liangyuqi/...

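I. Introduction

Use Python to scrape listing information from Ziroom (Easter egg at the end).

This is aimed at people who are just getting started with Python. I am a beginner too, so the code may not be very elegant.

For the analysis of the scraping approach, see Part V.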

II. Environment

Python

python --version

(bundled with macOS)

brew install python

pip

pip --version

pip is Python's package manager. It finds, downloads, installs, and uninstalls Python packages.

curl https://bootstrap.pypa.io/get... -o get-pip.py

sudo python get-pip.py

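III. Install dependencies

pip freeze > package.txt

sudo pip install -r package.txt

Note: pip freeze writes the packages installed in the current environment to package.txt, overwriting the file if it exists. Assuming the repository already ships package.txt, the second command alone is enough to install the dependencies.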

IV. Run

cd index

chmod a+x ziru_room.py

python ziru_room.py

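V. Approach

1. Anti-anti-crawling

Most companies have a security team that guards against large-scale credential stuffing and bandwidth hogging, so a crawler is likely to be intercepted and traced, with a lawyer's letter to follow.

That is why I think the most important part of a crawler system is anti-anti-crawling.

Let's first look at how simple anti-crawling measures usually work.

They look at three things: the request headers, the user's behavior, and the site's directory structure and data-loading method.

For headers, the check is mainly on the User-Agent: the same value showing up over and over is suspicious. The userAgent property is a read-only string that holds the value of the User-Agent header the browser sends with HTTP requests. Put simply, it is how the browser identifies itself to the server.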
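To make the User-Agent point concrete, here is a minimal sketch using only the Python 3 standard library (the article's own code uses requests instead). The URL and the build_request helper are illustrative assumptions, not part of the author's repository, and no request is actually sent.

```python
from urllib.request import Request

# Hypothetical helper: attach a browser-like User-Agent to a request object.
def build_request(url, user_agent):
    return Request(url, headers={"User-Agent": user_agent})

ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) "
      "Gecko/20100101 Firefox/34.0")
req = build_request("http://example.com/", ua)

# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

Whatever string ends up in that header is all the server sees of the client's identity, which is why a crawler that leaves the library default in place is easy to spot.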

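User behavior is tracked mainly by IP. An IP needs little explanation: it works much like an ID number. Our requests should therefore come from changing IPs, because repeated visits from the same IP can get it blacklisted and can be traced back to where your server is located.

The third approach is more advanced, and I do not demonstrate it here. In the first two, the crawler pretends to be a browser while reading data. In the third, it drives a real browser without a user interface (a headless browser) and simulates user actions: filling in forms, clicking buttons, scrolling the page. At that point other techniques come into play, such as dealing with click-to-select CAPTCHAs (as on 12306) or slider CAPTCHAs.

With the approach sorted out, let's implement it. The goal is a dynamic pool of IPs and User-Agents, so that every request appears to come from a different source.

Step 1: Scrape a site that publishes free proxy IPs, then test whether each published IP actually works. If it does, add it to our IP pool. See ziru_room.py.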

# Proxies that passed the availability test
usefulIp = []

# Page that lists free proxy IPs
uriGetIp = 'http://www.xicidaili.com/wt/'

# Endpoint used to test whether a proxy works
testGetIp = 'http://icanhazip.com/'

usefulIp = getUsefulIPList(uriGetIp, testGetIp, userAgent)

'''
Get the list of usable proxy IPs
'''
def getUsefulIPList(uriGetIp, testGetIp, userAgent):
    # All scraped proxies
    allProxys = []

    # Proxies that passed the availability test
    usefulIp = []
    ipList = requests.get(
        uriGetIp, headers={'User-Agent': random.choice(userAgent)})

    ipData = bs4.BeautifulSoup(ipList.text, "html.parser")

    # Columns 2 and 3 of the #ip_list table hold the IP and the port
    ips = ipData.select("#ip_list > tr > td:nth-of-type(2)")

    ports = ipData.select("#ip_list > tr > td:nth-of-type(3)")

    for ip, port in zip(ips, ports):
        proxy = ip.get_text().strip()+':'+port.get_text().strip()
        allProxys.append(proxy)

    print('Initializing the proxy pool, please wait...')

    process.max_steps = len(allProxys)

    process.process_bar = process.ShowProcess(process.max_steps)

    # Keep only the proxies that respond
    for proxy in allProxys:
        process.process_bar.show_process()
        try:
            theIp = requests.get(testGetIp,  headers={'User-Agent': random.choice(userAgent)}, proxies={
                'http': proxy}, timeout=1, allow_redirects=False)
        except Exception:
            # Timeout, connection error, HTTP error or anything else: skip
            # this proxy (unlike a bare except, Ctrl-C still works)
            continue
        else:
            # icanhazip.com returns only the caller's IP, so a valid reply is short
            if (theIp.status_code == 200 and len(theIp.text) < 20):
                usefulIp.append(proxy)

    print('Usable proxy pool: '+','.join(usefulIp))
    return usefulIp
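The length check on the test response is a heuristic: icanhazip.com replies with nothing but the caller's IP address, so a short body suggests a genuine answer and not a proxy's error page. A stricter variant, sketched here for Python 3, parses the body as an IP address. The looks_like_ip helper is hypothetical and not part of the author's code.

```python
import ipaddress

def looks_like_ip(body):
    """Return True if the response body is a single valid IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(body.strip())
        return True
    except ValueError:
        return False

print(looks_like_ip("203.0.113.7\n"))              # True
print(looks_like_ip("<html>Proxy error</html>"))   # False
```

In getUsefulIPList this could replace the len(theIp.text) < 20 condition, at the cost of rejecting any test endpoint that returns more than the bare address.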

Step 2: Build the User-Agent pool

userAgent = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
             'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
             'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10'
             ]

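Unlike proxy IPs, User-Agent strings do not go dead all the time, so hard-coding them is fine.

2. Scraping the data

With the raw materials ready, we can start scraping. The code uses random.choice() to pick a random entry from the IP pool and from the User-Agent pool, and builds the GET request from them. See ziru_room.py.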
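The random selection can be isolated into a small helper, which makes it easy to check without touching the network. This is a sketch under assumptions: pick_identity is a hypothetical name, the User-Agent strings are placeholders, and the proxy addresses come from the reserved documentation ranges.

```python
import random

def pick_identity(user_agents, proxies, rng=random):
    """Build keyword arguments for one request with a random UA and proxy."""
    if not user_agents or not proxies:
        raise ValueError("both pools must be non-empty")
    return {
        "headers": {"User-Agent": rng.choice(user_agents)},
        "proxies": {"http": rng.choice(proxies)},
    }

user_agents = ["UA-one", "UA-two"]                     # placeholder values
proxies = ["203.0.113.7:8080", "198.51.100.9:3128"]    # documentation-range IPs

kwargs = pick_identity(user_agents, proxies)
print(kwargs)
# Intended use with the article's library: requests.get(url, timeout=(3, 7), **kwargs)
```

Raising on an empty pool is a deliberate choice here: random.choice on an empty list fails with IndexError, and an explicit message makes a drained proxy pool easier to diagnose.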

def computedData(usefulIp, userAgent, ipIndex=0):
    # GET request with a random User-Agent and a random proxy from the pools
    try:
        data = requests.get("http://www.ziroom.com/z/nl/z3-r3-o2-s5%E5%8F%B7%E7%BA%BF-t%E5%8C%97%E8%8B%91%E8%B7%AF%E5%8C%97.html",
                            headers={'User-Agent': random.choice(userAgent)}, proxies={'http': random.choice(usefulIp)}, timeout=(3, 7))
    except requests.exceptions.RequestException:
        print("Error: request failed")
        # Retry immediately; another proxy will probably be picked
        computedData(usefulIp, userAgent)
        return

    roomDate = bs4.BeautifulSoup(data.text, "html.parser")
    # Title
    title = roomDate.select("#houseList > li > div.txt > h3 > a")
    # Location (gone after the site redesign)
    # place = roomDate.select("#houseList > li > div.txt > h4 > a")
    # Distance
    distance = roomDate.select(
        "#houseList > li > div.txt > div > p:nth-of-type(2) > span")
    # Price
    price = roomDate.select("#houseList > li > div.priceDetail > p.price")
    # Area
    area = roomDate.select(
        "#houseList > li > div.txt > div > p:nth-of-type(1) > span:nth-of-type(1)")
    # Floor
    floor = roomDate.select(
        "#houseList > li > div.txt > div > p:nth-of-type(1) > span:nth-of-type(2)")
    # Room layout
    room = roomDate.select(
        "#houseList > li > div.txt > div > p:nth-of-type(1) > span:nth-of-type(3)")

    print('Ziroom data for Beijing:')

    # Append to the output file; "with" closes it even if a write fails
    with open('../output/output.txt', 'a') as fhandle:
        # Header line: "Beijing <timestamp> Ziroom data follows"
        fhandle.write('北京市'+time.strftime("%Y-%m-%d %H:%M:%S",
                                          time.localtime()) + '自如数据如下'+'\n')

        for t, p, a, f, r, d in zip(title, price, area, floor, room, distance):
            # Labels: name, distance, price, area, floor, room size
            fields = [
                ("名称", t.get_text().strip()),
                # ("地段", place.get_text().strip()),
                ("距离", d.get_text().strip()),
                ("价格", p.get_text().replace(' ', '').replace('\n', '')),
                ("面积", a.get_text().strip()),
                ("楼层", f.get_text().strip()),
                ("房间大小", r.get_text().strip()),
            ]
            last_data = dict(fields)

            # One line per listing, each field written as "label：value"
            fhandle.write(''.join(k + "：" + v for k, v in fields) + '\n')

            # ensure_ascii=False prints the Chinese text instead of \uXXXX escapes
            print(json.dumps(last_data, ensure_ascii=False))

        fhandle.write("************************************************"+'\n')

    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
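The json.dumps call that prints each listing is where Chinese dictionary content tends to come out as escape sequences. A minimal Python 3 sketch of the difference follows; the record values are made up for illustration and are not real scraped data.

```python
import json

record = {"名称": "自如友家", "价格": "￥2390"}  # made-up sample values

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters

print(escaped)
print(readable)
```

Both strings decode back to the same dictionary, so the choice only affects how readable the log output is.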

3. Other parts

Scraping proxy IPs and testing them to build the pool takes a while, so I added a visual progress display while waiting. See process.py.
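The contents of process.py are not reproduced in this article. As a rough idea of what a console progress display involves, here is a minimal Python 3 sketch. The ProgressBar class and its methods are assumptions for illustration and are not the author's ShowProcess implementation.

```python
import sys

class ProgressBar:
    """Minimal text progress bar; an assumed stand-in, not the author's process.py."""

    def __init__(self, max_steps, width=30):
        self.max_steps = max(1, max_steps)
        self.width = width
        self.step = 0

    def render(self):
        done = self.width * self.step // self.max_steps
        percent = 100 * self.step // self.max_steps
        return "[" + "#" * done + "-" * (self.width - done) + "] " + str(percent) + "%"

    def advance(self):
        self.step = min(self.step + 1, self.max_steps)
        # "\r" returns to the start of the line so the bar redraws in place.
        sys.stdout.write("\r" + self.render())
        sys.stdout.flush()
        if self.step == self.max_steps:
            sys.stdout.write("\n")

bar = ProgressBar(max_steps=4, width=8)
for _ in range(4):
    bar.advance()
```

Calling advance() once per tested proxy would mirror how show_process() is called inside the filtering loop.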

Automated scraping should show some manners, so a delay between runs is needed. See ziru_room.py.

while True:
    computedData(usefulIp, userAgent)
    time.sleep(60)
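A fixed 60-second interval works. A common refinement, which is not in the author's code, is to randomize the interval so that requests do not arrive on an exact schedule. The sketch below assumes Python 3; polite_delay and run_forever are hypothetical names.

```python
import random
import time

def polite_delay(base_seconds=60.0, jitter=0.25, rng=random):
    """Return a delay within +/- jitter (as a fraction) of base_seconds."""
    low = base_seconds * (1.0 - jitter)
    high = base_seconds * (1.0 + jitter)
    return rng.uniform(low, high)

def run_forever(task, sleep=time.sleep):
    """Call task repeatedly, pausing a randomized interval between calls."""
    while True:
        task()
        sleep(polite_delay())

print(round(polite_delay(), 1))
```

With the defaults, each pause falls between 45 and 75 seconds, so the average request rate stays the same as with the fixed delay.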

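A few other impressions of Python: it is very concise to write, but line breaks, indentation, and Chinese Unicode text inside dict objects cost me a lot of time. I have not yet worked out its strengths and weaknesses compared with Node, so feel free to discuss in the comments.

Writing this up took effort. The code is rough and will be improved later. Please give the repository a star, thanks.

https://github.com/liangyuqi/...

Finally, the Easter egg: in the end this guy tracked me down through the QQ number on my GitHub profile, so I would say the anti-anti-crawling was not a complete failure. The scraped data is nothing critical either (tongue in cheek). This is only meant for beginners to learn and practice with.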
