The web contains lots of data, and the ability to extract the information you need from it is, without a doubt, a useful one, even a necessary one. Of course, there are plenty of datasets already available for download on places like Kaggle, but in many cases you won’t find the exact data you need for your particular problem. Chances are, though, that you’ll find what you need somewhere on the web, and you’ll have to extract it from there.
Web scraping is the process of doing exactly this: extracting data from web pages. In this article, we’ll see how to do web scraping in Python. There are several libraries you can use for this task; among them, here we will use Beautiful Soup 4. This library takes care of extracting data from an HTML document, not of downloading it. For downloading web pages, we need another library: requests.
So, we’ll need 2 packages:
- requests — for downloading the HTML code from a given URL
- beautifulsoup4 — for extracting data from that HTML string
Installing the libraries
Now, let’s start by installing the required packages. Open a terminal window and type:
python -m pip install requests beautifulsoup4
…or, if you’re using a conda environment:
conda install requests beautifulsoup4
Now, try to run the following:
import requests
from bs4 import BeautifulSoup
If you don’t get any error, then the packages are installed successfully.
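As an extra sanity check, you can print the installed versions; both packages expose a __version__ attribute:
import requests
import bs4  # the package that provides BeautifulSoup

print(requests.__version__)
print(bs4.__version__)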
Using requests & Beautiful Soup to extract data
From the requests package we will use the get() function to download a web page from a given URL:
requests.get(url, params=None, **kwargs)
Where the parameters are:
- url — the URL of the desired web page
- params — an optional dictionary, list of tuples or bytes to send in the query string
- **kwargs — optional arguments that request takes
This function returns an object of type requests.Response. Among this object's attributes and methods, we are most interested in the .content attribute, which holds the raw HTML of the target web page (as bytes; the .text attribute gives the same content decoded to a string).
Example:
html_string = requests.get("http://www.example.com").content
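Here is a slightly fuller sketch that also illustrates the params argument and a status check (the query parameter is made up; example.com simply ignores it):
import requests

response = requests.get("http://www.example.com", params={"q": "test"})
print(response.status_code)  # 200 means the request succeeded
print(response.url)          # the final URL, including the query string
html_string = response.content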
After we’ve downloaded the HTML of the target web page, we have to use the BeautifulSoup() constructor to parse it and get a BeautifulSoup object that we can use to navigate the document tree and extract the data we need.
soup = BeautifulSoup(markup_string, parser)
Where:
- markup_string — the string of our web page
- parser — a string with the name of the parser to be used; here we will use Python’s built-in parser: “html.parser”
Note that we named the first parameter “markup_string” rather than “html_string” because BeautifulSoup can parse other markup languages as well, not just HTML; we only need to specify an appropriate parser. For example, we can parse XML by passing “xml” as the parser.
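As a quick illustration, here is a minimal sketch parsing both HTML and XML (note that the “xml” parser is backed by the lxml package, which must be installed separately; the tag names in the XML snippet are made up):
from bs4 import BeautifulSoup

html_soup = BeautifulSoup("<p>Hello</p>", "html.parser")
xml_soup = BeautifulSoup("<note><to>Ana</to></note>", "xml")  # requires lxml

print(html_soup.p.contents[0])  # Hello
print(xml_soup.to.contents[0])  # Ana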
A BeautifulSoup object has several methods and attributes that we can use to navigate within the parsed document and extract data from it. The most used method is .find_all():
soup.find_all(name, attrs, recursive, string, limit, **kwargs)
- name — name of the tag; e.g. “a”, “div”, “img”
- attrs — a dictionary with the tag’s attributes; e.g. {"class": "nav", "href": "#menuitem"}
- recursive — boolean; if false, only direct children are considered; if true (the default), all descendants are examined in the search
- string — used to search for strings in the element’s content
- limit — limit the search to only this number of found elements
Example:
soup.find_all("a", attrs={"class": "nav", "data-foo": "value"})
The line above returns a list with all “a” elements that also have the specified attributes.
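To see the other parameters in action, here is a small self-contained sketch (the HTML snippet is made up for illustration):
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<ul>
  <li><a href="#a">First</a></li>
  <li><a href="#b">Second</a></li>
  <li><a href="#c">Third</a></li>
</ul>
""", "html.parser")

print(soup.find_all("a", limit=2))           # only the first two links
print(soup.find_all("a", string="Third"))    # links whose text is exactly "Third"
print(soup.find_all("li", recursive=False))  # [], the li tags are not direct children of the document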
HTML attributes whose names cannot be confused with this method’s parameters or with Python’s keywords can be passed directly as function parameters, without putting them inside the attrs dictionary. The HTML class attribute can also be used this way, but because class is a Python keyword, you write class_="…" instead of class="…".
Example:
soup.find_all("a", class_="nav")
Because this method is the most used one, it has a shortcut: calling the BeautifulSoup object directly has the same effect as calling the .find_all() method.
Example:
soup("a", class_="nav")
The .find() method is like .find_all(), but it stops the search after it finds the first element, which is then returned. It is roughly equivalent to .find_all(..., limit=1), but instead of returning a list, it returns a single element (or None if nothing matches).
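A quick sketch of the difference between the two methods, on a made-up snippet:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")
print(soup.find_all("p"))  # [<p>one</p>, <p>two</p>], always a list
print(soup.find("p"))      # <p>one</p>, just the first match
print(soup.find("div"))    # None, nothing matched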
The .contents attribute of a BeautifulSoup object is a list of all its child elements. If the current element contains no nested HTML elements, then .contents[0] will be just the text inside it. So, after we get the element that contains the data we need using the .find_all() or .find() methods, all we have to do to get the data inside it is access .contents[0].
Example:
soup = BeautifulSoup('''
<div>
<span>5</span>
<span class="views">100</span>
</div>
''', "html.parser")
views = soup.find("span", class_="views").contents[0]
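As an aside, an element also has a .string attribute and a .get_text() method for pulling out its text; on a simple element like the one above they give the same result as .contents[0]:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="views">100</span>', "html.parser")
print(soup.find("span", class_="views").string)      # 100
print(soup.find("span", class_="views").get_text())  # 100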
What if the piece of data we need is not inside the element’s content, but is the value of one of its attributes? We can access an element’s attribute value as if the element were a dictionary:
element['attr_name']
Example:
soup = BeautifulSoup('''
<div>
<img src="./img1.png">
</div>
''', "html.parser")
img_source = soup.find("img")['src']
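One caveat: indexing with ['src'] raises a KeyError if the attribute is missing. Like a dictionary, an element also has a .get() method that returns None instead, which is safer when some tags may lack the attribute. Continuing the example above:
img = soup.find("img")
img_source = img.get("src")  # "./img1.png", or None if there were no src attribute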
Web scraping example: get the top 10 Linux distros
Now, let’s see a simple web scraping example using the concepts above. We will extract a list of the top 10 most popular Linux distros from the DistroWatch website. DistroWatch ( https://distrowatch.com/ ) is a website featuring news about Linux distros and open-source software that runs on Linux. On its right side, the site shows a ranking of the most popular distros; from this ranking we will extract the first 10.
Firstly, we will download the web page and construct a BeautifulSoup object from it:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
requests.get("https://distrowatch.com/").content,
"html.parser")
Then, we need to find out how to identify the data we want inside the HTML code. For that, we will use Chrome’s developer tools: right-click somewhere on the web page and then click “Inspect”, or press “Ctrl+Shift+I”, to open them.
If you click on the little arrow in the top-left corner of the developer tools and then click on an element in the web page, the dev tools window highlights the piece of HTML associated with that element. You can then use that information to tell Beautiful Soup where to find the element.
In our example, we can see that the ranking is structured as an HTML table and that each distro name sits inside a td element with class “phr2”. Inside that td element is a link containing the text we want to extract (the distro’s name). That’s what the next few lines of code do:
top_ten_distros = []
distro_tds = soup("td", class_="phr2", limit=10)
for td in distro_tds:
    top_ten_distros.append(td.find("a").contents[0])
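To see what we got, a simple print loop will do (the actual names depend on DistroWatch’s rankings at the moment you run the script, so they are not reproduced here):
for i, distro in enumerate(top_ten_distros, start=1):
    print(i, distro)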
And that’s it: top_ten_distros now holds the names of the ten most popular distros.
I hope you found this article useful, and thanks for reading!