Web scraping with Python & BeautifulSoup


Image by James Osborne from Pixabay

The web contains lots of data. The ability to extract the information you need from it is, without a doubt, a useful skill, even a necessary one. Of course, there are plenty of datasets already available for download in places like Kaggle, but in many cases you won't find the exact data you need for your particular problem. Chances are, though, that you'll find what you need somewhere on the web, and you'll have to extract it from there.

Web scraping is the process of doing exactly that: extracting data from web pages. In this article, we'll see how to do web scraping in Python. There are several libraries you can use for this task; among them, here we will use Beautiful Soup 4. This library takes care of extracting data from an HTML document, not of downloading it. For downloading web pages, we need another library: requests.

So, we'll need two packages:

  • requests — for downloading the HTML code from a given URL
  • beautifulsoup4 — for extracting data from that HTML string

Installing the libraries

Now, let’s start by installing the required packages. Open a terminal window and type:

python -m pip install requests beautifulsoup4

…or, if you’re using a conda environment:

conda install requests beautifulsoup4

Now, try to run the following:

import requests
from bs4 import BeautifulSoup

If you don’t get any error, then the packages are installed successfully.
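If you want an explicit confirmation, you can also print the installed versions; a minimal check (assuming both packages expose a __version__ attribute, which current releases do):

import requests
import bs4

# confirm both packages are importable and show which versions are installed
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)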

Using requests & Beautiful Soup to extract data

From the requests package we will use the get() function to download a web page from a given URL:

requests.get(url, params=None, **kwargs)

Where the parameters are:

  • url — the URL of the desired web page
  • params — an optional dictionary, list of tuples, or bytes to send in the query string
  • **kwargs — optional arguments that request takes

This function returns an object of type requests.Response. Among this object's attributes and methods, we are most interested in the .content attribute, which holds the HTML of the target web page (as raw bytes; its sibling .text holds it decoded as a string).

Example:

html_string = requests.get("http://www.example.com").content
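It is also a good idea to check that the request actually succeeded before parsing anything. A minimal sketch using the standard requests API:

response = requests.get("http://www.example.com")
response.raise_for_status()  # raises an HTTPError for 4xx/5xx status codes
html_string = response.content  # the raw bytes of the page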

After we've got the HTML of the target web page, we have to use the BeautifulSoup() constructor to parse it and get a BeautifulSoup object that we can use to navigate the document tree and extract the data we need.

soup = BeautifulSoup(markup_string, parser)

Where:

  • markup_string — the string of our web page
  • parser — a string with the name of the parser to be used; here we will use Python's built-in parser: "html.parser"

Note that the first parameter is named markup_string instead of html_string because BeautifulSoup can be used with other markup languages as well, not just HTML; we only need to specify an appropriate parser, e.g. we can parse XML by passing "xml" as the parser.
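For instance, a small XML sketch (note that Beautiful Soup's "xml" mode relies on the third-party lxml package, so this assumes lxml is installed):

xml_soup = BeautifulSoup("<items><item>first</item></items>", "xml")
print(xml_soup.find("item").contents[0])  # prints "first"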

A BeautifulSoup object has several methods and attributes that we can use to navigate within the parsed document and extract data from it.

The most used method is .find_all():

soup.find_all(name, attrs, recursive, string, limit, **kwargs)

Where:

  • name — the name of the tag; e.g. "a", "div", "img"
  • attrs — a dictionary with the tag's attributes; e.g. {"class": "nav", "href": "#menuitem"}
  • recursive — boolean; if false, only direct children are considered; if true (the default), all descendants are examined in the search
  • string — used to search for strings in an element's content
  • limit — stops the search after this number of elements has been found

Example:

soup.find_all("a", attrs={"class": "nav", "data-foo": "value"})

The line above returns a list with all “a” elements that also have the specified attributes.
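The string and limit parameters can be combined with the others in the same way; for example, to grab at most three links whose visible text is exactly "Next" (hypothetical markup):

next_links = soup.find_all("a", string="Next", limit=3)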

HTML attributes that don't clash with this method's parameter names or with Python keywords (like "class") can be passed directly as keyword arguments, without the need to put them inside the attrs dictionary. The HTML class attribute can also be used this way, but instead of class="…" you write class_="…".

Example:

soup.find_all("a", class_="nav")

Because this method is the most used one, it has a shortcut: calling the BeautifulSoup object directly has the same effect as calling the .find_all() method.

Example:

soup("a", class_="nav")

The .find() method is like .find_all(), but it stops the search after it finds the first matching element, which is then returned. It is roughly equivalent to .find_all(..., limit=1), except that instead of returning a list, it returns a single element.
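One detail worth guarding against: when nothing matches, .find() returns None instead of raising an error, so check for it before accessing the result. A small sketch:

element = soup.find("span", class_="views")
if element is not None:
    print(element.contents[0])
else:
    print("no matching element found")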

The .contents attribute of a BeautifulSoup object is a list of all its child elements. If the current element contains no nested HTML elements, then .contents[0] will be just the text inside it. So, after we've located the element holding the data we need with .find_all() or .find(), all we have to do to get the data inside it is access .contents[0].

Example:

soup = BeautifulSoup('''
<div>
<span class="comments">5</span>
<span class="views">100</span>
</div>
''', "html.parser")
views = soup.find("span", class_="views").contents[0]  # "100"
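As a side note, if an element does contain nested tags and you just want all the text inside it, Beautiful Soup's .get_text() method concatenates every string it contains:

all_text = soup.find("div").get_text()  # "\n5\n100\n" for the markup above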

What if we need a piece of data that is not inside the element's content, but is the value of one of its attributes? We can access an element's attribute value with dictionary-like indexing:

element['attr_name']

Example:

soup = BeautifulSoup('''
<div>
<img src="./img1.png">
</div>
''', "html.parser")
img_source = soup.find("img")['src']
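Indexing with square brackets raises a KeyError if the attribute is missing; Tag objects also offer a dictionary-style .get() method that returns None (or a default value) instead:

img = soup.find("img")
img_source = img.get("src")          # "./img1.png"
img_alt = img.get("alt", "no alt")   # attribute missing, falls back to the default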

Web scraping example: get the top 10 Linux distros

Now, let's see a simple web scraping example using the concepts above. We will extract a list of the top 10 most popular Linux distros from the DistroWatch website. DistroWatch (https://distrowatch.com/) is a website featuring news about Linux distros and open-source software that runs on Linux. On its right side, the site displays a ranking of the most popular Linux distros, from which we will extract the first 10.

Firstly, we will download the web page and construct a BeautifulSoup object from it:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://distrowatch.com/").content,
    "html.parser")

Then, we need to find out how to identify the data we want inside the HTML code. For that, we will use Chrome's developer tools. Right-click somewhere on the web page and then click "Inspect", or press "Ctrl+Shift+I", to open Chrome's developer tools. It should look like this:

[Screenshot: Chrome developer tools open on the DistroWatch home page]

Then, if you click the little arrow in the top-left corner of the developer tools and then click some element on the web page, the dev tools window will show you the piece of HTML associated with that element. After that, you can use the information from the dev tools window to tell Beautiful Soup where to find the element.

In our example, we can see that the ranking is structured as an HTML table, and each distro name sits inside a td element with the class "phr2". Inside that td element is a link containing the text we want to extract (the distro's name). That's what we do in the next few lines of code:

top_ten_distros = []
distro_tds = soup("td", class_="phr2", limit=10)  # shortcut for soup.find_all(...)
for td in distro_tds:
    top_ten_distros.append(td.find("a").contents[0])
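To look at the result, we can simply print the list with rankings (the exact names will differ over time as the ranking changes):

for rank, name in enumerate(top_ten_distros, start=1):
    print(rank, name)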

And this is what we got:

[Screenshot: the ten extracted distro names]

I hope you found this article useful, and thanks for reading!

