Top 5 Beautiful Soup Functions That Will Make Your Life Easier

Category: IT Technology · Published: 4 years ago

Apr 8 · 5 min read

Once you get into web scraping and data processing, you will find many tools that can do the job for you. One of them is Beautiful Soup, a Python library for pulling data out of HTML and XML files. It builds a parse tree from the document so that you can navigate it and extract data easily.

Original photo by Joshua Sortino on Unsplash

The basic process goes something like this:

Get the data and then process it any way you want.

That is why today I want to show you some of the top functions that Beautiful Soup has to offer.

If you are also interested in other libraries like Selenium, here are other examples you should look into:

I have written articles about Selenium and web scraping before, so before you begin with these, I would recommend you read the article “Everything About Web Scraping”, because of the setup process. And if you are already more advanced with web scraping, try my advanced scripts like “How to Save Money with Python” and “How to Make an Analysis Tool with Python”.

Also, a good example of setting up the environment for Beautiful Soup is in the article “How to Save Money with Python”.

Let’s just jump right into it!

Beautiful Soup Setup

Before we get into the top 5 functions, we have to set up the environment and the libraries we are going to use to get the data.

In your terminal, install the following libraries:

pip3 install requests

Requests lets you send HTTP requests and add content such as headers, form data, multipart files, and parameters using simple Python objects. It also lets you access the response data in the same way.

sudo pip3 install beautifulsoup4

This is our main library Beautiful Soup that we already mentioned above.

Also, at the beginning of your Python script, import the libraries we just installed:

import requests
from bs4 import BeautifulSoup

Now let’s move on to the functions!

get()

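This function is absolutely essential, since it is how you reach the web page you want. Strictly speaking it belongs to the Requests library rather than to Beautiful Soup, but every scraping script here starts with it. Let me show you.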

First, we have to find a URL we want to scrape (get data) from:

URL = 'https://www.amazon.de/gp/product/B0756CYWWD/ref=as_li_tl?ie=UTF8&tag=idk01e-21&camp=1638&creative=6742&linkCode=as2&creativeASIN=B0756CYWWD&linkId=868d0edc56c291dbff697d1692708240'
headers = {"User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

I took a random Amazon product, and with the get function we are going to access the data from its web page. The headers identify your browser to the server through the User-Agent string. You can check yours here.

Using the requests library we get to the desired URL with defined headers.

After that, we create an object instance ‘soup’ that we can use to find anything we want on the page.

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

BeautifulSoup(markup, features), where features names the parser (here 'html.parser'), creates a data structure representing a parsed HTML or XML document.

Most of the methods you’ll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers.
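As a minimal sketch of the two steps above, the helper below wraps the request and the parsing in one function. The name fetch_soup, the timeout value, and the raise_for_status() check are my assumptions for illustration and are not part of the original script. The inline HTML is made up so that the parsing step can be tried without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Offline demonstration: parse a small, made-up HTML string the same way.
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(sample_soup.title.get_text())  # prints: Demo page
```

With a real page you would call fetch_soup(URL, headers=headers) and work with the returned object exactly like the soup variable above.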

We can now move on to the next function, which actually searches the object we just created.

find()

With the find() function, we are able to search for anything in our web page.

Let’s say we want to get a title and the price of the product based on their ids.

title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()

You can find the ids of these web elements by pressing F12 on your keyboard or by right-clicking and choosing ‘Inspect’.
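One thing worth knowing is that find() returns None when nothing matches, so chaining .get_text() directly fails if the page layout changes. The sketch below shows one way to guard against that. The helper name text_by_id and the sample HTML are hypothetical; only the two ids come from the example above.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the two elements we are looking for.
sample_html = """
<div>
  <span id="productTitle">
      Example Product
  </span>
  <span id="priceblock_ourprice">1.299,00 €</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def text_by_id(soup, element_id):
    """Return the text of the element with this id, or None if it is missing."""
    element = soup.find(id=element_id)
    return element.get_text() if element is not None else None


title = text_by_id(soup, "productTitle")
price = text_by_id(soup, "priceblock_ourprice")
missing = text_by_id(soup, "doesNotExist")

print(repr(price))    # prints: '1.299,00 €'
print(missing)        # prints: None
```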

Let’s look closely at what just happened there!

get_text()

As you can see in the previous section, we used get_text() to extract the text of the newly found title and price elements.
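get_text() also accepts an optional separator and a strip flag, which can save some cleanup later. Here is a small sketch on a made-up paragraph (the HTML is my own example, not taken from the Amazon page):

```python
from bs4 import BeautifulSoup

html = "<p>  Hello <b>world</b> </p>"
paragraph = BeautifulSoup(html, "html.parser").find("p")

# Default: all text pieces are concatenated, whitespace included.
raw_text = paragraph.get_text()

# strip=True trims every text piece before joining them with the separator,
# which is an empty string by default, so the words run together.
joined_text = paragraph.get_text(strip=True)

# Passing a separator keeps the trimmed pieces apart.
spaced_text = paragraph.get_text(" ", strip=True)

print(repr(joined_text))  # prints: 'Helloworld'
print(repr(spaced_text))  # prints: 'Hello world'
```

For a single element with no nested tags, such as the product title, get_text().strip() and get_text(strip=True) give the same result.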

But before we get to the final results there are a few more things that we have to perform on our product in order to get perfect output.

strip()

The strip() method returns a copy of the string with leading and trailing characters removed. Called without an argument it removes whitespace; if you pass a string, it removes any of the characters in that string instead.

We use this function to remove the whitespace surrounding our title.

strip() is a general Python string method, not something specific to Beautiful Soup, but in my experience it comes in handy so often when working with text elements that I am putting it on this list.
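The whitespace cleanup described above can be sketched as follows. The title string is invented to imitate what get_text() typically returns for an element that is padded with newlines and spaces.

```python
# A made-up title, padded the way scraped text often is.
raw_title = "\n\n      Example Product, 500 GB   \n"

clean_title = raw_title.strip()      # removes whitespace on both sides
left_only = raw_title.lstrip()       # removes leading whitespace only
dashed = "--Example--".strip("-")    # removes the given characters instead

print(repr(clean_title))  # prints: 'Example Product, 500 GB'
print(repr(dashed))       # prints: 'Example'
```

Note that strip() never touches whitespace inside the string, only at its ends.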

split()

This is also a general-purpose Python string method, but I have found it very useful as well.

It splits a string into parts at a separator, and we can then use the parts we need.

Here we use sep as the separator for the price string, keep the part before the first comma, and convert it to an integer (whole number).

replace() removes the ‘.’ thousands separator by replacing it with an empty string.

sep = ','
con_price = price.split(sep, 1)[0]
converted_price = int(con_price.replace('.', ''))
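To make the conversion easier to reuse, the same steps can be wrapped in a function. This is only a sketch under the assumption that the price uses the German format shown on amazon.de ("." for thousands, "," for decimals); the function names and the sample prices are hypothetical. The second function is an alternative of my own that keeps the cents by using Decimal instead of int.

```python
from decimal import Decimal


def price_to_int(price_text, sep=","):
    """Keep the part before the decimal comma and drop thousands dots."""
    whole_part = price_text.split(sep, 1)[0]
    return int(whole_part.replace(".", ""))


def price_to_decimal(price_text):
    """Alternative that keeps the cents (assumes a 'number currency' layout)."""
    number = price_text.split()[0]  # drop the currency symbol after the space
    return Decimal(number.replace(".", "").replace(",", "."))


print(price_to_int("1.299,00 €"))      # prints: 1299
print(price_to_int("49,99 €"))         # prints: 49
print(price_to_decimal("1.299,00 €"))  # prints: 1299.00
```

The integer version silently drops the cents, which is fine for a rough price check but not for exact comparisons.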

Here are the final results:

I put the complete code for you in this Gist:

Just check your headers before you execute it.

If you want to run it, here is the terminal command:

python3 bs_tutorial.py
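For reference, here is a rough sketch of how the steps from this article fit together. It is not the code from the Gist: the function name extract_product and the sample HTML are my assumptions, and it works on an HTML string so that it can be tried offline. For a live page you would pass page.content from requests.get() instead.

```python
from bs4 import BeautifulSoup


def extract_product(html):
    """Return (title, whole-number price) or None if an element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find(id="productTitle")
    price_tag = soup.find(id="priceblock_ourprice")
    if title_tag is None or price_tag is None:
        return None
    title = title_tag.get_text().strip()
    # Keep the part before the decimal comma, then drop thousands dots.
    whole_part = price_tag.get_text().strip().split(",", 1)[0]
    return title, int(whole_part.replace(".", ""))


# Made-up markup imitating the two elements used in this article.
sample_html = """
<span id="productTitle">
    Example Product
</span>
<span id="priceblock_ourprice">1.299,00 €</span>
"""
print(extract_product(sample_html))  # prints: ('Example Product', 1299)
```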

We are done!

Last words

As mentioned before, this is not my first time writing about Beautiful Soup, Selenium and Web Scraping in general. There are many more functions I would love to cover and many more to come. I hope you liked this tutorial and in order to keep up, follow me for more!

Thanks for reading!

Check out my other articles and follow me on Medium
Follow me on Twitter for info when I get a new article out
