Top 5 Beautiful Soup Functions That Will Make Your Life Easier

Category: IT Technology · Published: 4 years ago

Apr 8 · 5 min read

Once you get into web scraping and data processing, you will find many tools that can do the job for you. One of them is Beautiful Soup, a Python library for pulling data out of HTML and XML files. It builds a parse tree from the document so that you can get at the data easily.

Original photo by Joshua Sortino on Unsplash

The basic process goes something like this:

Get the data and then process it any way you want.

That is why today I want to show you some of the top functions that Beautiful Soup has to offer.

If you are also interested in other libraries like Selenium, here are other examples you should look into:

I have written articles about Selenium and web scraping before, so before you begin with these, I would recommend you read the article “Everything About Web Scraping” because it covers the setup process. And if you are already more advanced with web scraping, try my advanced scripts like “How to Save Money with Python” and “How to Make an Analysis Tool with Python”.

Also, a good example of setting up the environment for Beautiful Soup is in the article “How to Save Money with Python”.

Let’s just jump right into it!

Beautiful Soup Setup

Before we get into the top 5 functions, we have to set up the environment and the libraries we are going to use to get the data.

In your terminal, install the libraries:

pip3 install requests

Requests lets you send HTTP requests from Python and attach content such as headers, form data, multipart files, and parameters through a simple API. It also gives you access to the response data in the same way.

sudo pip3 install beautifulsoup4

This is our main library, Beautiful Soup, which we already mentioned above.

Also, at the beginning of your Python script, import the libraries we just installed:

import requests
from bs4 import BeautifulSoup

Now let’s move on to the functions!

get()

This function actually comes from the requests library rather than Beautiful Soup, but it is absolutely essential: it downloads the web page you want to scrape. Let me show you.

First, we have to find a URL we want to scrape (get data) from:

URL = 'https://www.amazon.de/gp/product/B0756CYWWD/ref=as_li_tl?ie=UTF8&tag=idk01e-21&camp=1638&creative=6742&linkCode=as2&creativeASIN=B0756CYWWD&linkId=868d0edc56c291dbff697d1692708240'
headers = {"User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

I took a random Amazon product, and with the get function we are going to get access to the data from its web page. The headers tell the server which browser the request claims to come from (the User-Agent string). You can check yours here.

Using the requests library we get to the desired URL with defined headers.

After that, we create an object instance ‘soup’ that we can use to find anything we want on the page.

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

The BeautifulSoup() constructor takes the markup and the name of a parser, and creates a data structure representing a parsed HTML or XML document.

Most of the methods you’ll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers.
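If you want this step to be a little more defensive, here is a minimal sketch of how I might wrap it. The helper names fetch_soup and parse_html, the 10 second timeout, and the sample markup are my own assumptions for illustration, not part of the original script. The network function is only defined, not called, so the example runs offline on the stand-in page.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers, timeout=10):
    """Download a page and parse it (hypothetical helper, needs network access)."""
    page = requests.get(url, headers=headers, timeout=timeout)
    page.raise_for_status()  # raise an exception on 4xx/5xx responses
    return BeautifulSoup(page.content, "html.parser")


def parse_html(markup):
    """Parse markup that is already in memory (no network needed)."""
    return BeautifulSoup(markup, "html.parser")


# Hypothetical stand-in for a downloaded page.
sample = "<html><head><title>Demo shop</title></head><body></body></html>"
soup = parse_html(sample)
print(soup.title.get_text())
```

Calling raise_for_status() means a blocked or missing page fails loudly right away instead of handing you a soup object built from an error page.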

We can now move on to the next function, which actually searches the object we just created.

find()

With the find() function, we can search the parsed page for an element. It returns the first element that matches the criteria we pass in.

Let’s say we want to get a title and the price of the product based on their ids.

title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()

You can find the ids of these web elements by pressing F12 on your keyboard or by right-clicking and choosing ‘Inspect’.
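One thing worth knowing is that find() returns None when nothing matches, so chaining get_text() directly onto it raises an AttributeError if the id is missing from the page. Here is a small sketch of a guard against that. The helper text_by_id and the sample HTML are made up for illustration, and the sample deliberately leaves out the price element.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: it has a title but no price element.
html = """
<div>
  <span id="productTitle">  Example Product  </span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_by_id(soup, element_id):
    """Return the text of the element with this id, or None if it is absent."""
    element = soup.find(id=element_id)
    if element is None:
        return None
    return element.get_text()


title = text_by_id(soup, "productTitle")
price = text_by_id(soup, "priceblock_ourprice")

print(repr(title))
print(price)
```

Pages change their ids from time to time, so a check like this tells you which element went missing instead of crashing in the middle of the script.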

Let’s look closely at what just happened there!

get_text()

As you can see in the previous function we used get_text() to extract the text part of the newly found elements title and price.
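To make it clearer what get_text() hands back, here is a short sketch on a made-up snippet (the HTML below is an assumption, not the real Amazon markup). It shows that the text of nested tags is joined together, and that get_text() also accepts a separator and a strip flag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a nested tag and surrounding whitespace.
html = '<span id="productTitle">\n  Example <b>Product</b>\n</span>'
soup = BeautifulSoup(html, "html.parser")
element = soup.find(id="productTitle")

# Default call: all text inside the element, whitespace included.
raw = element.get_text()

# With a separator and strip=True, each piece of text is trimmed
# and the non-empty pieces are joined with the separator.
clean = element.get_text(" ", strip=True)

print(repr(raw))
print(repr(clean))
```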

But before we get to the final results there are a few more things that we have to perform on our product in order to get perfect output.

strip()

The strip() method returns a copy of the string with leading and trailing characters removed. With no argument it removes whitespace; with a string argument it removes the characters in that string instead.

We use this function to remove the empty space around our title.

This function is a plain Python string method, so it can be used anywhere and not just with Beautiful Soup. In my personal experience it has come in handy so many times when working with text elements, and that is why I am putting it on this list.
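As a quick illustration, here is roughly what that looks like on a title padded with whitespace. The title value is a hypothetical example, not the real scraped text.

```python
# Hypothetical title text, padded the way scraped text often is.
title = "\n\n      Example Product Name   \n"

clean_title = title.strip()              # no argument: trims whitespace on both ends
dashes_removed = "--sale--".strip("-")   # with an argument: trims those characters instead
left_only = title.lstrip()               # trims the start only

print(repr(clean_title))
print(repr(dashes_removed))
print(repr(left_only))
```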

split()

This function is also a general-purpose Python string method, but I have found it very useful for scraping as well.

It splits the string into different parts and we can use the parts that we desire.

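It is called on a string and takes a separator (and, optionally, a maximum number of splits), and it returns a list of the parts.

We use sep as the separator for our price string, keep the part before the decimal comma, and convert it to an integer (whole number).

replace() just replaces ‘.’ with an empty string, which removes the thousands separator.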

sep = ','
con_price = price.split(sep, 1)[0]
converted_price = int(con_price.replace('.', ''))
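Here is the same idea wrapped in a function and run on sample values, so you can see each step. The function name euro_price_to_int and the price strings are my own assumptions; they imitate the German price format (dot for thousands, comma for decimals) that this conversion expects.

```python
def euro_price_to_int(price, sep=","):
    """Convert a price string like '1.299,00 €' to whole euros (cents are dropped)."""
    # Keep only the part before the decimal comma, e.g. '1.299'.
    whole_part = price.strip().split(sep, 1)[0]
    # Remove the thousands separator and convert, e.g. '1299' -> 1299.
    return int(whole_part.replace(".", ""))


# Hypothetical price strings in the German format.
print(euro_price_to_int("1.299,00 €"))
print(euro_price_to_int("  49,99 €"))
```

Keep in mind that this throws the cents away, which is fine for a rough price check but not if you need exact amounts.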

Here are the final results:

I put the complete code for you in this Gist:

Just check your headers before you execute it.

If you want to run it, here is the terminal command:

python3 bs_tutorial.py

We are done!

Last words

As mentioned before, this is not my first time writing about Beautiful Soup, Selenium and Web Scraping in general. There are many more functions I would love to cover and many more to come. I hope you liked this tutorial and in order to keep up, follow me for more!

Thanks for reading!

Check out my other articles and follow me on Medium
Follow me on Twitter for info when I get a new article out
