Top 5 Beautiful Soup Functions That Will Make Your Life Easier
Apr 8 · 5 min read
Once you get into web scraping and data processing, you will find many tools that can do the job for you. One of them is Beautiful Soup, a Python library for pulling data out of HTML and XML files. It builds a parse tree from the document so that you can get at the data easily.
The basic process goes something like this:
Get the data and then process it any way you want.
That is why today I want to show you some of the top functions that Beautiful Soup has to offer.
If you are also interested in other libraries like Selenium, here are other examples you should look into:
I have written articles about Selenium and Web Scraping before, so before you begin with these, I would recommend you read the article “Everything About Web Scraping” because of the setup process. And if you are already more advanced with Web Scraping, try my advanced scripts like “How to Save Money with Python” and “How to Make an Analysis Tool with Python”.
Also, a good example of setting up the environment for Beautiful Soup is in the article “How to Save Money with Python”.
Let’s just jump right into it!
Beautiful Soup Setup
Before we get into the top 5 functions, we have to set up our environment and the libraries we are going to use to get the data.
In your terminal, install the libraries:
pip3 install requests
Requests lets you send HTTP requests and attach content like headers, form data, multipart files, and parameters using plain Python objects. It also gives you access to the response data in the same way.
pip3 install beautifulsoup4
This is our main library Beautiful Soup that we already mentioned above.
At the beginning of your Python script, import the libraries we just installed:
import requests
from bs4 import BeautifulSoup
Now let’s move on to the functions!
get()
This function is essential: it is how you request the web page you want to scrape. Let me show you.
First, we have to find a URL we want to scrape (get data) from:
URL = 'https://www.amazon.de/gp/product/B0756CYWWD/ref=as_li_tl?ie=UTF8&tag=idk01e-21&camp=1638&creative=6742&linkCode=as2&creativeASIN=B0756CYWWD&linkId=868d0edc56c291dbff697d1692708240'
headers = {"User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
I took a random Amazon product, and with the get function we are going to access the data from its web page. The headers tell the server which browser is making the request (the User-agent string). You can check yours here.
Using the requests library we get to the desired URL with defined headers.
After that, we create an object instance ‘soup’ that we can use to find anything we want on the page.
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
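The call above waits indefinitely if the server never answers, and it will happily hand an error page to the parser. A slightly more defensive variant could look like the sketch below. The fetch_page name and the 10-second timeout are illustrative assumptions, not part of the original script.

```python
import requests


def fetch_page(url, headers, timeout=10):
    """Download a page and return its raw bytes.

    fetch_page and the 10-second default are hypothetical choices for this sketch.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page
    response.raise_for_status()
    return response.content
```

The returned bytes can be passed straight to the parser, for example BeautifulSoup(fetch_page(URL, headers), 'html.parser').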
BeautifulSoup(markup, parser) creates a data structure representing a parsed HTML or XML document.
Most of the methods you’ll call on a BeautifulSoup object are inherited from PageElement or Tag.
Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers.
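To see what that means in practice, here is a small sketch that parses a made-up HTML string instead of a live page, so it runs without a network connection. The snippet and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import PageElement, Tag

# Made-up HTML snippet standing in for a downloaded page
html = "<html><head><title>Demo shop</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# The soup object is itself a Tag, which in turn is a PageElement
is_tag = isinstance(soup, Tag)
is_page_element = isinstance(soup, PageElement)

# Child tags are reachable as attributes of the tree
page_title = soup.title.string

print(is_tag, is_page_element, page_title)
```

This is why methods such as find() and get_text() work both on the whole soup and on any single element you pull out of it.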
We can now move on to the next function, which actually searches the object we just created.
find()
With the find() function, we are able to search for anything in our web page.
Let’s say we want to get a title and the price of the product based on their ids.
title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()
You can find the ids of these web elements by pressing F12 on your keyboard or by right-clicking the element and choosing ‘Inspect’.
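One thing to watch for: find() returns None when no element matches, so chaining .get_text() onto it raises an AttributeError if the page layout changes. Below is a minimal sketch of a safer lookup, run against a made-up HTML snippet. The text_by_id helper is a hypothetical name for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Made-up markup that reuses the two ids from the example above
html = """
<div>
  <span id="productTitle">  Example Headphones  </span>
  <span id="priceblock_ourprice">1.299,00 €</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_by_id(soup, element_id):
    """Return the text of the element with this id, or None if it is absent.

    text_by_id is a hypothetical helper, not a Beautiful Soup function.
    """
    element = soup.find(id=element_id)
    if element is None:
        return None
    return element.get_text()


title = text_by_id(soup, "productTitle")
price = text_by_id(soup, "priceblock_ourprice")
missing = text_by_id(soup, "doesNotExist")
```

Here missing ends up as None instead of crashing the script, which makes it easier to report that an id was not found.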
Let’s look closely at what just happened there!
get_text()
As you can see, in the previous function we used get_text() to extract the text of the newly found elements, title and price.
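get_text() collects the text of an element and of everything nested inside it. The sketch below uses a made-up title element to show this. It also passes the optional separator and strip arguments, which trim each piece of text for you.

```python
from bs4 import BeautifulSoup

# Made-up element with a nested <b> tag and surrounding whitespace
html = '<span id="productTitle">\n   Example <b>Wireless</b> Headphones\n</span>'
soup = BeautifulSoup(html, "html.parser")
element = soup.find(id="productTitle")

# Plain call: text of the element and its children, whitespace kept as-is
raw_text = element.get_text()

# Optional arguments: join the pieces with a space and strip each one
clean_text = element.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```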
But before we get to the final results there are a few more things that we have to perform on our product in order to get perfect output.
strip()
The strip() method returns a copy of the string with leading and trailing characters removed. With no argument it removes whitespace; with a string argument it removes the characters listed in that string.
We use this function to remove the whitespace around our title.
This is a general Python string method, not something specific to Beautiful Soup, but in my personal experience it has come in handy so many times when working with text elements, and that is why I am putting it on this list.
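As a small illustration, here is strip() applied to a made-up title string of the kind get_text() tends to return. The sample values are assumptions, not output from the real page.

```python
# Made-up title text as get_text() might return it, padded with whitespace
raw_title = "\n\n   Example Wireless Headphones   \n"

# With no argument, strip() removes leading and trailing whitespace
title = raw_title.strip()

# With an argument, it removes any of the listed characters from both ends
stars = "**sale**".strip("*")

print(repr(title))
print(stars)
```

Note that strip() only touches the two ends of the string; whitespace between words stays where it is.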
split()
This function is also general-purpose Python, but I found it very useful as well.
It splits the string into different parts, and we can use the parts that we want.
You call it on a string and pass the separator to split on.
We use sep as the separator in our price string, keep the part before it, and convert that part to an integer (whole number).
replace() just replaces ‘.’ with an empty string.
sep = ','
con_price = price.split(sep, 1)[0]
converted_price = int(con_price.replace('.', ''))
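The three lines above can be wrapped into a small function so they are easy to try on sample values. This is only a sketch: parse_price is a hypothetical helper name, and it assumes German-style prices where ',' is the decimal separator and '.' groups thousands.

```python
def parse_price(price_text, sep=","):
    """Convert a German-style price such as '1.299,00 €' to whole euros.

    parse_price is a hypothetical helper wrapping the three lines above.
    """
    # Keep only the part before the decimal comma
    whole_part = price_text.strip().split(sep, 1)[0]
    # Drop the '.' thousands separators before converting
    return int(whole_part.replace(".", ""))


print(parse_price("1.299,00 €"))
print(parse_price("\n 59,99 €"))
```

The cents are simply cut off, not rounded, so "59,99 €" becomes 59.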
Here are the final results:
I put the complete code for you in this Gist:
Just check your headers before you execute it.
If you want to run it, here is the terminal command:
python3 bs_tutorial.py
We are done!
Last words
As mentioned before, this is not my first time writing about Beautiful Soup, Selenium and Web Scraping in general. There are many more functions I would love to cover and many more to come. I hope you liked this tutorial and in order to keep up, follow me for more!
Thanks for reading!