Top 5 Beautiful Soup Functions That Will Make Your Life Easier

Category: IT Technology · Published: 4 years ago

Apr 8 · 5 min read

Once you get into web scraping and data processing, you will find many tools that can do the job for you. One of them is Beautiful Soup, a Python library for pulling data out of HTML and XML files. It builds a parse tree from the document so that you can navigate and extract data easily.

Original photo by Joshua Sortino on Unsplash

The basic process goes something like this:

Get the data and then process it any way you want.

That is why today I want to show you some of the top functions that Beautiful Soup has to offer.

If you are also interested in other libraries like Selenium, here are other examples you should look into:

I have written articles about Selenium and Web Scraping before, so before you begin with these, I would recommend you read this article “ Everything About Web Scraping ”, because of the setup process. And if you are already more advanced with Web Scraping, try my advanced scripts like “ How to Save Money with Python ” and “ How to Make an Analysis Tool with Python ”.

Also, a good example of setting up the environment for BeautifulSoup is in the article “ How to Save Money with Python ”.

Let’s just jump right into it!

Beautiful Soup Setup

Before we get into the top 5 functions, we have to set up our environment and the libraries we are going to use to get the data.

In your terminal, install the libraries:

pip3 install requests

Requests lets you send HTTP requests and attach headers, form data, multipart files, and parameters through a simple Python API. It also gives you access to the response data in the same way.

sudo pip3 install beautifulsoup4

This is our main library Beautiful Soup that we already mentioned above.

At the top of your Python script, import the libraries we just installed:

import requests
from bs4 import BeautifulSoup

Now let’s move on to the functions!

get()

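This function is essential: it fetches the web page you want to scrape. Strictly speaking, get() belongs to the Requests library rather than Beautiful Soup, but you will need it in almost every scraping script. Let me show you.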

First, we have to find a URL we want to scrape (get data) from:

URL = 'https://www.amazon.de/gp/product/B0756CYWWD/ref=as_li_tl?ie=UTF8&tag=idk01e-21&camp=1638&creative=6742&linkCode=as2&creativeASIN=B0756CYWWD&linkId=868d0edc56c291dbff697d1692708240'
headers = {"User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

I took a random Amazon product, and with the get function we are going to access the data on its web page. The User-agent header simply tells the server which browser is making the request. You can check yours here .

Using the requests library we get to the desired URL with defined headers.

After that, we create an object instance ‘soup’ that we can use to find anything we want on the page.

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

BeautifulSoup(markup, parser) creates a data structure representing a parsed HTML or XML document.

Most of the methods you’ll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers.

We can now move on to the next function, which actually searches the object we just created.
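The fetch-and-parse step above can be wrapped in a small helper. This is only a sketch under a few assumptions: the names fetch_soup, parse_html, and HEADERS are made up for illustration, the timeout value is arbitrary, and the shortened User-Agent string is a placeholder for your own. The parsing step is checked against an inline HTML string so the example runs without network access.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical header set; replace the User-Agent with your own browser's.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def parse_html(content):
    """Build a parse tree from raw HTML (bytes or str) with the built-in parser."""
    return BeautifulSoup(content, "html.parser")


def fetch_soup(url, headers=HEADERS, timeout=10):
    """Download a page and return it as a BeautifulSoup object."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return parse_html(response.content)


# Offline check of the parsing step, so no network access is needed here.
demo = parse_html(b"<html><head><title>Demo</title></head></html>")
print(demo.title.get_text())  # Demo
```

Calling raise_for_status() before parsing is a design choice: sites that block scrapers often answer with an error page, and failing early is easier to debug than a later find() that returns nothing.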

find()

With the find() function, we are able to search for anything in our web page.

Let’s say we want to get a title and the price of the product based on their ids.

title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()

You can find the ids of these web elements by pressing F12 on your keyboard or by right-clicking the element and choosing ‘Inspect’.
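One thing worth knowing is that find() returns None when nothing matches, so chaining .get_text() directly raises an AttributeError if the page layout changes. Here is a minimal sketch of a safer lookup. The HTML snippet is made up for illustration (a real product page is far larger), and text_by_id is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
HTML = """
<div>
  <span id="productTitle">
    Example Product
  </span>
  <span id="priceblock_ourprice">1.299,00 €</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def text_by_id(soup, element_id):
    """Return the text of the element with this id, or None if it is absent."""
    tag = soup.find(id=element_id)  # find() returns None when nothing matches
    return tag.get_text() if tag is not None else None


price = text_by_id(soup, "priceblock_ourprice")
missing = text_by_id(soup, "no_such_id")
print(price)    # 1.299,00 €
print(missing)  # None
```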

Let’s look closely at what just happened there!

get_text()

As you can see in the previous function we used get_text() to extract the text part of the newly found elements title and price.
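To see what get_text() actually returns, here is a small sketch on a made-up snippet. It also shows the optional separator and strip arguments, which clean up whitespace in one call. The markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: note the newlines and indentation around the text.
soup = BeautifulSoup('<p id="t">\n  Hello <b>World</b>\n</p>', "html.parser")
tag = soup.find(id="t")

# Plain get_text() joins all nested text and keeps the original whitespace.
raw = tag.get_text()

# With a separator and strip=True, each piece of text is stripped,
# empty pieces are dropped, and the rest are joined with the separator.
clean = tag.get_text(" ", strip=True)

print(repr(raw))    # '\n  Hello World\n'
print(repr(clean))  # 'Hello World'
```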

But before we get to the final results there are a few more things that we have to perform on our product in order to get perfect output.

strip()

The strip() method returns a copy of the string with both leading and trailing characters removed (based on the string argument passed).

We use this function to remove the whitespace around our title.

strip() is a built-in Python string method, not part of Beautiful Soup, but in my experience it comes in handy so often when working with text elements that I am putting it on this list.
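A short sketch of strip() on a title-like string follows. The sample strings are made up, and raw_title only imitates the kind of padded text that get_text() tends to return.

```python
# Made-up value imitating text extracted from a page, padded with whitespace.
raw_title = "  \n  Example Product \n\n"

# With no argument, strip() removes leading and trailing whitespace.
title = raw_title.strip()

# With an argument, it removes any of the given characters from both ends.
amount = "$$49.99$$".strip("$")

# lstrip() and rstrip() work on one side only.
left_only = "xxabcxx".lstrip("x")

print(repr(title))      # 'Example Product'
print(repr(amount))     # '49.99'
print(repr(left_only))  # 'abcxx'
```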

split()

This is also a general-purpose Python string method, but I have found it very useful as well.

It splits the string into different parts and we can use the parts that we desire.

It works with a combination of the separator and a string.

We use sep as the separator to split our price string, keep the part before it, and convert that part to an integer (whole number).

replace() just replaces ‘.’ with an empty string.

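sep = ','  # the price on this German page uses ',' as the decimal separator
con_price = price.split(sep, 1)[0]  # split once and keep the part before the comma
converted_price = int(con_price.replace('.', ''))  # drop thousands dots, then convert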
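The three lines above can be traced on a sample value. This is a sketch assuming a German-style price string such as "1.299,00 €". Both the sample prices and the price_to_int helper are made up for illustration.

```python
def price_to_int(price_text, decimal_sep=",", thousands_sep="."):
    """Keep the whole-number part of a price string and return it as an int."""
    # "1.299,00 €" -> ["1.299", "00 €"]; maxsplit=1 splits at the first comma only.
    whole_part = price_text.split(decimal_sep, 1)[0]
    # "1.299" -> "1299"
    return int(whole_part.replace(thousands_sep, ""))


print(price_to_int("1.299,00 €"))  # 1299
print(price_to_int("49,99 €"))     # 49
```

Note that this truncates rather than rounds: the cents are discarded, which is fine for a rough price check but not for exact arithmetic.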

Here are the final results:

I put the complete code for you in this Gist:

Just check your headers before you execute it.

If you want to run it, here is the terminal command:

python3 bs_tutorial.py

We are done!

Last words

As mentioned before, this is not my first time writing about Beautiful Soup, Selenium and Web Scraping in general. There are many more functions I would love to cover and many more to come. I hope you liked this tutorial and in order to keep up, follow me for more!

Thanks for reading!

Check out my other articles and follow me on Medium
Follow me on Twitter for info when I get a new article out
