Getting Started Quickly with BeautifulSoup

Category: Python · Published: 6 years ago

Installing the Beautiful Soup library

On Windows, run: pip install beautifulsoup4

Using the Beautiful Soup library

Demo HTML page: https://python123.io/ws/demo.html

Page source

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

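To get the demo.html source by hand, right-click the page in a browser, choose "View page source", and paste the result into a string: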

>>> demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>'''

Alternatively, fetch the demo.html source with the Requests library:

>>> import requests
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> demo = r.text

Next, parse it with the Beautiful Soup library:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Basic elements of the Beautiful Soup library

Understanding the Beautiful Soup library

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

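Importing the Beautiful Soup library

The Beautiful Soup library, also called BeautifulSoup4 or bs4, is conventionally imported in one of the following ways; in practice you mostly work with the BeautifulSoup class.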

from bs4 import BeautifulSoup
import bs4

The BeautifulSoup class

A BeautifulSoup object corresponds to the entire content of one HTML/XML document.

Beautiful Soup parsers

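The parser is selected by the second argument to BeautifulSoup(). The sketch below assumes that only Python's built-in 'html.parser' is guaranteed to be present; 'lxml' and 'html5lib' are third-party parsers that have to be installed separately. The helper name make_soup and the short html string are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened stand-in for the demo page (assumption, not the full document).
html = "<html><head><title>This is a python demo page</title></head></html>"

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")

soup, parser_used = make_soup(html)
print(parser_used)
print(soup.title.string)
```

Different parsers can build slightly different trees from malformed markup, so it is probably safest to name one parser explicitly once a script depends on the exact structure.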

Basic elements of the BeautifulSoup class

Tag

Basic element    Description
Tag    A tag, the most basic unit for organizing information; <> and </> mark where it starts and ends
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in HTML syntax can be accessed as soup.<tag>. When the HTML document contains several elements with the same <tag>, soup.<tag> returns the first one.
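A short sketch of the "first one only" behaviour, compared with find_all(), which returns every match. The html string is a trimmed copy of the demo page's second paragraph, used here as an assumption so the example runs without network access.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a               # only the first <a> in document order
all_links = soup.find_all("a")    # every <a>, in document order

first_id = first_link["id"]
all_ids = [a["id"] for a in all_links]
print(first_id)   # link1
print(all_ids)    # ['link1', 'link2']
```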

Tag name

Basic element    Description
Name    The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'

Every <tag> has its own name, available through <tag>.name as a string.

Tag attrs (attributes)

Basic element    Description
Attributes    The tag's attributes, organized as a dict. Format: <tag>.attrs
>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> kv = soup.a.attrs
>>> kv['class']
['py1']
>>> kv['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(soup.a)
<class 'bs4.element.Tag'>
>>> type(soup.a.attrs)
<class 'dict'>

A <tag> can have zero or more attributes, held in a dict.
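Besides going through .attrs, a tag can be indexed directly, and .get() avoids a KeyError when an attribute may be absent. A minimal sketch, assuming a one-link fragment of the demo page plus a bare <b> tag added only to show the zero-attribute case:

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    '<b>plain</b>'
)
soup = BeautifulSoup(html, "html.parser")

link = soup.a
href = link["href"]            # dict-style access; raises KeyError if missing
css_classes = link["class"]    # class is multi-valued, so bs4 returns a list
target = link.get("target")    # None, because the attribute is not present
has_id = link.has_attr("id")   # True
no_attrs = soup.b.attrs        # {} for a tag without attributes

print(href, css_classes, target, has_id, no_attrs)
```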

Tag NavigableString

Basic element    Description
NavigableString    The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

A NavigableString can span multiple levels: soup.p.string above returns the text held by the <b> tag nested inside <p>.
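One caveat is worth a quick sketch: .string only reaches through nested tags while each level has a single child. When a tag has several children, .string is None, and get_text() can be used to join all the text instead. The html below is a shortened version of the demo page's two paragraphs (an assumption for brevity).

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, course_p = soup.find_all("p")

single = title_p.string        # one child at each level, so the text is found
multiple = course_p.string     # several children, so this is None
joined = course_p.get_text()   # concatenates every string under the tag

print(single)
print(multiple)
print(joined)
```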

Tag Comment

Basic element    Description
Comment    The comment part of the string inside a tag; a special type, Comment
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>",'html.parser')
>>> newsoup.b.string
'This is a comment'
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
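Because .string shows a comment without its <!-- --> markers, the two results above look alike until their types are checked. A minimal sketch of filtering comments out by type; the helper name visible_text is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)

def visible_text(tag):
    """Return tag.string as a plain str, or None for comments and empty results."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

b_text = visible_text(newsoup.b)
p_text = visible_text(newsoup.p)
print(b_text)   # None
print(p_text)   # This is not a comment
```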

Traversing HTML content with the bs4 library

Downward traversal of the tag tree

>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> print(soup.head.contents)
[<title>This is a python demo page</title>]
>>> print(soup.body.contents)
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

Iterating downward through the tag tree

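Downward iteration can be sketched with .children (direct children only) and .descendants (every node below the tag, depth first). Both also yield the '\n' strings between tags, so the sketch keeps only Tag objects. The html is a shortened stand-in for the demo page's body (an assumption).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Courses: <a id="link1">Basic Python</a></p>\n'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body>: the two <p> tags plus the newline strings around them.
child_count = len(list(soup.body.children))
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# Every tag anywhere below <body>, in document order.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_count)
print(child_names)
print(descendant_names)
```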

Upward traversal of the tag tree

>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

Iterating upward through the tag tree

import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,'html.parser')
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

This iterates over all ancestor nodes, including soup itself, so the two cases are handled separately.
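To see exactly which nodes .parents yields, the names can be collected into a list. This is a minimal sketch on a trimmed copy of the demo markup (the html string is an assumption, not the full page). With current bs4 releases the soup object reports its name as '[document]', and the generator stops after it rather than yielding None.

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Walk from <a> up to the top of the tree, recording each ancestor's name.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The soup object is the root, so it has no parent of its own.
root_parent = soup.parent
print(root_parent)
```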

Sibling traversal of the tag tree

Sibling traversal happens between nodes that share the same parent.

>>> soup = BeautifulSoup(demo,'html.parser')
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

Iterating over siblings in the tag tree

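Sibling iteration can be sketched with .next_siblings and .previous_siblings, which yield every later or earlier node under the same parent. Siblings may be plain strings as well as tags, so the type of each item should be checked before treating it as a tag. The html is a shortened copy of the demo page's second paragraph (an assumption).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

after = list(first_link.next_siblings)        # ' and ', the second <a>, '.'
before = list(first_link.previous_siblings)   # 'Courses: '

# Keep only the siblings that are tags, skipping the text nodes.
later_tag_ids = [s["id"] for s in after if isinstance(s, Tag)]

print(after)
print(before)
print(later_tag_ids)
```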

Summary of tag tree traversal

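The prettify() method of the bs4 library

prettify() formats HTML content so that it is easier to read: each tag and string goes on its own line, indented by nesting depth.

Printing the HTML text without it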

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

Using the prettify() method

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(r.text,'html.parser')
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

prettify() also works on a single tag: <tag>.prettify()

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
