Using the Beautiful Soup Library for Page Parsing in Python

Updated: 2022-09-09 08:49:06  Author: 小嗷犬


1. Introduction to the Beautiful Soup Library

Beautiful Soup, often abbreviated BS4 (the 4 is the version number), is a widely used Python page-parsing library that quickly extracts specified data from HTML or XML documents.

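Compared with the lxml library covered earlier, Beautiful Soup is simpler to use. Unlike regular expressions and XPath, it does not require memorizing much special syntax, even though those approaches are more efficient and more direct.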

For most Python users, ease of use matters more than raw efficiency.

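Beautiful Soup is a third-party library, so it has to be installed with pip. The package on PyPI is named beautifulsoup4 (bs4 is a small wrapper package that installs the same thing):

pip install beautifulsoup4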

BS4 relies on a document parser to parse a page, so a parser is needed as well.
Python ships with one, html.parser, but it is somewhat slow. Following the previous article (Python document parsing: using the lxml library), we install lxml as the parser:

pip install lxml
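As a minimal sketch of the parser choice described above, the helper below tries lxml first and falls back to the built-in html.parser when lxml is not installed. The function name make_soup is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Build a BeautifulSoup object, preferring lxml when it is available."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed, so use the parser bundled with Python
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>hello</p>')
print(soup.p.string)
```

Note that the two parsers can build slightly different trees from the same fragment (lxml wraps fragments in html and body tags), so selectors that depend on those wrappers may behave differently after a fallback.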

2. Beautiful Soup Methods

To start using bs4, create a BeautifulSoup object from text and specify the document parser:

from bs4 import BeautifulSoup

html_str = '''
<div>
    <ul>
        <li class="web" id="0"><a href="www.python.org">Python</a></li>
        <li class="web" id="1"><a href="www.java.com">Java</a></li>
        <li class="web" id="2"><a href="www.csdn.net">CSDN</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')
# prettify() formats the HTML/XML document for output
print(soup.prettify())

bs4 provides two commonly used search methods, find_all() and find(). They are used as follows:

2.1 find_all()

find_all() searches all descendants of the current tag, checks whether each one matches the filter conditions, and returns the matches as a list. The syntax is:

find_all(name, attrs, recursive, text, limit)

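Parameters:
name: finds all tags whose name is name; string objects are ignored automatically.
attrs: searches tags by attribute name and value. Because class is a Python keyword, filter by CSS class with the keyword argument class_ instead.
recursive: by default find_all() searches all descendants of the tag; set recursive=False to search only its direct children.
text: searches the string content of the document; it accepts a string, a regular expression, a list, or True. Newer bs4 releases call this argument string and keep text as the older name.
limit: find_all() returns every match, which can hurt performance; limit caps the number of results returned.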

from bs4 import BeautifulSoup

html_str = '''
<div>
    <ul>
        <li class="web" id="0"><a href="www.python.org">Python</a></li>
        <li class="web" id="1"><a href="www.java.com">Java</a></li>
        <li class="web" id="2"><a href="www.csdn.net">CSDN</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')

print(soup.find_all("li"))
print(soup.find_all("a"))
print(soup.find_all(text="Python"))

The program above uses find_all() to find every <li> tag, every <a> tag, and the string "Python" in the page.
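The remaining find_all() parameters can be sketched on the same sample markup. This is only an illustration: it uses the built-in html.parser so that it runs without lxml, and the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

html_str = '''
<div>
    <ul>
        <li class="web" id="0"><a href="www.python.org">Python</a></li>
        <li class="web" id="1"><a href="www.java.com">Java</a></li>
        <li class="web" id="2"><a href="www.csdn.net">CSDN</a></li>
    </ul>
</div>
'''
# html.parser ships with Python; swap in 'lxml' if it is installed
soup = BeautifulSoup(html_str, 'html.parser')

# class_ filters by CSS class (class itself is a Python keyword)
by_class = soup.find_all('li', class_='web')

# attrs takes a dict of attribute names and values
by_id = soup.find_all('li', attrs={'id': '1'})

# limit caps the number of results
first_two = soup.find_all('a', limit=2)

# recursive=False only looks at direct children of the tag it is called on:
# <li> sits under <ul>, not directly under <div>, so nothing matches
direct_li = soup.div.find_all('li', recursive=False)
direct_ul = soup.div.find_all('ul', recursive=False)

print(len(by_class), by_id[0].a.string, len(first_two))
print(len(direct_li), len(direct_ul))
```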

2.2 find()

find() is very similar to find_all(). The difference is that find() returns only the first matching result, so it has no limit parameter. The syntax is:

find(name, attrs, recursive, text)

Besides being used the same way as find_all(), find() also has a shorthand form in bs4:

soup.find("li")
soup.li

These two lines do the same thing: both return the first <li> tag. The complete program:

from bs4 import BeautifulSoup

html_str = '''
<div>
    <ul>
        <li class="web" id="0"><a href="www.python.org">Python</a></li>
        <li class="web" id="1"><a href="www.java.com">Java</a></li>
        <li class="web" id="2"><a href="www.csdn.net">CSDN</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')

print(soup.li)
print(soup.a)

The program above prints the first <li> tag and the first <a> tag.
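Once find() has returned a tag, its attributes and text can be read from it. When nothing matches, find() returns None, so it is worth checking before use. A small sketch, assuming a shortened version of the sample markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html_str = '<ul><li class="web" id="0"><a href="www.python.org">Python</a></li></ul>'
soup = BeautifulSoup(html_str, 'html.parser')

first_link = soup.find('a')
href = first_link['href']        # attribute access works like a dict
label = first_link.get_text()    # text inside the tag

# find() returns None when no tag matches
missing = soup.find('table')
if missing is None:
    print('no <table> in this document')

print(href, label)
```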

2.3 select()

bs4 supports most CSS selectors, including the common tag, class, and id selectors as well as hierarchical selectors. Beautiful Soup provides a select() method: pass it a selector and it finds the matching content in the HTML document.

For example:

from bs4 import BeautifulSoup

html_str = '''
<div>
    <ul>
        <li class="web" id="web0"><a href="www.python.org">Python</a></li>
        <li class="web" id="web1"><a href="www.java.com">Java</a></li>
        <li class="web" id="web2"><a href="www.csdn.net">CSDN</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')
# Select by tag name
print(soup.select('body'))
# Select with an attribute selector
print(soup.select('a[href]'))
# Select by class
print(soup.select('.web'))
# Select descendant nodes
print(soup.select('div ul'))
# Select by id
print(soup.select('#web1'))
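select() always returns a list, while the related select_one() method returns only the first match, or None when nothing matches. The sketch below combines a class selector with the child combinator to pull out each link's text and href. It uses the built-in html.parser so that it runs without lxml, and the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

html_str = '''
<div>
    <ul>
        <li class="web" id="web0"><a href="www.python.org">Python</a></li>
        <li class="web" id="web1"><a href="www.java.com">Java</a></li>
        <li class="web" id="web2"><a href="www.csdn.net">CSDN</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html_str, 'html.parser')

# 'li.web > a' matches <a> tags that are direct children of <li class="web">
links = [(a.get_text(), a['href']) for a in soup.select('li.web > a')]

# select_one() returns the first match only
second = soup.select_one('#web1 a')

# ... or None when the selector matches nothing
nothing = soup.select_one('table')

print(links)
print(second.get_text(), nothing)
```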

For more methods and detailed usage instructions, see the documentation:
https://beautiful-soup-4.readthedocs.io/en/latest/

3. Code Example

Now that we know Beautiful Soup, let's try rewriting the crawler from last time:

import os
import sys
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://www.csdn.net/'
x = requests.get(url)

soup = BeautifulSoup(x.text, 'lxml')

# Select every <img> tag that has a src attribute
img_list = soup.select('img[src]')

# Switch to the script's directory (abspath avoids chdir('') when the
# script is started from its own folder), then create the img folder
os.chdir(os.path.dirname(os.path.abspath(sys.argv[0])))

if not os.path.exists('img'):
    os.mkdir('img')
    print('Folder created')
else:
    print('Folder already exists')

# Download the images
for i, tag in enumerate(img_list):
    # Resolve relative and protocol-relative src values against the page URL
    item = urljoin(url, tag['src'])
    # Only jpg, jpeg, and png images are saved; check before downloading
    ext = next((e for e in ('jpg', 'jpeg', 'png') if item.endswith(e)), None)
    if ext is None:
        print(f'Image {i + 1} has an unsupported format')
        continue
    img = requests.get(item).content
    with open(f'./img/{i}.{ext}', 'wb') as f:
        f.write(img)
    print(f'Image {i + 1} downloaded')
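The crawler decides the file type by looking at how the src value ends, which misses URLs that carry a query string after the file name. One possible refinement, shown only as a sketch, is to take the extension from the URL's path component. The helper name image_extension, the ALLOWED set, and the example.com URLs are all hypothetical and not part of the original crawler.

```python
import os
from urllib.parse import urlparse

# Extensions the crawler is willing to save (assumption: same three as above)
ALLOWED = {'.jpg', '.jpeg', '.png'}


def image_extension(url):
    """Return the lowercase extension of the URL path if allowed, else None."""
    path = urlparse(url).path               # drops ?query and #fragment
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in ALLOWED else None


print(image_extension('https://example.com/a/b.PNG?x=1'))
print(image_extension('https://example.com/logo.svg'))
```

In the download loop, a None result would take the place of the "unsupported format" branch.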

That's all for this article. Go try it out yourself!
