Python Web Scraping: A Basic Tutorial on Using BeautifulSoup

Updated: 2022-03-29 09:38:17  Author: hacker707

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. This article introduces the basics of using BeautifulSoup for web scraping in Python.

Installing bs4

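The examples in this article parse with lxml, so install lxml first and then bs4 (on PyPI, bs4 is a thin wrapper package that installs beautifulsoup4):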

pip install lxml

pip install bs4

Usage:

from bs4 import BeautifulSoup

Comparing lxml and bs4

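from lxml import etree

html = "<html><head><title>Demo</title></head></html>"  # sample markup
tree = etree.HTML(html)
print(tree.xpath('//title/text()'))  # xpath() requires an XPath expression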

from bs4 import BeautifulSoup

html_doc = "<html><head><title>Demo</title></head></html>"  # sample markup
soup = BeautifulSoup(html_doc, 'lxml')

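Note:

If you create the soup object without passing 'lxml' (or features="lxml"), Beautiful Soup has to guess which parser to use and emits a GuessedAtParserWarning saying that no parser was explicitly specified.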

Quick start with bs4

Parser comparison (for reference only)

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Built into Python; decent speed	Less lenient with malformed documents (in versions before Python 2.7.3 or 3.2.2)
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Very fast; lenient	Requires an external C library
lxml XML parser	BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml')	Very fast; the only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup, 'html5lib')	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package
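
Because lxml and html5lib are optional installs, a script can fall back to another parser when the preferred one is missing. The sketch below is one possible way to do that, not part of the original article: the helper name make_soup and the preference order are assumptions chosen for illustration. It relies on bs4 raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser library is not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print("parser used:", used)
print(soup.p.string)
```

Since html.parser ships with Python, the last entry in the tuple should always be available, so the RuntimeError branch is only a safeguard.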

Object types

Tag: an HTML tag
BeautifulSoup: the object representing the whole document
NavigableString: a navigable string (the text inside a tag)
Comment: a comment

from bs4 import BeautifulSoup

# Build a string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

<span><!--example comment text--></span>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
print(type(soup.span.string))  # <class 'bs4.element.Comment'>
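
Comment is a subclass of NavigableString, so code that collects text nodes will also pick up comments unless it checks for them. A minimal sketch of telling the two apart follows; the markup is made up for illustration, and it uses the built-in html.parser (an assumption, so that it runs without lxml installed).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p>visible text<!-- hidden note --></p>", "html.parser")

for node in soup.p.children:
    # Check Comment first: every Comment is also a NavigableString.
    if isinstance(node, Comment):
        kind = "comment"
    elif isinstance(node, NavigableString):
        kind = "text"
    else:
        kind = "tag"
    print(kind, repr(str(node)))

# Keep only real text, skipping comments.
visible = [str(n) for n in soup.p.children
           if isinstance(n, NavigableString) and not isinstance(n, Comment)]
print(visible)
```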

Basic usage of bs4

Getting tag contents

from bs4 import BeautifulSoup

# Build a string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag:\n', soup.head)  # print the head tag
print('body tag:\n', soup.body)  # print the body tag
print('html tag:\n', soup.html)  # print the html tag
print('p tag:\n', soup.p)  # print the (first) p tag

Note: printing soup.p shows only the first p tag. To get every p tag, use find_all:

print('all p tags:\n', soup.find_all('p'))

Note that the tag name is passed to find_all as a string ('p'). find_all also accepts other kinds of filters, such as a list of names.
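
The difference between the first match and all matches can be sketched on a small fragment. The markup is made up for illustration, and the built-in html.parser is used (an assumption, so the snippet runs without lxml).

```python
from bs4 import BeautifulSoup

html_doc = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("p")      # first match only, the same tag that soup.p gives
every = soup.find_all("p")  # list-like ResultSet containing every match

print(first == soup.p)
print([str(p.string) for p in every])

# When nothing matches, find returns None and find_all returns an empty list.
print(soup.find("table"))
print(soup.find_all("table"))
```

Checking for None after find avoids an AttributeError when a page does not contain the expected tag.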

Getting tag names

Use the name attribute to get a tag's name.

from bs4 import BeautifulSoup

# Build a string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag name:\n', soup.head.name)  # print the head tag's name
print('body tag name:\n', soup.body.name)  # print the body tag's name
print('html tag name:\n', soup.html.name)  # print the html tag's name
# find_all returns a list-like ResultSet, not a single tag, so read .name from each tag
print('p tag names:\n', [p.name for p in soup.find_all('p')])

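To match more than one tag name, pass a list filter rather than separate string arguments.

Passing the names as two separate strings returns an empty list, because the second positional argument of find_all is treated as a CSS class filter, not as another tag name:

print(soup.find_all('title', 'p'))

[]

Use a list filter to match several tag names: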

print(soup.find_all(['title', 'p']))

[<title>The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
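
To make the filter behaviour concrete, here is a small sketch of three ways to call find_all: a name plus a class, a list of names, and a compiled regular expression matched against tag names. The markup is made up for illustration, and the built-in html.parser is used (an assumption, so the snippet runs without lxml).

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>Demo</title></head>
<body>
<p class="title"><b>Heading</b></p>
<p class="story">Body text</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Second positional argument is a CSS class: this means <p class="story">.
by_class = soup.find_all("p", "story")

# A list matches any of the listed tag names, in document order.
by_list = soup.find_all(["title", "b"])

# A compiled regex is searched against each tag name (names starting with "b").
by_regex = soup.find_all(re.compile("^b"))

print([t.name for t in by_class])
print([t.name for t in by_list])
print([t.name for t in by_regex])
```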

Getting the href attribute of a tags

from bs4 import BeautifulSoup

# Build a string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
a_list = soup.find_all('a')
# Loop over the list and read the attribute value
for a in a_list:
    # Method 1: get() returns None if the attribute is missing
    print(a.get('href'))
    # Method 2: attrs is a dict of all attributes; index it by name (KeyError if missing)
    print(a.attrs['href'])
    # Method 3: index the tag directly (KeyError if the attribute is missing)
    print(a['href'])
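
Real pages often contain a tags without an href, so it helps to see how the access styles behave when the attribute is missing. A minimal sketch, with made-up markup and the built-in html.parser (an assumption, so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

html_doc = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'  # this link deliberately has no href
)
soup = BeautifulSoup(html_doc, "html.parser")

hrefs = []
for a in soup.find_all("a"):
    href = a.get("href")           # None when the attribute is missing
    fallback = a.get("href", "#")  # or supply a default value
    print(a["id"], href, fallback, a.has_attr("href"))
    hrefs.append(href)

# Indexing a missing attribute raises KeyError.
try:
    soup.find("a", id="link2")["href"]
except KeyError as exc:
    print("missing attribute:", exc)
```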

Tip: use prettify() to indent the output so the node hierarchy is easier to read.

print(soup.prettify())

Output without prettify:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
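
For comparison, the sketch below prints the same small fragment both ways. The markup is made up for illustration and the built-in html.parser is used (an assumption, so the snippet runs without lxml); the exact indentation of the prettified text may vary between bs4 versions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")

compact = str(soup)       # markup on one line, as parsed
pretty = soup.prettify()  # one node per line, indented by nesting depth

print(compact)
print(pretty)
```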

Traversing the document tree

from bs4 import BeautifulSoup

# Build a string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
# contents returns a list of all direct children: [<title>The Dormouse's story</title>]
print(head.contents)
# children returns an iterator over the direct children, e.g. <list_iterator object at 0x00000264BADC2748>
print(head.children)
# Any iterator can be looped over
for h in head.children:
    print(h)
html = soup.html  # newline strings between tags also count as child nodes
# descendants returns a generator that walks all nested descendants, e.g. <generator object Tag.descendants at 0x0000018C15BFF4C8>
print(html.descendants)
# Any generator can be looped over
for h in html.descendants:
    print(h)

'''
Key points
string: gets the text inside a tag
strings: returns a generator that yields the text of every nested tag
stripped_strings: like strings, but strips surrounding whitespace and skips whitespace-only strings
'''
print(soup.title.string)
print(soup.html.string)  # None, because html has more than one child
# Returns a generator object, e.g. <generator object Tag._all_strings at 0x000001AAFF9EF4C8>
# soup.html.strings yields all the text contained anywhere inside the html tag
print(soup.html.strings)
for h in soup.html.strings:
    print(h)
# stripped_strings removes the extra whitespace
# Returns a generator object, e.g. <generator object PageElement.stripped_strings at 0x000001E31284F4C8>
print(soup.html.stripped_strings)
for h in soup.html.stripped_strings:
    print(h)
'''
parent: gets the direct parent node
parents: gets all ancestor nodes
'''
title = soup.title
# parent finds the direct parent node
print(title.parent)
# parents yields every ancestor node
# Returns a generator object, e.g. <generator object PageElement.parents at 0x000001F02049F4C8>
print(title.parents)
for p in title.parents:
    print(p)
# The parent of the html tag is the whole document
print(soup.html.parent)
# <class 'bs4.BeautifulSoup'>
print(type(soup.html.parent))
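
The output of the script above is long, so the same properties are easier to compare on a one-line fragment. This is a minimal sketch with made-up markup, using the built-in html.parser (an assumption, so the snippet runs without lxml).

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>Demo</title></head>"
    "<body><p>  Hello <b>world</b> ! </p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.p

print(p.b.string)                 # world (the tag has exactly one string child)
print(p.string)                   # None (the p tag has several children)
print(list(p.strings))            # every string, whitespace kept
print(list(p.stripped_strings))   # ['Hello', 'world', '!']

# Walking upward from title: head, then html, then the document itself.
print([parent.name for parent in soup.title.parents])
```

The last line shows that the BeautifulSoup object sits at the top of the ancestor chain under the special name [document].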

Practice example

Extract all job titles


Approach

The data we want is in the a tag inside each tr node of the job table, so we loop over every tr node and read the text of its a tag.

Implementation

from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">Job title</td>
            <td>Category</td>
            <td>Headcount</td>
            <td>Location</td>
            <td>Posted</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-Financial Cloud Blockchain Senior R&amp;D Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-Financial Cloud Senior Backend Developer</a></td>
            <td>Technical</td>
            <td>2</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-Tencent Music Operations Development Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>2</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-Tencent Music Business Operations Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-Senior R&amp;D Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-Senior Image Algorithm R&amp;D Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-Senior AI Development Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>4</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-Backend Development Engineer</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-Backend Development Engineer</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-Senior Business Operations Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Use find_all() to get every tr node (the first tr is the table header, so skip it)
tr_list = soup.find_all('tr')[1:]
# Loop over tr_list and print the text of the a tag in each row
for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list[0].string)

The output is:

22989-Financial Cloud Blockchain Senior R&D Engineer (Shenzhen)
22989-Financial Cloud Senior Backend Developer
SNG16-Tencent Music Operations Development Engineer (Shenzhen)
SNG16-Tencent Music Business Operations Engineer (Shenzhen)
TEG03-Senior R&D Engineer (Shenzhen)
TEG03-Senior Image Algorithm R&D Engineer (Shenzhen)
TEG11-Senior AI Development Engineer (Shenzhen)
15851-Backend Development Engineer
15851-Backend Development Engineer
SNG11-Senior Business Operations Engineer (Shenzhen)
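
Slicing off the first row works here because the header is known to come first. A slightly more defensive variant, sketched below, skips any row that has no a tag and also collects each link's href. The two-row sample table is hypothetical (it only mimics the shape of the job table), and the built-in html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with the same shape as the job table.
sample = """
<table class="tablelist">
  <tr class="h"><td>Job title</td><td>Location</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Engineer A</a></td><td>Shenzhen</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Engineer B</a></td><td>Shenzhen</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

jobs = []
for tr in soup.find_all("tr"):
    a = tr.find("a")
    if a is None:  # the header row has no link, so skip it
        continue
    jobs.append((str(a.string), a.get("href")))

for title, href in jobs:
    print(title, href)
```

Checking for None instead of relying on row positions keeps the loop working if the page later adds rows without links, such as a footer.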

Summary

That concludes this introduction to the basics of using BeautifulSoup for web scraping in Python.
