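Python Web Scraping: A Basic Tutorial on BeautifulSoup
Installing bs4
To use BeautifulSoup4 as shown in this tutorial, install lxml (the parser used in the examples) first, then install bs4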
pip install lxml
pip install bs4
Usage:
from bs4 import BeautifulSoup
Comparing lxml and bs4
from lxml import etree
tree = etree.HTML(html)
tree.xpath()  # pass an XPath expression here
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
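To make the comparison concrete, here is a minimal sketch that extracts the same link texts and href values once with lxml's XPath and once with bs4. The tiny `html` string and the variable names are made up for illustration; both libraries are assumed to be installed.

```python
from lxml import etree
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this comparison
html = (
    '<html><body>'
    '<a href="/elsie" id="link1">Elsie</a>'
    '<a href="/lacie" id="link2">Lacie</a>'
    '</body></html>'
)

# lxml: build an element tree and query it with XPath expressions
tree = etree.HTML(html)
xpath_texts = tree.xpath('//a/text()')
xpath_hrefs = tree.xpath('//a/@href')

# bs4: build a soup object and navigate it with find_all
soup = BeautifulSoup(html, 'lxml')
bs_texts = [a.string for a in soup.find_all('a')]
bs_hrefs = [a['href'] for a in soup.find_all('a')]

print(xpath_texts, xpath_hrefs)
print(bs_texts, bs_hrefs)
```

Both approaches yield the same values here; XPath packs the query into one expression, while bs4 walks Python objects.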
Note:
If you create the soup object without passing 'lxml' (or features="lxml"), bs4 guesses a parser on its own and emits a warning that no parser was explicitly specified.
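The behaviour can be checked with the standard `warnings` module. The sketch below records warnings while building a soup with and without a parser name. It uses 'html.parser' rather than 'lxml' only so that it runs without extra packages; the sample markup is an assumption.

```python
import warnings
from bs4 import BeautifulSoup

html = "<p>hello</p>"  # hypothetical sample markup

# No parser named: bs4 has to guess one and warns about it
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup(html)
    guessed_count = len(caught)

# Parser named explicitly: no warning is expected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup(html, "html.parser")
    explicit_count = len(caught)

print("warnings without a parser:", guessed_count)
print("warnings with a parser:", explicit_count)
```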

bs4 quick start
Parser comparison (for reference only)
| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, 'html.parser') | Part of the Python standard library; moderate speed | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast; tolerant of malformed documents | Requires an external C library |
| lxml XML parser | BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml') | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | BeautifulSoup(markup, 'html5lib') | Most tolerant; parses documents the way a browser does; produces HTML5-format documents | Slow; requires an external Python package |
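The difference in tolerance shows up when the markup is broken. The following sketch feeds the same malformed fragment to each HTML parser and skips any parser that is not installed. The fragment and the `results` dictionary are illustrative choices.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # malformed on purpose: a stray closing tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing
        results[parser] = None

for parser, markup in results.items():
    print(f"{parser:12} {markup}")
```

Each parser repairs the fragment in its own way, so code that depends on the exact tree shape should always name its parser.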
Kinds of objects
Tag: a tag
BeautifulSoup: the bs object (the whole parsed document)
NavigableString: a navigable string
Comment: a comment
from bs4 import BeautifulSoup
# Build a string that simulates HTML source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--example comment content--></span>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
print(type(soup.span.string))  # <class 'bs4.element.Comment'>
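Because Comment is a subclass of NavigableString, a scraper usually wants to tell the two apart so that comments do not leak into extracted text. A minimal sketch, with made-up sample markup and 'html.parser' chosen only to avoid extra dependencies:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical sample markup: one visible string and one comment
soup = BeautifulSoup("<p>visible text<!--a hidden note--></p>", "html.parser")

# Walk every node and separate comments from ordinary strings
comments = [node for node in soup.descendants if isinstance(node, Comment)]
texts = [
    node for node in soup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]

print("comments:", comments)
print("texts:", texts)
```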
Basic usage of bs4
Getting tag content
from bs4 import BeautifulSoup
# Build a string that simulates HTML source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag:\n', soup.head)  # print the head tag
print('body tag:\n', soup.body)  # print the body tag
print('html tag:\n', soup.html)  # print the html tag
print('p tag:\n', soup.p)  # print the p tag
Note: printing soup.p shows only the first p tag. To get every p tag, use find_all:
print('p tags:\n', soup.find_all('p'))
Note that the tag name is passed to find_all as a string ('p').
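The difference between attribute-style access, find, and find_all can be sketched as follows. The three-paragraph markup is a made-up example, and 'html.parser' is used only so the snippet has no extra dependencies.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"  # hypothetical sample
soup = BeautifulSoup(html, "html.parser")

first = soup.p                        # first matching tag only
same_first = soup.find("p")           # equivalent to soup.p
all_p = soup.find_all("p")            # list-like ResultSet of every match
two_p = soup.find_all("p", limit=2)   # stop after two matches

missing = soup.find("table")          # no match: find returns None
missing_all = soup.find_all("table")  # no match: find_all returns an empty list

print(first, len(all_p), len(two_p), missing, len(missing_all))
```

Checking for None after find (or for an empty list after find_all) avoids an AttributeError when a page lacks the expected tag.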
Getting tag names
Use the name attribute to get a tag's name
from bs4 import BeautifulSoup
# Build a string that simulates HTML source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""
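# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag name:\n', soup.head.name)  # print the head tag's name
print('body tag name:\n', soup.body.name)  # print the body tag's name
print('html tag name:\n', soup.html.name)  # print the html tag's name
# find_all returns a list of tags, so read .name from each element
print('p tag names:\n', [p.name for p in soup.find_all('p')])  # print the p tags' names
To find more than one kind of tag, pass a list filter rather than a string filter
Passing the tag names as separate strings returns an empty list, because the second positional argument of find_all is treated as an attribute (CSS class) filter, not as another tag name
print(soup.find_all('title', 'p'))
[]
Use a list filter to get several kinds of tags
print(soup.find_all(['title', 'p']))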
[<title>The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
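Besides strings and lists, find_all also accepts regular expressions, a class filter, and functions. The sketch below uses a shortened, made-up version of the sample document and 'html.parser'. It also shows why a second positional string is read as a CSS class.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document
html = """
<html><head><title>Demo</title></head>
<body>
<p class="title"><b>Heading</b></p>
<p class="story">Text with <a class="sister" id="link1">Elsie</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Regular-expression filter: every tag whose name starts with "b"
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Class filter: p tags whose class is "story"
story = soup.find_all("p", class_="story")

# A second positional string is treated the same way, as a CSS class
also_story = soup.find_all("p", "story")

# Function filter: every tag that carries an id attribute
with_id = soup.find_all(lambda tag: tag.has_attr("id"))

print(b_names, len(story), len(also_story), [t.name for t in with_id])
```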
Getting the href attribute of a tags
from bs4 import BeautifulSoup
# Build a string that simulates HTML source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
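<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
a_list = soup.find_all('a')
# Loop over the list and read the attribute value
for a in a_list:
    # Method 1: get() returns None if the attribute is missing
    print(a.get('href'))
    # Method 2: attrs is a dict of all attributes; index it for the one you want (KeyError if missing)
    print(a.attrs['href'])
    # Method 3: index the tag directly; raises KeyError if the attribute is missing
    print(a['href'])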
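Real pages often contain a tags without an href, so it helps to see how the access styles behave when the attribute is missing. A minimal sketch with made-up markup in which the second link has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute
html = '<a href="http://example.com/elsie" id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

# get() accepts a default, like dict.get()
hrefs = [a.get("href", "(no href)") for a in soup.find_all("a")]

# has_attr() lets you filter before indexing
with_href = [a for a in soup.find_all("a") if a.has_attr("href")]

# Direct indexing raises KeyError when the attribute is absent
try:
    soup.find(id="link2")["href"]
    raised = False
except KeyError:
    raised = True

print(hrefs, len(with_href), raised)
```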
Extra: use prettify() to pretty-print the markup so that the node hierarchy is clearer and easier to analyze
print(soup.prettify())
The markup without prettify()
<html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" id="link1">Elsie</a>, <a class="sister" id="link2">Lacie</a> and <a class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> </body></html>
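For contrast, this small sketch prints the same fragment in compact and prettified form. The fragment is a made-up example, parsed with 'html.parser' so that no html or body wrapper is added.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")  # hypothetical fragment

compact = str(soup)       # markup on a single line
pretty = soup.prettify()  # one node per line, indented by nesting depth

print(compact)
print(pretty)
```

prettify() adds whitespace for readability, so use it for inspection rather than for producing markup to be parsed again.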
Traversing the document tree
from bs4 import BeautifulSoup
# Build a string that simulates HTML source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
# contents returns a list of all child nodes: [<title>The Dormouse's story</title>]
print(head.contents)
# children returns an iterator over the child nodes: <list_iterator object at 0x00000264BADC2748>
print(head.children)
# Any iterator can be looped over
for h in head.children:
    print(h)
html = soup.html  # newlines are matched as child nodes too
# descendants returns a generator that walks all descendants: <generator object Tag.descendants at 0x0000018C15BFF4C8>
print(html.descendants)
# Any generator can be looped over
for h in html.descendants:
    print(h)
'''
Key points to master
string            gets the content inside a tag
strings           returns a generator, used to get the content of multiple tags
stripped_strings  much like strings, but strips surplus whitespace
'''
print(soup.title.string)
print(soup.html.string)  # None, because html has more than one child node
# Returns a generator object: <generator object Tag._all_strings at 0x000001AAFF9EF4C8>
# soup.html.strings yields all the text contained in the html tag
print(soup.html.strings)
for h in soup.html.strings:
    print(h)
# stripped_strings strips the surplus whitespace
# Returns a generator object: <generator object PageElement.stripped_strings at 0x000001E31284F4C8>
print(soup.html.stripped_strings)
for h in soup.html.stripped_strings:
    print(h)
'''
parent   gets the direct parent node
parents  gets all ancestor nodes
'''
title = soup.title
# parent finds the direct parent node
print(title.parent)
# parents yields all ancestor nodes
# Returns a generator object: <generator object PageElement.parents at 0x000001F02049F4C8>
print(title.parents)
for p in title.parents:
    print(p)
# The parent of html is the whole document
print(soup.html.parent)
# <class 'bs4.BeautifulSoup'>
print(type(soup.html.parent))
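A compact way to see how string, get_text(), stripped_strings, and parents differ is to run them on a tag that has several children. The markup below is a made-up example, parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the p tag has three children (text, <b>, text)
html = "<html><head><title>Story</title></head><body><p> Hello <b>world</b> </p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_string = soup.title.string            # single child, so the string itself
p_string = soup.p.string                    # several children, so None
p_text = soup.p.get_text()                  # all nested text joined together
p_parts = list(soup.p.stripped_strings)     # each piece stripped, blank pieces dropped

# Names of every ancestor of title, from nearest to the document root
ancestors = [parent.name for parent in soup.title.parents]

print(repr(title_string), p_string, repr(p_text), p_parts, ancestors)
```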
Practice exercise
Get all position names
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">Position name</td>
<td>Category</td>
<td>Headcount</td>
<td>Location</td>
<td>Posted</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-Financial Cloud Blockchain Senior Research and Development Engineer (Shenzhen)</a></td>
<td>Technical</td>
<td>1</td>
<td>Shenzhen</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-Financial Cloud Senior Backend Developer</a></td>
<td>Technical</td>
<td>2</td>
<td>Shenzhen</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-Tencent Music Operations Development Engineer (Shenzhen)</a></td>
<td>Technical</td>
<td>2</td>
<td>Shenzhen</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-Tencent Music Business Operations and Maintenance Engineer (Shenzhen)</a></td>
<td>Technical</td>
<td>1</td>
<td>Shenzhen</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-Senior Research and Development Engineer (Shenzhen)</a></td>
<td>Technical</td>
<td>1</td>
<td>Shenzhen</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-Senior Image Algorithm Research and Development Engineer (Shenzhen)</a></td>
<td>Technical</td>
<td>1</td>
<td>Shenzhen</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-Senior AI Development Engineer (Shenzhen)</a></td>
<td>Technical</td>
<td>4</td>
<td>Shenzhen</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-Backend Development Engineer</a></td>
<td>Technical</td>
<td>1</td>
<td>Shenzhen</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-Backend Development Engineer</a></td>
<td>Technical</td>
<td>1</td>
<td>Shenzhen</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-Senior Business Operations and Maintenance Engineer (Shenzhen)</a></td>
<td>Technical</td>
<td>1</td>
<td>Shenzhen</td>
<td>2017-11-24</td>
</tr>
</tbody>
</table>
"""
Approach
The data we want sits in the a tag of each tr node, so we only need to loop over all the tr nodes and read the text of the a tag in each one
Implementation
from bs4 import BeautifulSoup
# html is the table markup string defined in the exercise above
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Use find_all() to get every tr node (the first tr is the table header, so skip it)
tr_list = soup.find_all('tr')[1:]
# Loop over tr_list and read the text inside each a tag
for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list[0].string)
Output:
22989-Financial Cloud Blockchain Senior Research and Development Engineer (Shenzhen)
22989-Financial Cloud Senior Backend Developer
SNG16-Tencent Music Operations Development Engineer (Shenzhen)
SNG16-Tencent Music Business Operations and Maintenance Engineer (Shenzhen)
TEG03-Senior Research and Development Engineer (Shenzhen)
TEG03-Senior Image Algorithm Research and Development Engineer (Shenzhen)
TEG11-Senior AI Development Engineer (Shenzhen)
15851-Backend Development Engineer
15851-Backend Development Engineer
SNG11-Senior Business Operations and Maintenance Engineer (Shenzhen)
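The same loop can be extended to collect every column of a row instead of only the position name. The sketch below is one possible variation, not part of the original exercise: it uses a shortened, made-up table, 'html.parser', and a hypothetical `jobs` list of dictionaries keyed by the header cells. It also skips rows that have no a tag rather than failing on them.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical table with the same layout as the exercise
html = """
<table class="tablelist">
<tr class="h"><td>Position</td><td>Category</td><td>Headcount</td><td>Location</td><td>Posted</td></tr>
<tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td><td>Technical</td><td>2</td><td>Shenzhen</td><td>2017-11-25</td></tr>
<tr class="odd"><td><a href="position_detail.php?id=2">Data Engineer</a></td><td>Technical</td><td>1</td><td>Shenzhen</td><td>2017-11-24</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")
# The first row holds the column names
header = [td.get_text(strip=True) for td in rows[0].find_all("td")]

jobs = []
for tr in rows[1:]:
    link = tr.find("a")
    if link is None:
        continue  # skip rows without a link instead of raising an error
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    job = dict(zip(header, cells))   # pair each cell with its column name
    job["href"] = link.get("href")   # keep the detail-page link as well
    jobs.append(job)

for job in jobs:
    print(job)
```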