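An Introduction to the Python Scraping Library BeautifulSoup, with Simple Usage Examples
1. Introduction
BeautifulSoup is a flexible, convenient HTML parsing library. It is efficient and supports several parsers, and it lets you extract information from web pages without writing regular expressions.
Parsers commonly used with BeautifulSoup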
| Parser | Usage | Advantages | Disadvantages |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; requires an external Python dependency |
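Because lxml and html5lib are separate installs, a script may want to fall back to the built-in parser when the preferred one is missing. The sketch below is one way to do that; make_soup is a hypothetical helper written for this example, not part of bs4. It relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else fall back to html.parser.

    make_soup is a hypothetical helper for this sketch, not a bs4 function.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

Keep in mind that different parsers can build slightly different trees from malformed markup, so a fallback like this is best suited to well-formed input.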
2. Quick start
Given an HTML document, create a BeautifulSoup object:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
Print the whole document:
print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
Navigating the structured data:
print(soup.title)              # the <title> tag and its contents
print(soup.title.name)         # the tag's name
print(soup.title.string)       # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)                  # the first <p> tag
print(soup.p['class'])         # the class of the first <p> tag
print(soup.a)                  # the first <a> tag
print(soup.find_all('a'))      # all <a> tags
print(soup.find(id="link3"))   # the first tag with id="link3"
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
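The loop above uses link.get('href') rather than link['href']. The difference matters when a tag lacks the attribute, as the following sketch tries to show. The snippet string is made up for illustration, and html.parser is used only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link deliberately has no href.
snippet = '<a href="http://example.com/elsie" id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(snippet, "html.parser")

# .get() returns None for a missing attribute instead of raising.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)

# Square-bracket access raises KeyError when the attribute is absent.
missing_raises = False
try:
    soup.find(id="link2")["href"]
except KeyError:
    missing_raises = True
print(missing_raises)
```

When scraping real pages, where attributes are often absent, .get() is usually the safer choice.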
Get all the text:
print(soup.get_text())
The Dormouse's story The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...
Auto-completing tags and formatting
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())              # format the markup and auto-complete missing tags
print(soup.title.string)            # get the contents of the title tag
Tag selectors
Selecting elements
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)                   # selects the title tag
print(type(soup.title))             # check its type
print(soup.head)
Getting a tag's name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)
Getting a tag's attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])         # the value of the name attribute on the first p tag
print(soup.p['name'])               # a more direct way to write the same lookup
Getting a tag's content
print(soup.p.string)
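.string works above because the first p tag wraps a single b tag that holds a single string. The sketch below, with a shortened snippet made up for this example, illustrates two cases that tend to surprise: a tag with several children gives None, and a tag whose only child is an HTML comment gives a Comment object rather than ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical, shortened version of the sample document.
snippet = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a id="link1"><!-- Elsie --></a> and <a id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(snippet, "html.parser")

title_text = soup.p.string               # single nested child: the string is found
story = soup.find("p", class_="story")
story_text = story.string                # several children: None
link_text = soup.a.string                # only child is a comment: a Comment object

print(repr(title_text))
print(story_text)
print(isinstance(link_text, Comment), repr(link_text))
```

When a tag may contain mixed content, get_text() is usually the more predictable way to read its text.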
Nested tag selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)
Child nodes and descendant nodes
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)              # the tag's direct children, as a list
Another approach, children:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)              # an iterator over the tag's direct children
for i, child in enumerate(soup.p.children):  # i receives the index, child the node
    print(i, child)
The output is the same as above, with an index added. Note that .children returns only an iterator object, so you have to iterate over it (for example with a loop) to see the child nodes.
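As a small illustration of how the iterator relates to the list form, the sketch below materializes .children with list() and compares it with .contents. It also counts .descendants, which walks into nested tags. The snippet is a shortened, made-up version of the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample paragraph.
snippet = '<p class="story">Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # children, grandchildren, and so on

print(children == soup.p.contents)
print(len(children), len(descendants))
```

Here the direct children are two strings and two a tags, while the descendants also include the span and the strings nested inside the links.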
Getting descendant nodes:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)           # an iterator over the tag's descendant nodes
for i, child in enumerate(soup.p.descendants):  # i receives the index, child the node
    print(i, child)
Parent and ancestor nodes
parent
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)                # the tag's direct parent
parents
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # all ancestors of the tag
Sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))      # the siblings after the tag
print(list(enumerate(soup.a.previous_siblings)))  # the siblings before the tag
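The sibling iterators return text nodes as well as tags, so the lists printed above include the strings between the links. If only the tags are wanted, one option is to filter by type, as in this sketch (the snippet is made up for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened version of the sample paragraph.
snippet = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>; the end.</p>'
soup = BeautifulSoup(snippet, "html.parser")

all_siblings = list(soup.a.next_siblings)   # strings and tags alike
tag_siblings = [node for node in all_siblings if isinstance(node, Tag)]

print(len(all_siblings))
print([tag["id"] for tag in tag_siblings])
```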
Standard selectors
find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content.
name
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # find all ul tags
print(type(soup.find_all('ul')[0]))  # check the type of a result
The following example finds the li tags under every ul tag:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
attrs (attributes)
Find elements by their attributes:
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))
Both calls find the same content, because the two attributes are on the same tag.
Searching with special keyword arguments:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))       # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
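One detail worth knowing about class_ is that class is a multi-valued attribute: a tag with class="list list-small" matches a search for either class, and reading the attribute gives a list. A minimal sketch, with a shortened snippet made up for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample lists.
snippet = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

both = [ul["id"] for ul in soup.find_all("ul", class_="list")]
small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
classes = soup.find(id="list-2")["class"]   # multi-valued attribute comes back as a list

print(both, small, classes)
```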
text
Select by text content (recent Beautiful Soup releases spell this argument string; text is the older name):
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
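print(soup.find_all(text='Foo'))  # finds text equal to "Foo"; returns the strings, not the tags
So text is handy for matching content, but less convenient for locating elements, because it returns the matched strings rather than the tags that contain them.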
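If the enclosing tags are what you need, each matched string still knows its parent. The sketch below walks from the strings back to their tags and also shows a regular expression match. It assumes a Beautiful Soup version that accepts the string= spelling (4.4.0 or later); the snippet is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample lists.
snippet = """
<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul id="list-2"><li class="element">Foo</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

# string= is the newer spelling of the text= argument (assumes bs4 >= 4.4.0).
matches = soup.find_all(string="Foo")
parents = [s.parent for s in matches]            # the li tags holding each string
located = [(p.name, p.parent["id"]) for p in parents]
print(located)

# A compiled regular expression matches partial text.
starts_with_ba = soup.find_all(string=re.compile(r"^Ba"))
print(starts_with_ba)
```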
Methods
find
find is used exactly like find_all, but it returns only the first match.
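A short sketch of the difference: find gives back the same element as the first entry of find_all, and it returns None (not an empty list) when nothing matches, so the result should be checked before use. The snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample list.
snippet = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(snippet, "html.parser")

first_li = soup.find("li")
same_li = soup.find_all("li", limit=1)[0]   # limit=1 stops after the first match
missing = soup.find("table")                # no match: None, not an empty list

print(first_li.string, first_li is same_li)

# Guard against None before calling methods on the result.
table_text = missing.get_text() if missing is not None else "(no table)"
print(table_text)
```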
find_parents(), find_parent()
find_parents() returns all matching ancestor nodes; find_parent() returns the direct parent.
find_next_siblings(), find_next_sibling()
find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first one.
find_previous_siblings(), find_previous_sibling()
find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the first one.
find_all_next(), find_next()
find_all_next() returns all matching nodes after the node; find_next() returns the first one.
find_all_previous(), find_previous()
find_all_previous() returns all matching nodes before the node; find_previous() returns the first one.
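To make these methods concrete, the sketch below starts from the "Bar" item of the first list and calls several of them in turn. The snippet is a shortened, made-up version of the sample document, and html.parser is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample lists.
snippet = """
<div class="panel-body">
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul class="list list-small" id="list-2"><li>Foo</li><li>Bar</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")
bar = soup.find("li", string="Bar")   # the "Bar" item in the first list

parent_id = bar.find_parent("ul")["id"]                        # nearest matching ancestor
ancestors = [t.name for t in bar.find_parents(["ul", "div"])]  # nearest first
next_one = bar.find_next_sibling("li").string
previous_one = bar.find_previous_sibling("li").string
later_items = [li.string for li in bar.find_all_next("li")]    # crosses into the second list
earlier_item = bar.find_previous("li").string

print(parent_id, ancestors)
print(previous_one, next_one)
print(later_items, earlier_item)
```

The sibling methods stay within the same parent, while find_all_next() and find_previous() follow document order and so can cross into other parts of the tree.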
CSS selectors
Pass a CSS selector straight to select() to make the selection:
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . means class; the space selects descendants
print(soup.select('ul li'))                  # li tags under ul tags
print(soup.select('#list-2 .element'))       # # means id: elements with class "element" under the tag whose id is "list-2"
print(type(soup.select('ul')[0]))            # print the node type
Now a nested selection, one level inside another:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
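A few more selector forms may be useful alongside the ones above: select_one() for a single result, the > child combinator, chained classes, and attribute selectors. This is a sketch under the assumption that the installed Beautiful Soup ships with its usual CSS selector backend; the snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample lists.
snippet = """
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_item = soup.select_one("#list-1 > li")        # first match, or None
small_items = soup.select("ul.list.list-small li")  # the ul must carry both classes
by_attribute = soup.select('ul[id="list-2"]')       # attribute selector
nothing = soup.select_one("table")                  # no match gives None

print(first_item.string)
print([li.string for li in small_items])
print(by_attribute[0]["id"], nothing)
```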
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # use [ ] to get an attribute
    print(ul.attrs['id'])  # another way to write it
Getting content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
The get_text() method returns the content.
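get_text() also accepts a separator and a strip flag, which helps when the text carries stray whitespace or is split across nested tags. A minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with padding and a nested tag.
snippet = """
<ul id="list-1">
  <li class="element"> Foo </li>
  <li class="element">Bar<b>bold</b></li>
</ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

raw = [li.get_text() for li in soup.select("li")]
# Strip each piece of text and join the pieces with a single space.
clean = [li.get_text(" ", strip=True) for li in soup.select("li")]

print(raw)
print(clean)
```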
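Summary
Prefer the lxml parser; fall back to html.parser when necessary.
Tag selectors offer only weak filtering, but they are fast.
Use find() and find_all() to match a single result or multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways to get attribute values and text.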