Python Beautiful Soup library: a getting-started and installation tutorial
Installing the Beautiful Soup library
pip install beautifulsoup4
Understanding the Beautiful Soup library
Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" of a document.
Importing the Beautiful Soup library
from bs4 import BeautifulSoup
import bs4
The BeautifulSoup class
A BeautifulSoup object corresponds to the entire content of one HTML/XML document.
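As a small illustration of "one object per document", the sketch below builds a BeautifulSoup object from a string and from an open file handle. The markup and the temporary file are made up for this example; only the `BeautifulSoup(markup, "html.parser")` call comes from the tutorial itself.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
markup = "<html><head><title>demo</title></head><body><p>data</p></body></html>"

# 1. Build the object directly from a string
soup_from_str = BeautifulSoup(markup, "html.parser")

# 2. Build the object from an open file handle
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.html"
    path.write_text(markup, encoding="utf-8")
    with path.open(encoding="utf-8") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser")

print(soup_from_str.title.string)
print(soup_from_file.title.string)
```

Either way, the resulting object represents the whole document, so the same navigation works on both.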
Reviewing demo.html
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)
<html><head><title>This is a python demo page</title></head> <body> <p class="title"><b>The demo python introduces several python courses.</b></p> <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" id="link1">Basic Python</a> and <a class="py2" id="link2">Advanced Python</a>.</p> </body></html>
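The fetch above assumes the request succeeds. A slightly more defensive variant might look like the sketch below; the helper name `fetch_html` and the 10-second timeout are assumptions made for this example, not part of the tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its text, raising on HTTP errors.

    `fetch_html` and the default timeout are illustrative choices.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()               # raise requests.HTTPError on 4xx/5xx
    r.encoding = r.apparent_encoding   # use the encoding guessed from the body
    return r.text


# Usage (performs a real network request, so it is left commented out):
# demo = fetch_html("http://python123.io/ws/demo.html")
# print(demo)
```

Failing early on a bad status code avoids handing an error page to the parser later on.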
The Tag element
| Basic element | Description |
|---|---|
| Tag | A tag, the most basic unit for organizing information; its start and end are marked with <> and </> |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title)
tag = soup.a
print(tag)
<title>This is a python demo page</title>
<a class="py1" id="link1">Basic Python</a>
Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several tags with the same name, soup.<tag> returns the first one.
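The "first match only" behavior can be checked with a short sketch. The inline HTML is modeled on the demo page so that no network request is needed; `find_all` is used here only to contrast with attribute access.

```python
from bs4 import BeautifulSoup

# Inline HTML modeled on the demo page (illustrative only)
html = (
    '<p class="course">Courses: '
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # attribute access: only the first <a>
all_links = soup.find_all("a")   # find_all: every <a>, in document order
link_ids = [a["id"] for a in all_links]
missing = soup.table             # None, because there is no <table> here

print(first_link["id"])
print(link_ids)
print(missing)
```

Because a missing tag yields None rather than an exception, it is worth checking the result before chaining further attribute access.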
The name of a Tag
| Basic element | Description |
|---|---|
| Name | The name of the tag; the name of <p>…</p> is 'p'. Format: <tag>.name |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
a
p
body
The attrs (attributes) of a Tag
| Basic element | Description |
|---|---|
| Attributes | The attributes of the tag, organized as a dictionary. Format: <tag>.attrs |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>
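Indexing `tag.attrs` with a key that is absent raises `KeyError`, just like a plain dictionary. The sketch below, using made-up inline HTML without an `href`, shows the safer `get` lookup and the fact that `class` comes back as a list.

```python
from bs4 import BeautifulSoup

# Illustrative markup: two class values and no href attribute
html = '<a class="py1 external" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

classes = tag.get("class")           # multi-valued attribute, returned as a list
link_id = tag["id"]                  # tag[key] is shorthand for tag.attrs[key]
href = tag.get("href")               # None instead of KeyError
href_or_default = tag.get("href", "#")
has_href = tag.has_attr("href")

print(classes, link_id, href, href_or_default, has_href)
```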
The NavigableString and Comment of a Tag
| Basic element | Description |
|---|---|
| NavigableString | The non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string |
| Comment | The comment part of a string inside a tag; a special Comment type |
import requests
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))
This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>
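Since `.string` prints a comment and ordinary text the same way, code that only wants visible text has to check the type. A minimal sketch follows; the helper `visible_text` is a hypothetical name introduced for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_text(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


b_text = visible_text(soup.b)
p_text = visible_text(soup.p)
# Comment is a subclass of NavigableString, so test for Comment first
comment_is_string = isinstance(soup.b.string, NavigableString)

print(b_text, p_text, comment_is_string)
```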
Basic HTML structure
Downward traversal of the tag tree
| Attribute | Description |
|---|---|
| .contents | A list of child nodes; all children are stored in a list |
| .children | An iterator over child nodes; similar to .contents, used to loop over the children |
| .descendants | An iterator over descendant nodes; covers all descendants, used for loop traversal |
The BeautifulSoup object is the root node of the tag tree
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])
<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" id="link1">Basic Python</a> and <a class="py2" id="link2">Advanced Python</a>.</p>, '\n']
5
<p class="title"><b>The demo python introduces several python courses.</b></p>
for child in soup.body.children:
    print(child)  # iterate over the child nodes
for child in soup.body.descendants:
    print(child)  # iterate over all descendant nodes
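The difference between children and descendants, and the fact that both include text nodes such as '\n', can be seen in a short sketch. The inline HTML is a reduced stand-in for the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Reduced stand-in for the demo page body (illustrative only)
html = "<body>\n<p><b>bold</b></p>\n<p>plain <a>link</a></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.body.children)   # direct children, including '\n' strings
child_tags = [c.name for c in children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(children))
print(child_tags)
print(descendant_tags)
```

Filtering with `isinstance(node, Tag)` is one way to skip the whitespace strings that sit between tags.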
Upward traversal of the tag tree
| Attribute | Description |
|---|---|
| .parent | The parent tag of the node |
| .parents | An iterator over the node's ancestor tags, used to loop over the ancestors |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title.parent)
print(soup.html.parent)
<head><title>This is a python demo page</title></head>
<html><head><title>This is a python demo page</title></head> <body> <p class="title"><b>The demo python introduces several python courses.</b></p> <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" id="link1">Basic Python</a> and <a class="py2" id="link2">Advanced Python</a>.</p> </body></html>
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    # Guard against a missing parent before reading .name
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
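The ancestor chain can also be collected into a list, which makes the end of the chain explicit. The sketch uses a minimal inline document rather than the demo page.

```python
from bs4 import BeautifulSoup

# Minimal inline document (illustrative only)
html = "<html><body><p><a>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

ancestors = [parent.name for parent in soup.a.parents]
root_parent = soup.parent   # the BeautifulSoup object itself has no parent

print(ancestors)
print(root_parent)
```

The last ancestor is the BeautifulSoup object, whose name is '[document]'; its own `.parent` is None.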
Sibling traversal of the tag tree
| Attribute | Description |
|---|---|
| .next_sibling | Returns the next sibling node in HTML text order |
| .previous_sibling | Returns the previous sibling node in HTML text order |
| .next_siblings | An iterator that returns all following sibling nodes in HTML text order |
| .previous_siblings | An iterator that returns all preceding sibling nodes in HTML text order |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)
print(soup.a.parent)
and
<a class="py2" id="link2">Advanced Python</a>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
None
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" id="link1">Basic Python</a> and <a class="py2" id="link2">Advanced Python</a>.</p>
for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over the following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over the preceding siblings
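As the output above suggests, a sibling is not necessarily a tag: plain text between tags counts as a sibling node too. A small sketch on made-up inline HTML shows this, along with `find_next_sibling` for skipping straight to the next tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup modeled on the demo page's course paragraph
html = (
    '<p>Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

following = list(first.next_siblings)       # text, tag, text
preceding = [str(s) for s in first.previous_siblings]
following_tags = [s["id"] for s in following if isinstance(s, Tag)]
next_tag = first.find_next_sibling("a")     # skips the text node in between

print(len(following), preceding, following_tags, next_tag["id"])
```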

The prettify() method of the bs4 library
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
.prettify() adds a '\n' after each <> tag and its content in the HTML text.
.prettify() can also be applied to a single tag. Usage: <tag>.prettify()
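Calling prettify() on a single tag might look like the sketch below. The markup is invented for the example, and the lines are stripped before comparison so the result does not depend on the exact indentation.

```python
from bs4 import BeautifulSoup

# Illustrative one-line markup
soup = BeautifulSoup('<p class="title"><b>Demo</b></p>', "html.parser")

pretty = soup.p.prettify()   # one tag or string per line
lines = [line.strip() for line in pretty.splitlines()]

print(pretty)
print(lines)
```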
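Encoding in the bs4 library
The bs4 library converts any HTML input to Unicode and encodes output as utf-8 by default.
Python 3.x uses utf-8 as its default encoding, so parsing works without obstacles.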
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>中文</p>","html.parser")
print(soup.p.string)
print(soup.p.prettify())
中文
<p>
 中文
</p>
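The conversion can be observed by feeding in bytes that are not utf-8. In the sketch below the input is deliberately encoded as gbk and the encoding is passed through the `from_encoding` argument; choosing gbk is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Bytes in a non-utf-8 encoding (gbk chosen only for illustration)
raw = "<p>中文</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # decoded to a Unicode string
encoded = soup.p.encode("utf-8")   # bytes, re-encoded as utf-8

print(text)
print(encoded)
```

When the source encoding is not known in advance, `from_encoding` can be omitted and Beautiful Soup will try to detect it, although detection on very short inputs is less reliable.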