Basic Usage of the Python lxml Module
This article walks through the basic usage of Python's lxml module with examples.
1 Installing lxml
Install with pip: pip install lxml
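2 Using lxml
2.1 Getting started with lxml
Import the etree library from lxml. (Some editors offer no autocompletion for this import; that does not mean it is unusable.)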
from lxml import etree
Use etree.HTML to convert a string into an Element object. It accepts both bytes and str input. The Element object has an xpath method, which returns a list of results.
html = etree.HTML(text)
ret_list = html.xpath("xpath expression")
To convert an Element object back into a string, use etree.tostring(element), which returns bytes.
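As a minimal sketch of the three calls just described (the `snippet` string and the variable names are made up for illustration), the following shows that etree.HTML accepts both str and bytes, that xpath returns a list, and that tostring returns bytes unless asked otherwise:

```python
from lxml import etree

snippet = "<p>hello</p>"  # illustrative input, not from the article

# etree.HTML accepts both str and bytes input.
from_str = etree.HTML(snippet)
from_bytes = etree.HTML(snippet.encode("utf-8"))

# xpath() returns a list, even when there is a single match.
print(from_str.xpath("//p/text()"))    # ['hello']
print(from_bytes.xpath("//p/text()"))  # ['hello']

# tostring() returns bytes by default; decode it, or request str directly.
raw = etree.tostring(from_str)
as_text = etree.tostring(from_str, encoding="unicode")
print(type(raw))      # <class 'bytes'>
print(type(as_text))  # <class 'str'>
```

Passing encoding="unicode" is an alternative to calling .decode() on the bytes result.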
Suppose we have the following HTML string and want to work with it:
<div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> # note: the closing </li> tag is missing here
</ul> </div>
from lxml import etree
text = ''' <div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
# parse the string into an Element object
html = etree.HTML(text)
print(type(html))
# serialize the parsed tree back to a str to see what lxml produced
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)
Output:
<class 'lxml.etree._Element'>
<html><body><div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul> </div> </body></html>
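As you can see, lxml does fill in the missing tag. Keep in mind, though, that lxml is written by people: when a page is badly formed, or when lxml itself has a bug, an XPath written against the raw response for the URL may still fail to extract anything. In that case, use etree.tostring to see what etree actually turned the HTML into, and write the extraction against that converted HTML string.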
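A small sketch of that debugging step, assuming a deliberately broken fragment (the `broken` string is invented for illustration): serialize the parsed tree, read the repaired structure, and only then write the XPath.

```python
from lxml import etree

broken = "<ul><li>one<li>two</ul>"  # illustrative input: neither <li> is closed
root = etree.HTML(broken)

# method="html" serializes with HTML rules; pretty_print adds indentation,
# which makes the repaired structure easier to read.
repaired = etree.tostring(root, method="html", pretty_print=True, encoding="unicode")
print(repaired)

# Write the XPath against the repaired tree, not against the raw text.
items = root.xpath("//li/text()")
print(items)
```

The printed markup shows where the parser placed the missing closing tags, which is the structure every later xpath call actually runs against.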
2.2 Further practice with lxml
Let's continue. Suppose each li tag with class item-1 is one news item. How do we turn each news item into a dictionary?
from lxml import etree
text = ''' <div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
html = etree.HTML(text)
# get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
# assemble dictionaries by pairing the two lists position by position
for i, href in enumerate(href_list):
    item = {}
    item["href"] = href
    item["title"] = title_list[i]
    print(item)
Output:
{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}
Now suppose one of the news items has no href. What happens?
from lxml import etree
text = ''' <div> <ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
# ... followed by the same extraction code as in the previous example
Result:
{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}
The pairing is completely wrong, which is not what we want. Section 2.3 shows how to solve this.
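To see why the pairing breaks, it helps to compare the lengths of the two lists. This is a reduced sketch that keeps only the three item-1 entries (the `doc` name is an assumption for illustration):

```python
from lxml import etree

doc = etree.HTML("""
<ul>
  <li class="item-1"><a>first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
""")

href_list = doc.xpath("//li[@class='item-1']/a/@href")
title_list = doc.xpath("//li[@class='item-1']/a/text()")

# The first <a> has no href, so the two lists differ in length
# and every later position is shifted by one.
print(len(href_list), len(title_list))   # 2 3
print(list(zip(href_list, title_list)))
# [('link2.html', 'first item'), ('link4.html', 'second item')]
```

Two independent document-wide queries carry no information about which href belongs to which title, so a missing value cannot be detected from the lists alone.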
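2.3 Advanced use of the lxml module
Earlier, when we selected an attribute or text, xpath returned strings. What does it return when we select a node?
It returns Element objects, which support the xpath method in turn. So during extraction we can first group by some tag, and then extract the data within each group.
For example: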
from lxml import etree
text = ''' <div> <ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
html = etree.HTML(text)
# selecting nodes (not attributes or text) returns Element objects
li_list = html.xpath("//li[@class='item-1']")
print(li_list)
Result:
[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]
The result is a list of Element objects, and each of them supports the xpath method in turn.
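One detail matters when calling xpath on such an element: a path starting with "./" is evaluated relative to that element, while a path starting with "//" still searches the whole document. A minimal sketch, using a made-up two-item list:

```python
from lxml import etree

doc = etree.HTML("""
<ul>
  <li class="item-1"><a href="a.html">A</a></li>
  <li class="item-1"><a href="b.html">B</a></li>
</ul>
""")

first_li = doc.xpath("//li[@class='item-1']")[0]

# "./" is relative to first_li: only its own <a> is matched.
print(first_li.xpath("./a/@href"))   # ['a.html']

# "//" starts from the document root, even when called on first_li.
print(first_li.xpath("//a/@href"))   # ['a.html', 'b.html']
```

This is why the grouped extraction uses "./a/..." inside the loop rather than "//a/...".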
First group by the li tags, then extract the data from each group:
from lxml import etree
text = ''' <div> <ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
# group by li tag
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
# extract the data within each group, falling back to None when a value is missing
for li in li_list:
    item = {}
    item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href")) > 0 else None
    item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()")) > 0 else None
    print(item)
Result:
{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}
In the code above, each extraction needs a check, because some pages may not contain the data. A conditional (ternary) expression handles that case.
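The conditional expression above evaluates each XPath twice. One possible refinement, sketched here with a hypothetical helper named `first_or_none` (not part of lxml), runs each query once and keeps the fallback in one place:

```python
from lxml import etree

def first_or_none(element, path):
    """Return the first XPath match, or None when nothing matches.

    Hypothetical helper for illustration; it is not an lxml function.
    """
    matches = element.xpath(path)
    return matches[0] if matches else None

doc = etree.HTML(
    '<ul><li class="item-1"><a>first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li></ul>'
)

# Group by <li>, then extract within each group.
items = [
    {"href": first_or_none(li, "./a/@href"), "title": first_or_none(li, "./a/text()")}
    for li in doc.xpath("//li[@class='item-1']")
]
print(items)
# [{'href': None, 'title': 'first item'}, {'href': 'link2.html', 'title': 'second item'}]
```

The behavior is the same as the ternary version; only the repetition is removed.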

