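How to Use the Requests Library for Web Scraping

Updated: 2022-11-18 10:10:19 Author: Mr.Bean-Pig

This article explains how to use the Requests library for web scraping. I hope it serves as a useful reference; if you spot mistakes or omissions, corrections are welcome.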

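Requests is an HTTP library written in Python, built on urllib3, and released under the Apache 2.0 open-source license.

It is more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs.

Installation:

pip3 install requests

Usage

Example: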

import requests

response=requests.get('https://www.baidu.com')
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)

HTTP request methods

import requests

requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

After running the commands above, you can check the results against this site:

http://httpbin.org can serve as a test site; it echoes back information about the requests we send. For example, it reports the IP address the request came from.

Basic GET requests

Basic usage

import requests

response = requests.get('http://httpbin.org/get')  # send a GET request and get the response
print(response.text)  # view the response body as text

GET with parameters

import requests

# Append the parameters to the URL after a question mark, separated by &
response = requests.get('http://httpbin.org/get?name=zhuzhu&age=23')
print(response.text)

The args field in the response contains our GET parameters. Building the query string by hand is inconvenient, though, so consider this approach instead:

import requests

data={
	'name':'zhuzhu',
	'age':23
}
response=requests.get('http://httpbin.org/get',params=data)
# Pass a dict as the params argument; Requests URL-encodes it for you
print(response.text)

The result is the same as with the previous approach, but this is much more convenient.
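To see what Requests does with the params dict without sending anything, one option is to build the request and inspect the prepared URL. This is a minimal sketch that relies only on requests.Request and its prepare() method; the tag parameter is a made-up example showing how a list value is expanded.

```python
import requests

# Build the request object without sending it over the network.
data = {'name': 'zhuzhu', 'age': 23}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(prepared.url)

# A list value becomes a repeated key (the 'tag' parameter is hypothetical).
multi = requests.Request('GET', 'http://httpbin.org/get', params={'tag': ['a', 'b']}).prepare()
print(multi.url)
```

Inspecting the prepared URL this way is a convenient check before pointing a scraper at a real site.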

Parsing JSON

import requests

response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())  # parse the JSON response body into a Python object
print(type(response.json()))

This method is commonly used when handling responses to AJAX requests.
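For a JSON body, response.json() is roughly equivalent to calling json.loads on response.text, and it raises an exception when the body is not valid JSON. The sketch below illustrates that with the standard library only; SAMPLE_BODY is a hard-coded stand-in shaped like an httpbin reply (an assumption, so the code runs offline), and parse_body is a hypothetical helper name.

```python
import json

# Hard-coded sample shaped like an httpbin.org/get reply (an assumption for offline use).
SAMPLE_BODY = '{"args": {"name": "zhuzhu", "age": "23"}, "url": "http://httpbin.org/get"}'


def parse_body(text):
    """Return the decoded JSON value, or None when the body is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


parsed = parse_body(SAMPLE_BODY)
print(type(parsed), parsed['args'])
print(parse_body('<html>not json</html>'))
```

Returning None for a non-JSON body is just one possible policy; a scraper might prefer to log the body and re-raise.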

Getting binary data

This is commonly used when downloading content such as images and videos.

Let's try it by fetching the GitHub favicon:

import requests

response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))
print(response.text)
print(response.content)  # the content attribute holds the raw bytes

In the response, text is a str, while content is bytes, that is, the raw binary form.

How do we save this icon locally? We already know how to get its binary content, so all that remains is to write it to a file:

import requests

response = requests.get("https://github.com/favicon.ico")
# The with statement closes the file automatically, so no explicit close() is needed
with open('favicon.ico', 'wb') as f:
    f.write(response.content)

The file is saved in the directory the script was run from.
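For larger files such as videos, holding the whole body in response.content may use a lot of memory. One common alternative is to stream the response and write it chunk by chunk. The sketch below assumes the stream=True option and iter_content method of Requests; save_chunks and download are hypothetical helper names, and only the offline helper is actually run here.

```python
import os
import tempfile

import requests


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the byte count."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download(url, path, chunk_size=8192):
    """Stream url to path without loading the whole body into memory."""
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        return save_chunks(response.iter_content(chunk_size=chunk_size), path)


# Offline demonstration with fake chunks; download() is not called here.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
written = save_chunks([b'abc', b'', b'defg'], demo_path)
print(written)
```

A call such as download("https://github.com/favicon.ico", "favicon.ico") would apply the same logic to a real response.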

Adding headers

Headers matter a great deal in scraping. If a request is sent without suitable headers, it may be blocked or trigger a server error.

For example, suppose we want to scrape data from Zhihu without setting any headers:

import requests

response = requests.get("https://www.zhihu.com/explore")
print(response.text)

The request fails, because Zhihu checks the browser information sent with it.

Now let's add headers to pose as a browser; all we need to do is pass the headers argument to get():

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

This time the response is returned successfully.
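The reason a bare request is easy to spot is that Requests announces itself in the User-Agent header by default. The sketch below prints that default and shows one way to avoid repeating the headers argument: setting the header once on a Session. It only prepares the request and sends nothing.

```python
import requests
from requests.utils import default_user_agent

# Without custom headers, Requests identifies itself as python-requests/<version>.
print(default_user_agent())

UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')

# Headers set on a Session are merged into every request it prepares.
session = requests.Session()
session.headers.update({'User-Agent': UA})
prepared = session.prepare_request(requests.Request('GET', 'https://www.zhihu.com/explore'))
print(prepared.headers['User-Agent'])
```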

Basic POST requests

Build a dict and pass it as data to send a POST request. No manual encoding step is needed, which is much more convenient than urllib:

import requests

data = {'name': 'zhuzhu', 'age': '23'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)

Now add headers:

import requests

data = {'name': 'zhuzhu', 'age': '23'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.json())

The JSON response shows that the data and headers were sent successfully.

Summary: GET and POST requests are both easy to use; the only difference is which method you call.
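Many APIs expect a JSON body rather than form data. As a sketch of the difference, the snippet below prepares the same payload twice, once with data= and once with json=, and prints the resulting Content-Type and body. Nothing is sent; the requests are only prepared.

```python
import json

import requests

payload = {'name': 'zhuzhu', 'age': '23'}
url = 'http://httpbin.org/post'

# data= produces a form-encoded body.
form = requests.Request('POST', url, data=payload).prepare()
print(form.headers['Content-Type'], form.body)

# json= serializes the dict and sets a JSON content type.
as_json = requests.Request('POST', url, json=payload).prepare()
print(as_json.headers['Content-Type'], json.loads(as_json.body))
```

With httpbin, the first form shows up under the form key of the reply and the second under json.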

Response

Response attributes

Commonly used response attributes are listed below:

import requests

response = requests.get("http://www.jianshu.com")
print(type(response.status_code), response.status_code)  # status code
print(type(response.headers), response.headers)  # response headers
print(type(response.cookies), response.cookies)  # cookies
print(type(response.url), response.url)  # final URL
print(type(response.history), response.history)  # redirect history

Checking the status code

Common HTTP status codes and their names in Requests:

# Informational.
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),

Example:

import requests

response = requests.get("http://www.baidu.com")
if response.status_code != 200:
    exit()
print("Request successful")

If "Request successful" is printed, the status code of this request was 200.

Alternatively, you can replace the number 200 with the corresponding name; the full mapping is in the list above.

For example, one of the names for 200 is "ok". Let's try it:

import requests

response = requests.get("http://www.baidu.com")
if response.status_code != requests.codes.ok:
    exit()
print("Request successful")

The effect is the same, so choose whichever suits your situation.
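Besides comparing against a single code, a scraper often just needs to know whether a response is an error at all. The sketch below looks up a few names on requests.codes and uses raise_for_status(), which raises HTTPError for 4xx and 5xx codes. The Response object is built by hand purely so the example runs offline; check is a hypothetical helper name.

```python
import requests
from requests.exceptions import HTTPError

# Named lookups on requests.codes are plain integers.
print(requests.codes.ok, requests.codes.not_found, requests.codes['teapot'])


def check(status_code):
    """Return 'ok' for a non-error status, or the HTTPError message otherwise."""
    response = requests.Response()  # built by hand so the sketch runs offline
    response.status_code = status_code
    try:
        response.raise_for_status()  # raises for 4xx and 5xx only
    except HTTPError as exc:
        return str(exc)
    return 'ok'


print(check(200))
print(check(404))
```

In real code the same raise_for_status() call would be made on the object returned by requests.get.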

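Advanced usage

File upload

import requests

# Pass the open file to post() through the files argument to upload it;
# the with statement closes the file once the request has been sent
with open('favicon.ico', 'rb') as f:
    response = requests.post("http://httpbin.org/post", files={'file': f})
print(response.text)

This completes the file upload with a POST request; the files field of the response contains the file's byte stream.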
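The files argument also accepts a tuple of (filename, content, content type), which is handy when the data is already in memory or the server cares about the declared type. The sketch below prepares such a request without sending it and inspects the multipart body; the four bytes are placeholder content, not a real icon.

```python
import requests

# (filename, content, content type); in-memory bytes stand in for a real file.
files = {'file': ('favicon.ico', b'\x00\x00\x01\x00', 'image/x-icon')}
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

print(prepared.headers['Content-Type'])
print(b'filename="favicon.ico"' in prepared.body)
```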

Getting cookies

As mentioned above, response.cookies can be printed directly.

cookies is in fact a dict-like cookie jar (RequestsCookieJar), so we can loop over its items and print each cookie's key and value:

import requests

response=requests.get("http://www.baidu.com")
print(response.cookies)
for key,value in response.cookies.items():
	print(key+'='+value)

The cookie information is retrieved easily, which is considerably more convenient than with urllib.
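Because the cookie jar behaves like a dict, individual cookies can also be looked up by name or converted to a plain dict. The sketch below builds a jar by hand with made-up cookies so that it runs without a network request; response.cookies from a real request supports the same operations.

```python
from requests.cookies import RequestsCookieJar
from requests.utils import dict_from_cookiejar

# Hand-built jar with made-up cookies, so nothing is fetched from the network.
jar = RequestsCookieJar()
jar.set('number', '123456789', domain='httpbin.org', path='/')
jar.set('theme', 'dark', domain='httpbin.org', path='/')

print(jar['number'])              # dict-style lookup by name
print(jar.get('missing', 'n/a'))  # default when the cookie is absent
cookie_dict = dict_from_cookiejar(jar)
print(cookie_dict)
```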

Session persistence

This is mainly used to simulate logging in.

Consider this example:

import requests

# Set a cookie through the cookies/set endpoint
requests.get('http://httpbin.org/cookies/set/number/123456789')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

The cookies come back empty, which is not what we expected. The code above issues two separate GET requests, which behave like two independent browsers, so the second request does not see the cookie set by the first.

Instead, create a Session object and issue both GET requests through it, so that they are treated as operations in the same browser:

import requests

s = requests.Session()
# Set a cookie through the cookies/set endpoint
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)

This time it works.

This approach is widely used to simulate a login session and keep it alive, so that pages behind the login can be fetched.
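The difference between a Session and standalone calls can be observed without contacting a server. In the sketch below a cookie is placed in the session's jar by hand, standing in for what the cookies/set endpoint would do, and the next prepared request is inspected for a Cookie header.

```python
import requests

session = requests.Session()
# Stand-in for the cookie a server would set via /cookies/set/number/123456789.
session.cookies.set('number', '123456789', domain='httpbin.org', path='/')

# The session attaches its stored cookies when it prepares the next request.
prepared = session.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(prepared.headers.get('Cookie'))

# A bare request prepared outside the session carries no Cookie header.
bare = requests.Request('GET', 'http://httpbin.org/cookies').prepare()
print(bare.headers.get('Cookie'))
```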

Certificate verification

When the site being scraped uses HTTPS, Requests first checks whether the site's certificate is valid and raises an SSLError if it is not. To avoid this error, set the verify parameter to False (it defaults to True).

First, without setting it:

import requests

response = requests.get('https://www.12306.cn')
print(response.status_code)

An SSLError is raised.

Now with it set:

import requests

response = requests.get('https://www.12306.cn', verify=False)  # turn off certificate verification
print(response.status_code)

This returns a 200 status code, showing that the request went through without certificate verification.

A warning is still printed, though, recommending that you enable certificate verification. How can this warning be silenced?

Import urllib3 from the requests package and call its disable_warnings method:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # suppress the warning

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

The warning no longer appears.

How do we supply a certificate manually?

Example: specifying a local certificate through cert

import requests

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

Since I have no local certificate here, I will not demonstrate this.
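Instead of switching verification off, verify can also be given the path to a CA bundle, which keeps certificate checking active for sites signed by a private or otherwise untrusted CA. The sketch below sets this once on a Session; make_session is a hypothetical helper and the bundle path is a placeholder, not a real file.

```python
import requests


def make_session(ca_bundle=None):
    """Return a Session that verifies HTTPS against ca_bundle when one is given."""
    session = requests.Session()
    if ca_bundle is not None:
        # Path to a PEM file of trusted CA certificates (hypothetical path below).
        session.verify = ca_bundle
    return session


default_session = make_session()
custom_session = make_session('/path/to/ca-bundle.pem')
print(default_session.verify, custom_session.verify)
```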

Proxy settings

Build a dict containing the proxy addresses available to you and pass it to get() through the proxies argument.

import requests

proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743"
}

response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

What if the proxy requires a username and password?

Put user:password followed by an @ sign before the host in the proxy URL to supply the credentials:

proxies = {
    "http": "http://user:password@127.0.0.1:9743/",
}
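If the username or password contains characters such as @ or :, writing them straight into the URL would make it ambiguous. One way around this, sketched below with the standard library, is to percent-encode the credentials first. build_proxy_url is a hypothetical helper and the credentials are made up.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme='http'):
    """Build a proxy URL, percent-encoding characters such as @ and : in the credentials."""
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


proxies = {
    'http': build_proxy_url('user', 'p@ss:word', '127.0.0.1', 9743),
}
print(proxies['http'])
```

The resulting dict can be passed as the proxies argument in the same way as the hand-written one.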

What if the proxy is a SOCKS proxy rather than an HTTP/HTTPS one?

First install the extra dependency from the command line (on Windows):

pip3 install requests[socks]

After that, proxies of this form can be used:

import requests

proxies = {
    "http": "socks5://127.0.0.1:9743",
    "https": "socks5://127.0.0.1:9743"
}

response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

Timeout settings

import requests

# Set a time limit: the server must respond within 1 second
response = requests.get("https://www.taobao.com", timeout=1)
print(response.status_code)

If the limit is exceeded, an exception is raised. How do we catch it?

import requests
from requests.exceptions import Timeout

try:
    response = requests.get("https://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except Timeout:  # parent class of both ConnectTimeout and ReadTimeout
    print('Timeout')

The exception is caught and handled (here, by printing a message).
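Two refinements are often combined with a timeout: separate connect and read limits, passed as a tuple, and automatic retries for transient failures. The sketch below configures retries through HTTPAdapter and urllib3's Retry class; the retry counts, status list, and timeout values are illustrative assumptions, and fetch is defined but not called so that nothing is sent.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=3, backoff_factor=0.5):
    """Return a Session that retries failed connections and selected 5xx responses."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def fetch(session, url):
    """GET url with separate connect (3 s) and read (10 s) timeouts."""
    return session.get(url, timeout=(3, 10))


session = make_retrying_session()
print(session.get_adapter('https://httpbin.org/get').max_retries.total)
```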

Authentication

Some sites require a username and password before they show any content.

For such sites, pass the username and password through the auth argument.

import requests
from requests.auth import HTTPBasicAuth

# Pass the credentials through the auth argument
r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)

This completes a normal request. If the auth argument is removed, the server returns a 401 status code (Unauthorized).
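Under the hood, basic authentication simply adds an Authorization header. The sketch below prepares a request with a plain (user, password) tuple, which Requests treats as basic auth, and compares the header with a hand-computed value. The httpbin basic-auth URL is used only as a placeholder target; nothing is sent.

```python
import base64

import requests

# A (user, password) tuple is shorthand for HTTPBasicAuth(user, password).
prepared = requests.Request('GET', 'http://httpbin.org/basic-auth/user/123',
                            auth=('user', '123')).prepare()
header = prepared.headers['Authorization']
print(header)

# The header value is "Basic " plus base64 of "user:password".
expected = 'Basic ' + base64.b64encode(b'user:123').decode('ascii')
print(header == expected)
```

Since base64 is an encoding rather than encryption, basic auth is best used over HTTPS.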

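Exception handling

Exception handling is important because it keeps your scraper running without interruption.

The rule is to catch the subclass exceptions first and the parent class (RequestException) last.

import requests
# ConnectionError must be imported; otherwise the name refers to Python's built-in class
from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException

try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    print(response.status_code)
except ReadTimeout:  # the read timed out
    print('Timeout')
except HTTPError:  # HTTP error status
    print('Http error')
except ConnectionError:  # connection failure
    print('Connection error')
except RequestException:  # parent class of all Requests exceptions
    print('Error')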
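The ordering rule follows from the class hierarchy in requests.exceptions: for example, ConnectTimeout is both a ConnectionError and a Timeout, and everything derives from RequestException. The sketch below classifies exception instances from most to least specific; describe is a hypothetical helper, and the exceptions are created by hand rather than raised by real requests.

```python
from requests.exceptions import (
    ConnectionError,  # note: shadows the built-in ConnectionError in this module
    ConnectTimeout,
    HTTPError,
    ReadTimeout,
    RequestException,
    Timeout,
)


def describe(exc):
    """Map an exception instance to a short label, most specific class first."""
    if isinstance(exc, Timeout):  # covers ConnectTimeout and ReadTimeout
        return 'timeout'
    if isinstance(exc, HTTPError):
        return 'http error'
    if isinstance(exc, ConnectionError):
        return 'connection error'
    if isinstance(exc, RequestException):
        return 'request error'
    return 'unexpected'


for exc in (ReadTimeout(), ConnectTimeout(), HTTPError(), ConnectionError(), RequestException()):
    print(type(exc).__name__, '->', describe(exc))
```

If the Timeout check came after the ConnectionError check, a ConnectTimeout would be reported as a connection error instead.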

The above is based on my personal experience; I hope it serves as a useful reference.
