Fetching Data with Scrapy-Redis and POST Requests: A Worked Example
Preface
When crawling a small site, Scrapy on its own is usually enough. Against a larger site, however, a single Scrapy instance quickly runs out of steam, and it would be nice to point several crawlers at the same site and let them share the work. Unfortunately, Scrapy does not officially support multiple instances crawling one site together. The official suggestion is to **split the site into several parts and hand each part to a different Scrapy instance**. That works in principle, but partitioning a site is tedious, and this is exactly where Scrapy-Redis comes in.
If you have found this article you presumably already know what Scrapy and Scrapy-Redis are, so the basic concepts are skipped here. By default, Scrapy-Redis treats each message it pops from Redis as a URL and issues a GET request for it; to fetch data with POST requests instead, you only need to override the make_request_from_data method. Oddly, a clear and concise answer to this was nowhere to be found online; perhaps it is just too simple.
Taking httpbin.org as the example site, first add the required settings to settings.py, adjusting them to your actual environment:
SCHEDULER = "scrapy_redis.scheduler.Scheduler" #啟用Redis調(diào)度存儲請求隊列 SCHEDULER_PERSIST = True #不清除Redis隊列、這樣可以暫停/恢復(fù) 爬取 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" #確保所有的爬蟲通過Redis去重 SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue' REDIS_URL = "redis://127.0.0.1:6379"
The spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider


class HpbSpider(RedisSpider):
    name = 'hpb'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this method to
        provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.
        """
        # Instead of treating the message as a URL, POST it as a form field.
        return scrapy.FormRequest("https://www.httpbin.org/post",
                                  formdata={"data": data},
                                  callback=self.parse)

    def parse(self, response):
        print(response.body)
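Note that the message popped from Redis arrives as raw bytes. If each message needs to carry more than one form field, one variant worth sketching (an assumption, not part of the original example; the spider name and field names here are invented) is to push JSON strings and decode them in make_request_from_data:

import json
import scrapy
from scrapy_redis.spiders import RedisSpider


class JsonPostSpider(RedisSpider):
    # Hypothetical spider: expects redis messages like b'{"keyword": "scrapy", "page": 1}'
    name = 'hpb_json'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        payload = json.loads(data.decode('utf-8'))  # data is bytes from redis
        # formdata wants string values, so coerce everything with str()
        return scrapy.FormRequest("https://www.httpbin.org/post",
                                  formdata={k: str(v) for k, v in payload.items()},
                                  callback=self.parse)

    def parse(self, response):
        print(response.body)

The producer would then push json.dumps({"keyword": "scrapy", "page": 1}) instead of a bare value.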
Here the response body is simply printed; in real use you would combine this with a pipeline and write the data to a database or similar.
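As a minimal sketch of that idea (JsonLinesPipeline is a made-up name, and it assumes parse() is changed to yield dicts instead of printing), an item pipeline that appends each item to a JSON-lines file might look like this:

# pipelines.py - a sketch, not part of the original example
import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open('results.jl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each scraped item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

It would be enabled with ITEM_PIPELINES = {'myproject.pipelines.JsonLinesPipeline': 300} in settings.py, where 'myproject' stands in for your project's package name.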
Then start the spider with scrapy crawl hpb. Since nothing has been written to test_post_data yet, the program sits waiting after startup. Next, simulate writing data to the queue:
import redis

rd = redis.Redis('127.0.0.1', port=6379, db=0)
for i in range(1000):
    rd.lpush('test_post_data', i)  # push the values 0..999 onto the spider's redis_key list
At this point you can see that the spider has started fetching data:
2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "0"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "1"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "3"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "2"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "4"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "5"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "6"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "7"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
As for duplicates: the request fingerprint that RFPDupeFilter stores in Redis covers the method, URL, and body, so if the same data is POSTed twice the second request is simply never sent. If you are in the special situation where POSTing identical data can return different results, note that adding dont_filter=True does not help, since RFPDupeFilter itself never looks at that parameter; you would have to override the filtering behaviour yourself.
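A quick standalone check of that fingerprint behaviour (using scrapy.utils.request.request_fingerprint, which is available in Scrapy versions contemporary with this article, though newer releases deprecate it in favour of a fingerprinter component):

from scrapy import FormRequest
from scrapy.utils.request import request_fingerprint

# Identical URL + POST body -> identical fingerprint, so the second request
# would be dropped by RFPDupeFilter; a different body yields a new fingerprint.
r1 = FormRequest("https://www.httpbin.org/post", formdata={"data": "0"})
r2 = FormRequest("https://www.httpbin.org/post", formdata={"data": "0"})
r3 = FormRequest("https://www.httpbin.org/post", formdata={"data": "1"})
print(request_fingerprint(r1) == request_fingerprint(r2))  # True
print(request_fingerprint(r1) == request_fingerprint(r3))  # False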
Summary
That is all for this article. I hope its content offers some reference value for your study or work, and thank you all for supporting 腳本之家.