Common Ways to Implement Concurrent Crawlers in Python

Updated: November 19, 2020 15:22:57  Author: 迎风而来


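When a single crawler is fetching pages, we cannot afford to fetch one URL at a time: it is slow, and the CPU sits idle while each request waits on the network. In Python, concurrent fetching is currently implemented in three main ways: processes, threads, and coroutines. Processes are outside the scope of this article. Generally speaking, processes are used to launch multiple spiders: for example, we start 4 processes, dispatch 4 spiders to crawl at the same time, and each spider fetches 4 URLs concurrently.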

So what we discuss today is how, within a single crawler, to fetch as many URLs as possible at the same time, and to do so efficiently.

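1. Sequential fetching

Sequential fetching is by far the most common approach, and it is what most people new to crawling start with. Below is a test script that fetches 8 URLs one after another, so we can measure how long the whole run takes: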

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Accept-Encoding': 'gzip, deflate',}
URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',
    'https://www.zhihu.com/topic/19804387/newest',
    'http://blog.csdn.net/yueguanghaidao/article/details/24281751',
    'https://my.oschina.net/visualgui823/blog/36987',
    'http://blog.chinaunix.net/uid-9162199-id-4738168.html',
    'http://www.tuicool.com/articles/u67Bz26',
    'http://rfyiamcool.blog.51cto.com/1030776/1538367/',
    'http://itindex.net/detail/26512-flask-tornado-gevent']

# URLS is an arbitrary batch of URLs picked at random

def func():
  """
  Sequential fetching
  """
  import requests
  import time
  urls = URLS
  # copy the dict so the shared HEADERS constant is not modified
  headers = dict(HEADERS)
  headers['user-agent'] = "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537" \
              ".36+(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36"
  print('Sequential fetching')
  starttime = time.time()
  for url in urls:
    try:
      r = requests.get(url, allow_redirects=False, timeout=2.0, headers=headers)
    except requests.RequestException:
      # skip URLs that time out or cannot be reached
      pass
    else:
      print(r.status_code, r.url)
  endtime = time.time()
  print(endtime - starttime)

func()

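We simply use the built-in time.time() for timing. It is rough, but it gives a general picture. Here is the timing for sequential fetching:

As the screenshot of the run shows, the results are printed in exactly the same order as the URLs. The total time was 7.763269901275635 seconds for 8 URLs, or roughly 0.97 seconds per URL. Overall, that is acceptable.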
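Since the timing above is admittedly rough, here is a minimal sketch of a slightly more careful timer. It is not the code used for the numbers in this article: `timed` and `fake_fetch` are hypothetical helpers, and `fake_fetch` just sleeps so the example runs without network access. The one real change is `time.perf_counter()`, a monotonic high-resolution clock from the standard library that is better suited to measuring intervals than `time.time()`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block under `label`."""
    start = time.perf_counter()  # monotonic clock, unaffected by system clock changes
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

def fake_fetch(url, delay=0.01):
    """Hypothetical stand-in for requests.get so the sketch needs no network."""
    time.sleep(delay)
    return 200, url

timings = {}
with timed("sequential", timings):
    responses = [fake_fetch(u) for u in ["http://example.com/a", "http://example.com/b"]]
print(timings)
```

In a real run you would replace `fake_fetch` with the `requests.get` call from the script above and keep one entry in `timings` per fetching strategy.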

2. Multithreaded fetching

Threads are a fairly good way to get concurrency in Python. The test code below creates one thread per URL, so 8 threads fetch concurrently:

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Accept-Encoding': 'gzip, deflate',}
URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',
    'https://www.zhihu.com/topic/19804387/newest',
    'http://blog.csdn.net/yueguanghaidao/article/details/24281751',
    'https://my.oschina.net/visualgui823/blog/36987',
    'http://blog.chinaunix.net/uid-9162199-id-4738168.html',
    'http://www.tuicool.com/articles/u67Bz26',
    'http://rfyiamcool.blog.51cto.com/1030776/1538367/',
    'http://itindex.net/detail/26512-flask-tornado-gevent']

def thread():
  """
  Multithreaded fetching, one thread per URL
  """
  from threading import Thread
  import requests
  import time
  urls = URLS
  # copy the dict so the shared HEADERS constant is not modified
  headers = dict(HEADERS)
  headers['user-agent'] = "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+" \
              "(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36"
  def get(url):
    try:
      r = requests.get(url, allow_redirects=False, timeout=2.0, headers=headers)
    except requests.RequestException:
      # skip URLs that time out or cannot be reached
      pass
    else:
      print(r.status_code, r.url)

  print('Multithreaded fetching')
  ts = [Thread(target=get, args=(url,)) for url in urls]
  starttime = time.time()
  for t in ts:
    t.start()
  # wait for every thread to finish before stopping the clock
  for t in ts:
    t.join()
  endtime = time.time()
  print(endtime - starttime)

thread()

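The timing for multithreaded fetching is as follows:

Compared with sequential fetching, the 8-thread version is clearly more than 3 times faster: the whole run took only 2.154 seconds. The results are also no longer printed in the order of the URLs, which shows that each URL finishes at a different time. Threading works by switching constantly between threads inside one process, letting each run for a while, and for network I/O this performs very well. Switching between threads does, however, carry a noticeable resource cost.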
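One common way to keep that switching cost in check is to cap the number of threads instead of creating one per URL. The following is only a sketch of that idea using the standard library's `concurrent.futures.ThreadPoolExecutor`, not the code that produced the timing above. `fetch_all` and `fake_fetch` are hypothetical names, `max_workers=4` is an arbitrary choice, and `fake_fetch` sleeps instead of touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a bounded pool of worker threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map each future back to its URL so results can be labelled
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failing URL should not stop the rest
                results[url] = exc
    return results

def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url, ...).status_code."""
    time.sleep(0.05)
    return 200

demo_urls = ["http://example.com/%d" % i for i in range(8)]
start = time.perf_counter()
demo_results = fetch_all(demo_urls, fake_fetch, max_workers=4)
elapsed = time.perf_counter() - start
print(len(demo_results), "URLs fetched in", round(elapsed, 2), "seconds")
```

With 8 URLs and 4 workers the simulated run needs about two rounds of waiting rather than eight, while never holding more than 4 threads at once.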

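3. Concurrent fetching with gevent

gevent provides lightweight coroutines that can be used in place of threads. They all run inside a single thread, so they consume far fewer machine resources than threads do. Whenever one coroutine blocks on network I/O, gevent immediately switches to another one and keeps cycling through them, which reduces the total fetching time.
Below is the test code: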

# Patch the standard library before requests is imported; without this,
# requests uses ordinary blocking sockets and the greenlets run one by one
from gevent import monkey
monkey.patch_all()

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Accept-Encoding': 'gzip, deflate',}

URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',
    'https://www.zhihu.com/topic/19804387/newest',
    'http://blog.csdn.net/yueguanghaidao/article/details/24281751',
    'https://my.oschina.net/visualgui823/blog/36987',
    'http://blog.chinaunix.net/uid-9162199-id-4738168.html',
    'http://www.tuicool.com/articles/u67Bz26',
    'http://rfyiamcool.blog.51cto.com/1030776/1538367/',
    'http://itindex.net/detail/26512-flask-tornado-gevent']

def main():
  """
  Concurrent fetching with gevent
  """
  import requests
  import gevent
  import time

  # copy the dict so the shared HEADERS constant is not modified
  headers = dict(HEADERS)
  headers['user-agent'] = "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+" \
              "(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36"
  urls = URLS
  def get(url):
    try:
      r = requests.get(url, allow_redirects=False, timeout=2.0, headers=headers)
    except requests.RequestException:
      # skip URLs that time out or cannot be reached
      pass
    else:
      print(r.status_code, r.url)

  print('Concurrent fetching with gevent')
  starttime = time.time()
  # one greenlet per URL, then wait for all of them
  g = [gevent.spawn(get, url) for url in urls]
  gevent.joinall(g)
  endtime = time.time()
  print(endtime - starttime)

main()

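The timing for coroutine-based fetching is as follows:

Normally, gevent takes about as long as multithreading. In my run the time was somewhat longer, which I put down to my network or my machine. One more thing to check: requests only yields to other greenlets when the standard library's blocking socket calls have been patched with gevent.monkey.patch_all() before requests is imported. Without that patch the requests run one after another, which would also produce a long total time. Please run the script on your own machine and see what you get.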

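4. Concurrent fetching with tornado coroutines

tornado's coroutines are coroutines in the true sense. They work almost exactly like those of asyncio in Python 3, the futures of the two can be converted into each other, and tornado offers an asyncio-compatible interface (from Tornado 5.0 on, it runs on the asyncio event loop by default under Python 3).

Writing concurrent code with coroutines is a little more involved, but it is the recommended style. If you are on Python 3, I strongly suggest using coroutines for concurrent fetching. Below is the test code: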

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Accept-Encoding': 'gzip, deflate',}

URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',
    'https://www.zhihu.com/topic/19804387/newest',
    'http://blog.csdn.net/yueguanghaidao/article/details/24281751',
    'https://my.oschina.net/visualgui823/blog/36987',
    'http://blog.chinaunix.net/uid-9162199-id-4738168.html',
    'http://www.tuicool.com/articles/u67Bz26',
    'http://rfyiamcool.blog.51cto.com/1030776/1538367/',
    'http://itindex.net/detail/26512-flask-tornado-gevent']
import time
from tornado.gen import coroutine
from tornado.ioloop import IOLoop
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

# the URLs are the same as in the previous examples
class MyClass(object):

  def __init__(self):
    #AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
    self.http = AsyncHTTPClient()

  @coroutine
  def get(self, url):
    # tornado adds the Host header to the request automatically
    request = HTTPRequest(url=url,
              method='GET',
              headers=HEADERS,
              connect_timeout=2.0,
              request_timeout=2.0,
              follow_redirects=False,
              user_agent="Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+"
                         "(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36")
    # fetch() no longer accepts a callback argument as of Tornado 6.0,
    # so wait for the response and pass it to find() directly
    response = yield self.http.fetch(request, raise_error=False)
    self.find(response)

  def find(self, response):
    if response.error:
      print(response.error)
    print(response.code, response.effective_url, response.request_time)


class Download(object):

  def __init__(self):
    self.a = MyClass()
    self.urls = URLS

  @coroutine
  def d(self):
    print('Concurrent fetching with tornado')
    starttime = time.time()
    # yielding a list of futures waits for all of them in parallel
    yield [self.a.get(url) for url in self.urls]
    endtime = time.time()
    print(endtime - starttime)

if __name__ == '__main__':
  dd = Download()
  loop = IOLoop.current()
  loop.run_sync(dd.d)

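The timing is as follows:

The whole run took 1.28087 seconds, which is exactly the time needed to fetch the last URL. tornado reports the response time of each request out of the box, and the screenshot shows that the last URL alone took 1.28087 seconds, far longer than any of the others. That single URL is what made the total so long. We can infer that the earlier concurrent runs also spent most of their time on this same URL.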
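The observation that the total equals the slowest request can be reproduced without any network at all. Here is a small sketch using asyncio from the standard library rather than tornado, under the assumption that simulated delays are a fair stand-in for real response times. `fake_fetch`, `crawl`, and the delay values are made up for illustration.

```python
import asyncio
import time

async def fake_fetch(url, delay):
    """Hypothetical stand-in for an async HTTP request taking `delay` seconds."""
    await asyncio.sleep(delay)
    return url, delay

async def crawl(delays):
    """Start all simulated requests at once and wait until every one is done."""
    tasks = [fake_fetch("http://example.com/%d" % i, d) for i, d in enumerate(delays)]
    start = time.perf_counter()
    results = await asyncio.gather(*tasks)  # results keep the order of `tasks`
    return results, time.perf_counter() - start

delays = [0.1, 0.1, 0.1, 0.3]  # the last "URL" is the slow one
results, total = asyncio.run(crawl(delays))
print("total:", round(total, 2), "sum of delays:", round(sum(delays), 2))
```

The measured total should land close to the largest delay and well below the sum of all delays, which is the same pattern as in the tornado run above.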

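Summary:

The tests above are far from rigorous. We used too few URLs for the results to say anything about the relative merits of each approach. If ten thousand different URLs were fetched at once and the total time recorded, we could draw a reasonably objective conclusion.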
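A test with ten thousand URLs would usually not start every request at the same instant; some cap on in-flight requests is typical. The sketch below shows one way to do that with `asyncio.Semaphore`. It is an illustration under assumptions, not a benchmark harness: `bounded_crawl`, `fake_fetch`, the `stats` dict, and the limit of 50 are all hypothetical, and the fetch is simulated with a short sleep.

```python
import asyncio

async def bounded_crawl(urls, fetch, limit=100):
    """Fetch every URL while keeping at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:  # waits here once `limit` fetches are running
            return await fetch(url)

    return await asyncio.gather(*(guarded(u) for u in urls))

stats = {"in_flight": 0, "peak": 0}

async def fake_fetch(url):
    """Hypothetical stand-in for a real async HTTP call; tracks concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.001)
    stats["in_flight"] -= 1
    return 200

many_urls = ["http://example.com/%d" % i for i in range(1000)]
statuses = asyncio.run(bounded_crawl(many_urls, fake_fetch, limit=50))
print(len(statuses), "requests, peak concurrency", stats["peak"])
```

Recording the total time of such a run for each strategy, over the same URL list, would give a fairer comparison than the 8-URL tests in this article.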

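In addition, others have reported from their own tests that multithreaded fetching is far less efficient than gevent. So if you are on Python 2, I recommend gevent for concurrent fetching. If you are on Python 3, I recommend tornado's HTTP client combined with coroutines. Judging from the results above, tornado's coroutines came out ahead of gevent's lightweight coroutines, but I have not tested this properly, so treat it as an impression rather than a measured result.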
