Example: implementing a web crawler in Python with RabbitMQ

Updated: 2014-02-20 09:47:31

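This article walks through an example of a web crawler written in Python that uses RabbitMQ as the message broker, with Celery dispatching the page fetches. Readers who need something similar can use it as a reference.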

Write tasks.py

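from celery import Celery
from tornado.httpclient import HTTPClient, HTTPError

app = Celery('tasks')
app.config_from_object('celeryconfig')

@app.task
def get_html(url):
    # Fetch the page synchronously inside the Celery worker.
    http_client = HTTPClient()
    try:
        response = http_client.fetch(url, follow_redirects=True)
        return response.body
    except HTTPError:
        # The server answered with an error status: report "no content".
        return None
    finally:
        # Always release the client, including on the early returns above.
        http_client.close()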

Write celeryconfig.py

CELERY_IMPORTS = ('tasks',)
BROKER_URL = 'amqp://guest@localhost:5672//'
CELERY_RESULT_BACKEND = 'amqp://'
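
The setting names above are the uppercase form. As a minimal sketch, assuming Celery 4 or later, the same configuration can be written with the lowercase setting names. The `rpc://` result backend is a substitution, not part of the original example: the `amqp://` result backend was deprecated in Celery 4 and removed in Celery 5.

```python
# celeryconfig.py, lowercase setting names (assumes Celery 4 or later)

# Modules the worker imports at startup so that their tasks get registered.
imports = ('tasks',)

# RabbitMQ broker on the default port and the default "/" virtual host.
broker_url = 'amqp://guest@localhost:5672//'

# Assumption: rpc:// stands in for the original amqp:// result backend.
result_backend = 'rpc://'
```

With `rpc://`, a result can only be retrieved by the client that sent the task, which matches how spider.py waits on each result right after dispatching it.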

Write spider.py

from tasks import get_html
from queue import Queue
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import threading


class spider(object):
    def __init__(self):
        self.visited = set()
        self.lock = threading.Lock()  # guards self.visited across worker threads
        self.queue = Queue()

    def process_html(self, html):
        # Hook for page processing; intentionally a no-op in this example.
        pass

    def _add_links_to_queue(self, url_base, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            url = link.get('href')
            if not url:
                continue  # <a> tag without an href attribute
            url_com = urlparse(url)
            if not url_com.netloc:
                # Relative link: resolve it against the page URL.
                self.queue.put(urljoin(url_base, url))
            else:
                self.queue.put(url_com.geturl())

    def start(self, url):
        self.queue.put(url)
        for i in range(20):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()
        # Blocks until every queued URL has been marked done.
        self.queue.join()

    def _worker(self):
        while True:
            url = self.queue.get()
            try:
                with self.lock:
                    if url in self.visited:
                        continue
                    # Mark before fetching so no other thread fetches it too.
                    self.visited.add(url)
                html = None
                try:
                    # Run the fetch on a Celery worker and wait for the result.
                    html = get_html.delay(url).get(timeout=5)
                except Exception as e:
                    print(url)
                    print(e)
                if html:
                    self.process_html(html)
                    self._add_links_to_queue(url, html)
            finally:
                # Every get() needs a task_done(), or queue.join() never returns.
                self.queue.task_done()


s = spider()
s.start("http://www.dhdzp.com/")

Because of certain special cases in real-world HTML, the program still needs further refinement.
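
One special case the crawler above does not handle is an `href` that is not a crawlable page, such as `mailto:`, `javascript:`, or a bare `#fragment`; these get queued as if they were pages. The following is a minimal sketch of one way to filter them, using only `urllib.parse`. The helper name `normalize_link` is hypothetical and not part of the original code.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_link(url_base, href):
    """Hypothetical helper: return an absolute http(s) URL, or None to skip."""
    if not href:
        return None
    # Resolve relative links against the page URL, then drop any #fragment
    # so that page.html and page.html#top count as the same page.
    absolute, _fragment = urldefrag(urljoin(url_base, href.strip()))
    parts = urlparse(absolute)
    # Keep only web pages; this skips mailto:, javascript:, and similar schemes.
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        return None
    return absolute
```

In `_add_links_to_queue`, the return value could be passed to `self.queue.put` only when it is not `None`, which would replace the separate `netloc` check.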
