Scraping the Qidian (起点中文网) Novel Rankings with Python Scrapy

Updated: 2021-06-13 10:05:13 Author: 超哥--

Having finished the basics of web scraping, I decided to move on to Scrapy, which is more standardized and more practical. This post records my first Scrapy project: scraping the 24-hour bestseller ranking on Qidian. Readers who need it can use it as a reference.

I. Project Requirements

Scrape the author, title, category, and status (completed or ongoing) of each novel on the ranking.

II. Project Analysis

Target URL: "https://www.qidian.com/rank/hotsales?style=1&page=1"

Searching in the browser's developer console shows that all of the required information is present in the static HTML of the page, so this scrape is fairly easy.

Inspecting the page in the console shows that the required content sits in a series of li elements, and each li holds the information for one book.

Within each li, locate the fields we need.
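
To make the structure concrete, here is a minimal sketch that runs the same relative paths the spider uses later against a small fragment. The fragment is hypothetical and heavily simplified (the real page has more attributes and nesting), and the sketch uses the standard library's ElementTree rather than Scrapy selectors so that it runs without a network request.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one <li> of the ranking page.
SAMPLE = """
<ul>
  <li>
    <div class="book-mid-info">
      <h4><a href="#">Example Book</a></h4>
      <p class="author">
        <a href="#">Example Author</a>
        <a href="#">Fantasy</a>
        <span>Ongoing</span>
      </p>
    </div>
  </li>
</ul>
"""

def extract_books(markup):
    """Return one dict per book block found in the markup."""
    root = ET.fromstring(markup)
    books = []
    # Each div with class 'book-mid-info' describes one book.
    for info in root.findall(".//div[@class='book-mid-info']"):
        books.append({
            "name": info.find("h4/a").text,        # title link
            "author": info.find("p[1]/a[1]").text,  # first link in the first <p>
            "type": info.find("p[1]/a[2]").text,    # second link in the first <p>
            "form": info.find("p[1]/span").text,    # status label
        })
    return books
```

Calling extract_books(SAMPLE) returns a single dictionary with the four fields. ElementTree only accepts well-formed markup, so this is an illustration of the paths, not a replacement for Scrapy's selectors on real pages.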

Next, find the URL of the second page and compare it with the first:
"https://www.qidian.com/rank/hotsales?style=1&page=1"
"https://www.qidian.com/rank/hotsales?style=1&page=2"
Only the page parameter changes from one page to the next.
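
As a small sketch of that pattern, the page URLs can be generated from a template. The names RANK_URL and page_urls are made up for illustration, and the default of 10 pages simply matches the limit used in the spider later on.

```python
# Assumed template: only the page parameter differs between the two URLs above.
RANK_URL = "https://www.qidian.com/rank/hotsales?style=1&page={page}"

def page_urls(first_page=1, last_page=10):
    """Return the ranking URLs for first_page..last_page, inclusive."""
    return [RANK_URL.format(page=page) for page in range(first_page, last_page + 1)]
```
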
Now we can start writing the Scrapy program.

III. Writing the Program

Creating the project is simple enough that I will skip it here.

1. Writing the item (data storage)

import scrapy

class QidianHotItem(scrapy.Item):
    name = scrapy.Field()    # book title
    author = scrapy.Field()  # author
    type = scrapy.Field()    # category
    form = scrapy.Field()    # status: completed or ongoing

2. Writing the spider (data scraping, the core code)

# coding: utf-8

from scrapy import Request
from scrapy.spiders import Spider
from ..items import QidianHotItem
# import the required libraries

class HotSalesSpider(Spider):  # define the spider class
    name = "hot"  # name of the spider
    qidian_header = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}  # request header
    current_page = 1  # page the crawl starts from

    def start_requests(self):  # override the initial request
        # Request issues the HTTP request:
        #   url: the target URL
        #   headers: request headers (imitating a browser)
        #   callback: method that parses the response (defaults to parse if omitted)
        url = "https://www.qidian.com/rank/hotsales?style=1&page=1"
        yield Request(url, headers=self.qidian_header, callback=self.hot_parse)

    def hot_parse(self, response):  # parse the data with XPath
        # select every book block on the page
        list_selector = response.xpath("//div[@class='book-mid-info']")
        for one_selector in list_selector:
            # book title
            name = one_selector.xpath("h4/a/text()").extract()[0]
            # author
            author = one_selector.xpath("p[1]/a[1]/text()").extract()[0]
            # category
            type = one_selector.xpath("p[1]/a[2]/text()").extract()[0]
            # status (completed or ongoing)
            form = one_selector.xpath("p[1]/span/text()").extract()[0]

            # create the item and store the fields in it
            item = QidianHotItem()
            item['name'] = name
            item['author'] = author
            item['type'] = type
            item['form'] = form

            yield item  # hand the item over to Scrapy

        # Once the whole page is parsed, build the next page's URL and request it.
        # This sits outside the for loop so the counter advances once per page,
        # not once per book.
        self.current_page += 1
        if self.current_page <= 10:  # scrape the first 10 pages
            next_url = "https://www.qidian.com/rank/hotsales?style=1&page=" + str(self.current_page)
            yield Request(url=next_url, headers=self.qidian_header, callback=self.hot_parse)


    def css_parse(self, response):
        # Alternative parser using CSS selectors instead of XPath;
        # pass callback=self.css_parse in a Request to use it.
        list_selector = response.css("[class='book-mid-info']")
        for one_selector in list_selector:
            # book title
            name = one_selector.css("h4>a::text").extract()[0]
            # author (first link inside .author)
            author = one_selector.css(".author a::text").extract()[0]
            # category (second link inside .author)
            type = one_selector.css(".author a::text").extract()[1]
            # status (completed or ongoing)
            form = one_selector.css(".author span::text").extract()[0]

            # build the item
            item = QidianHotItem()
            item['name'] = name
            item['author'] = author
            item['type'] = type
            item['form'] = form
            yield item
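
One thing to watch in the parsers above: extract() returns a list of strings, so indexing it with [0] raises IndexError whenever a selector matches nothing, for example if the page layout changes. A small helper along these lines is one possible guard. The name first_or_default is made up for illustration and is not part of Scrapy.

```python
def first_or_default(values, default=""):
    """Return the first extracted string with whitespace stripped,
    or `default` when the selector matched nothing."""
    return values[0].strip() if values else default
```

In the spider it could be used as name = first_or_default(one_selector.xpath("h4/a/text()").extract()), which yields an empty string instead of stopping the callback when a field is missing.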

3. start.py (a substitute for the command line)

Create start.py in the spider project folder.

from scrapy import cmdline
# import Scrapy's command-line entry point
cmdline.execute("scrapy crawl hot -o hot.csv".split())
# run the spider and export the items to a CSV file

When Scrapy's log output scrolls by and finishes without errors, the crawl has succeeded.

The scraped data is saved in hot.csv.
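
A quick way to check the export is to read it back with the standard library. This is a minimal sketch that assumes the file has a header row with the item's field names (name, author, type, form) and is UTF-8 encoded. The helper names load_rows and count_by_type are made up for illustration.

```python
import csv
from collections import Counter

def load_rows(path="hot.csv"):
    """Read the exported CSV into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def count_by_type(rows):
    """Count how many scraped books fall into each category."""
    return Counter(row["type"] for row in rows)
```

For example, count_by_type(load_rows("hot.csv")) gives a rough picture of which categories dominate the ranking.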

Summary

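This scrape was quite simple because it only used the spider and the item, which almost every Scrapy project needs. Later projects will also require writing and configuring middlewares.py, pipelines.py, and settings.py, as well as extracting data from JavaScript and JSON, which is considerably harder.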
