Python Scrapy in Practice: Crawling the Gushiwen Poetry Site (gushiwen.cn)
Requirements
Using Python and the Scrapy framework, crawl poetry data from gushiwen.cn: each poem's title, author, dynasty, body text, and translation. The crawl proceeds page by page, 4 pages in total. The first page's URL is https://www.gushiwen.cn/default_1.aspx.

1. Creating the Scrapy project
First create the Scrapy project and the spider.
In the target directory, create a project named prose:
scrapy startproject prose
Enter the project directory, then create a spider named gs whose crawl scope is gushiwen.cn:
cd prose
scrapy genspider gs gushiwen.cn
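If both commands succeed, the project typically has the standard Scrapy layout shown below (gs.py is created by the genspider command and is the spider we edit in step 3):

prose/
    scrapy.cfg
    prose/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            gs.py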
2. Global configuration: settings.py
Edit the configuration file settings.py as follows:
① Do not obey the robots.txt protocol (ROBOTSTXT_OBEY = False).
② Set the download delay to 1 second (DOWNLOAD_DELAY = 1).
③ Add default request headers and enable the item pipeline.
④ Set the logging level: LOG_LEVEL = "WARNING".
The full file is as follows:
# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'prose'
SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'
LOG_LEVEL = "WARNING"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'prose.middlewares.ProseSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'prose.middlewares.ProseDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prose.pipelines.ProsePipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
3. The spider: gs.py
The first step is analysing the target pages; that process is not repeated here.
This file is the core part that actually needs to be written.
First, change the start URL from the site's home page to the first page of the listing we want to crawl.
Requirements recap: the fields to scrape are the poem's title, author, dynasty, body text, and translation, collected page by page.
On the list page, each poem's title, author, dynasty, and body text sit inside a single <div class="sons"> block; the translation lives on the poem's detail page.
To demonstrate two different approaches:
the four fields title, author, dynasty, and body text are extracted directly in parse(), inside a for loop wrapped in a try…except block so that indexing an empty result does not crash the spider;
for the translation, we define an extra parse_detail() method and pass it as the callback of the scrapy.Request() issued for each detail page.
For pagination, the idea is: once all data on the current page has been collected (i.e. after the for loop finishes), extract the next-page link from the current page and check whether it is empty. If it is not empty, issue another scrapy.Request() with that link and parse() as the callback; if it is empty, this was the last page and the crawl ends there.
The code is as follows:
import scrapy
from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse the list page
    def parse(self, response):
        # Each div with class="sons" corresponds to one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            try:
                # Poem title
                title = div.xpath('.//b/text()').get()
                # Author and dynasty
                source = div.xpath('.//p[@class="source"]/a/text()').getall()
                # Author
                author = source[0]
                # Dynasty
                dynasty = source[1]
                # Body text: join all text nodes of the content div
                content_list = div.xpath('.//div[@class="contson"]//text()').getall()
                content_plus = ''.join(content_list).strip()
                # URL of the poem's detail page
                detail_url = div.xpath('.//p/a/@href').get()
                item = ProseItem(title=title, author=author, dynasty=dynasty, content_plus=content_plus, detail_url=detail_url)
                # print(item)
                yield scrapy.Request(
                    url=detail_url,
                    callback=self.parse_detail,
                    meta={'prose_item': item}
                )
            except:
                # Skip entries where a field is missing (e.g. indexing an empty list)
                pass
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            print(next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    # Parse the detail page to get the translation
    def parse_detail(self, response):
        item = response.meta.get('prose_item')
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        # print(item)
        yield item
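One caveat, hedged because it depends on the page markup at crawl time: if the extracted href values are site-relative paths rather than absolute URLs, scrapy.Request() will reject them with a "missing scheme" error. A minimal adjustment, shown as a sketch of the relevant lines inside parse(), is to resolve them with response.urljoin(), which leaves absolute URLs untouched:

# Sketch: resolving possibly relative hrefs inside parse()
detail_url = response.urljoin(div.xpath('.//p/a/@href').get())

next_url = response.xpath('//a[@id="amore"]/@href').get()
if next_url:
    # urljoin is a no-op for absolute URLs, so this is safe either way
    yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)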
4. The item definition: items.py
Here the ProseItem class is defined so that the spider above can use it. (Note that the spider imports this module as prose.items; if your IDE cannot resolve the import, mark the appropriate folder as the project/sources root.)
import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem body text
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()
5. The pipeline: pipelines.py
The pipeline is where the storage of the scraped items is implemented.
from itemadapter import ItemAdapter
import json


class ProsePipeline:
    def __init__(self):
        # Open the output file once when the pipeline is instantiated
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict, then to a JSON string, and write one line per item
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
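A small design note: opening the file in __init__ works, but Scrapy also calls an open_spider hook when the crawl starts, which keeps the file's lifetime tied to the crawl itself. An equivalent sketch of the same pipeline using that hook:

import json

class ProsePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file here
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line, with Chinese characters kept readable
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.f.close()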
6. Running the crawl: start.py
Define a small script that runs the crawl command from within Python.
from scrapy import cmdline
cmdline.execute('scrapy crawl gs'.split())
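Running python start.py from the project root is equivalent to running scrapy crawl gs in a terminal from the same directory; the script simply makes it convenient to launch and debug the crawl from an IDE.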
After the program runs, the data we need is saved, one item per line, in a text file named gs.txt.
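Since each line of gs.txt is a standalone JSON object with the fields defined in ProseItem, the file can easily be loaded back for later processing. A small sketch, assuming the gs.txt produced above:

import json

# Read the scraped poems back in: one JSON object per line
with open('gs.txt', encoding='utf-8') as f:
    poems = [json.loads(line) for line in f if line.strip()]

print(len(poems))                              # number of poems scraped
print(poems[0]['title'], poems[0]['author'])   # fields defined in ProseItem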
This concludes the walk-through of crawling gushiwen.cn with Python and Scrapy.