Crawling weather data with Python Scrapy and exporting it to a CSV file
Crawling the weather for xxx
Target URL: https://tianqi.2345.com/today-60038.htm
Installation
pip install scrapy
The version used here is Scrapy 2.5.
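If you want to confirm which version you actually have installed, Scrapy's command-line tool can report it:

scrapy version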
Creating the Scrapy project
Run the following on the command line:
scrapy startproject name
where name is the project name,
e.g. scrapy startproject spider_weather
Then run:
scrapy genspider spider_name domain
e.g. scrapy genspider changshu tianqi.2345.com
The generated folder structure (inside the spider_weather project directory):
- spider_weather
  - spiders
    - __init__.py
    - changshu.py
  - __init__.py
  - items.py
  - middlewares.py
  - pipelines.py
  - settings.py
- scrapy.cfg

File descriptions
| Name | Purpose |
|---|---|
| scrapy.cfg | Project configuration; mainly provides base configuration for the Scrapy command-line tool (the settings that actually affect crawling live in settings.py) |
| items.py | Data-model definitions for structured data, similar to Django's Model |
| pipelines.py | Item-processing behaviour, e.g. persisting structured data |
| settings.py | Configuration file: crawl depth, concurrency, download delay, etc. |
| spiders/ | Spider directory: create spider files and write the crawling rules here |
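In this walkthrough the spider simply yields plain dicts, so items.py is left untouched. Purely as an illustration, a hypothetical spider_weather/items.py declaring the four fields scraped below might look like this:

import scrapy


class SpiderWeatherItem(scrapy.Item):
    # One day of the 7-day forecast
    date = scrapy.Field()   # date
    state = scrapy.Field()  # weather condition
    temp = scrapy.Field()   # temperature
    wind = scrapy.Field()   # wind level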
Writing the spider
1. Write the scraping logic in the spider file created under the spiders folder, which in this example is spiders/changshu.py.
The code looks like this:
import scrapy


class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
        # Fields: date, weather condition, temperature, wind level
        # Parsed with XPath; the syntax is simple and worth a quick look if you have not used it
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()
        # Yield one record per forecast day
        for date, state, temp, wind in zip(dates, states, temps, winds):
            yield {
                'date': date,
                'state': state,
                'temp': temp,
                'wind': wind
            }
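If these XPath expressions come back empty (the class names on the page may have changed since this was written), they can be tested interactively with scrapy shell before running the full crawl, for example:

scrapy shell https://tianqi.2345.com/today-60038.htm
>>> response.xpath('//a[@class="seven-day-item "]/em/text()').getall()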
2. Configure settings.py
Set a browser User-Agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
Disable robots.txt compliance:
ROBOTSTXT_OBEY = False
The complete file then looks like this:
# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'spider_weather'
SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
# 'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
3. Then run the following on the command line:
scrapy crawl changshu -o weather.csv
Note: the command must be run from inside the spider_weather project directory.
scrapy crawl spider_name -o weather.csv (the -o option sets the export file)
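As an alternative to the -o flag (assuming Scrapy 2.1 or newer), the export target can also be configured once in settings.py via the FEEDS setting, so that a plain scrapy crawl changshu produces the file. A minimal sketch:

FEEDS = {
    'weather.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}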
4. Running the command produces weather.csv with the scraped rows (result screenshot not reproduced here).

Supplement: field issues when Scrapy exports CSV
When exporting with -o in CSV format, the fields in the output file are neither in the order defined in items.py nor in the order they are written in the spider, which makes the exported data harder to read; in addition, the exported CSV can end up with a blank line between items. The rest of this section describes how to solve both problems.
1. Field order
Create a file named csv_item_exporter.py under the project's spiders directory (the file name can be changed, but it must match the module path used in FEED_EXPORTERS below) with the following content. Note that older tutorials import from scrapy.conf and scrapy.contrib.exporter, which no longer exist in Scrapy 2.x; the version below uses the current import paths:
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # Pull delimiter and field order from settings.py (see step 2 below)
        settings = get_project_settings()
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)
2) Add the following to settings.py:
# Register the custom CSV exporter
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}
# Order of the fields in the CSV output
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info'
]
# Field delimiter
CSV_DELIMITER = ','
Here 'name', 'title' and 'info' are placeholder field names (for this weather project they would be 'date', 'state', 'temp' and 'wind'), and 'project_name' is the actual package name, spider_weather. Once this is set, running scrapy crawl spider -o spider.csv exports the fields in the specified order.
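If all you need is a fixed column order (and not a custom delimiter), note that Scrapy also provides the built-in FEED_EXPORT_FIELDS setting, which achieves the same ordering without a custom exporter. For this weather project it would look roughly like this:

# columns will appear in exactly this order in the CSV
FEED_EXPORT_FIELDS = ['date', 'state', 'temp', 'wind']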
2. Blank lines in the CSV output
You may also find a blank line between every two rows of the exported CSV. On Windows this happens because the csv writer's row endings get an extra carriage return added when they pass through the wrapping text stream, so every other line shows up empty.
Fix:
Find the CsvItemExporter class in Scrapy's exporters.py (at roughly line 215) and add newline="" to the io.TextIOWrapper that wraps the output file. Before patching, check whether your installed exporters.py already passes newline='' there; if it does, this problem should not appear with your version.
You can also subclass CsvItemExporter and override this behaviour instead of editing the installed library.
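If you would rather not patch or subclass Scrapy at all, a blunt but workable alternative is to strip the blank lines from the finished file after the crawl. A small post-processing sketch:

# remove empty lines left in the exported CSV
with open('weather.csv', encoding='utf-8') as f:
    rows = [line for line in f if line.strip()]
with open('weather.csv', 'w', encoding='utf-8', newline='') as f:
    f.writelines(rows)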
Summary
That concludes this article on crawling weather data with Python Scrapy and exporting it to a CSV file.