Example code for several Scrapy spider crawling approaches
This lesson introduces the Scrapy crawling framework, with a focus on the Spider component.
A spider can crawl in several ways:
- crawl a single page of content
- build URLs from a given list and crawl multiple pages
- find the "next page" link and follow it
- follow links into detail pages and crawl each one
An example of each is given below.
1. Crawl a single page
# by 寒小陽(yáng) (hanxiaoyang.ml@gmail.com)
import scrapy


class JulyeduSpider(scrapy.Spider):
    name = "julyedu"
    start_urls = [
        'https://www.julyedu.com/category/index',
    ]

    def parse(self, response):
        # Each course card sits in a div with class "course_info_box"
        for julyedu_class in response.xpath('//div[@class="course_info_box"]'):
            print(julyedu_class.xpath('a/h4/text()').extract_first())
            print(julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first())
            print(julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first())
            # urljoin turns the relative image path into an absolute URL
            print(response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()))
            print("\n")
            yield {
                'title': julyedu_class.xpath('a/h4/text()').extract_first(),
                'desc': julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first(),
                'time': julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first(),
                'img_url': response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()),
            }
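If you save the spider above as, say, julyedu_spider.py (the file and output names here are just placeholders), it can be run without creating a full Scrapy project, exporting every yielded dict to a JSON feed:

scrapy runspider julyedu_spider.py -o courses.json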
2. Build URLs from a given list and crawl multiple pages
# by 寒小陽(yáng) (hanxiaoyang.ml@gmail.com)
import scrapy


class CnBlogSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    # Build all ten listing URLs up front. Note that "#p2" is a URL fragment,
    # which is resolved client-side and never sent to the server, so a real
    # crawl should substitute the site's actual paging URLs.
    start_urls = [
        'http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)
    ]

    def parse(self, response):
        for article in response.xpath('//div[@class="post_item"]'):
            print(article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip())
            print(response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip())
            print(article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip())
            print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip())
            print(response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip())
            print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip())
            print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip())
            print("")
            yield {
                'title': article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip(),
                'link': response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip(),
                'summary': article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip(),
                'author': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip(),
                'author_link': response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip(),
                'comment': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip(),
                'view': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip(),
            }
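An alternative worth knowing: instead of precomputing start_urls with a list comprehension, the initial requests can be generated in start_requests(), which Scrapy calls to seed the crawl. The sketch below assumes a hypothetical ?page=N paging parameter (substitute the target site's real scheme) and reuses one field from the parse logic above:

import scrapy


class CnBlogPagesSpider(scrapy.Spider):
    name = "cnblogs_pages"
    allowed_domains = ["cnblogs.com"]

    def start_requests(self):
        # Yield one request per listing page instead of listing start_urls
        for p in range(1, 11):
            # Hypothetical paging URL, for illustration only
            url = 'http://www.cnblogs.com/pick/?page=%d' % p
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for article in response.xpath('//div[@class="post_item"]'):
            yield {
                'title': article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first(),
            }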
3. Find the "next page" link and follow it
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
            }
        # The link is the href of the <a> inside <li class="next">
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
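Since Scrapy 1.4, the pagination step can also use response.follow, which resolves relative URLs itself, so the explicit urljoin call becomes unnecessary. A minimal sketch of just the parse method, keeping the same XPaths as above:

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
            }
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            # follow() joins the relative href against the current page URL
            yield response.follow(next_page, callback=self.parse)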
4. Follow links into detail pages and crawl each one
# by 寒小陽(yáng) (hanxiaoyang.ml@gmail.com)
import scrapy


class QQNewsSpider(scrapy.Spider):
    name = 'qqnews'
    start_urls = ['http://news.qq.com/society_index.shtml']

    def parse(self, response):
        # Collect every article link on the index page and schedule a
        # request for its detail page
        for href in response.xpath('//*[@id="news"]/div/div/div/div/em/a/@href'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        print(response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first())
        print(response.xpath('//span[@class="a_time"]/text()').extract_first())
        print(response.xpath('//span[@class="a_catalog"]/a/text()').extract_first())
        print("\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()))
        print("")
        yield {
            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
            'content': "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()),
            'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),
            'cate': response.xpath('//span[@class="a_catalog"]/a/text()').extract_first(),
        }
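When the final item should also carry information that is only visible on the list page (the headline text in the index, for instance), it can travel with the request through its meta dict. A minimal sketch using the same XPaths as above; the list_title field name is illustrative:

import scrapy


class QQNewsMetaSpider(scrapy.Spider):
    name = 'qqnews_meta'
    start_urls = ['http://news.qq.com/society_index.shtml']

    def parse(self, response):
        for a in response.xpath('//*[@id="news"]/div/div/div/div/em/a'):
            full_url = response.urljoin(a.xpath('@href').extract_first())
            # Stash the list-page headline on the request so the detail
            # callback can read it back from response.meta
            yield scrapy.Request(full_url, callback=self.parse_question,
                                 meta={'list_title': a.xpath('text()').extract_first()})

    def parse_question(self, response):
        yield {
            'list_title': response.meta['list_title'],
            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
        }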
Summary
That is all of this article's example code for the several crawling approaches of a Scrapy spider; hopefully it is helpful. If anything here falls short, please leave a comment to point it out. Thanks for your support!