Python Crawler Study Notes: A Single-Threaded Crawler

Updated: 2016-09-21 09:00:55  Author: 千里追風

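This article shares the code for a single-threaded crawler written in Python with the requests library, and also covers how to install and use requests. Readers who need it may find it a useful reference.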

Introduction

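This article shows how to scrape course information from 麥子學院 (maiziedu.com). The crawler is still single-threaded. Before we start, take a look at a screenshot of the result.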

Eager to try it? First open the 麥子學院 website and find the page that lists all courses, as shown below.

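Now page through the list and watch how the URL changes. The first page is at http://www.maiziedu.com/course/list/, the second at http://www.maiziedu.com/course/list/all-all/0-2/, and the third at http://www.maiziedu.com/course/list/all-all/0-3/. Each page turn increments the number after "0-" by 1. What about the first page, then? Entering http://www.maiziedu.com/course/list/all-all/0-1/ in the browser's address bar opens the first page, so the pattern holds for every page, and re.sub() is all we need to build the URL of any page.

Once we have the URLs, the next step is to get the page source. Right-click the page and choose Inspect (or Inspect Element) to see the following view.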
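The URL rewriting described above can be sketched as follows. This is a minimal illustration, assuming the `/0-N/` pattern observed above still holds; the helper name `build_page_urls` is hypothetical and not part of the original code.

```python
import re

# Matches the page segment of the listing URL, e.g. "/0-1/".
PAGE_PATTERN = re.compile(r'/0-(\d+)/')


def build_page_urls(url, total_pages):
    """Return listing URLs from the page encoded in `url` up to `total_pages`.

    `build_page_urls` is a hypothetical helper name used for illustration.
    """
    match = PAGE_PATTERN.search(url)
    if match is None:
        raise ValueError('URL does not contain a /0-N/ page segment: %s' % url)
    first_page = int(match.group(1))
    # Substitute each page number into the same URL template.
    return [PAGE_PATTERN.sub('/0-%d/' % page, url)
            for page in range(first_page, total_pages + 1)]


if __name__ == '__main__':
    for link in build_page_urls('http://www.maiziedu.com/course/list/all-all/0-1/', 3):
        print(link)
```

Compiling the pattern once avoids passing flags to `re.sub()` by position, where the fourth positional argument is `count` rather than `flags`.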

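Once you have located the courses in the page source, regular expressions make it easy to extract what we need. How exactly to extract it is up to you: looking for the patterns yourself is how you learn the most. If you really cannot work it out, read on and look at my source code.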

Full source code

# coding=utf-8
# Written for Python 3.
import re

import requests


class spider(object):
    def __init__(self):
        print("Start crawling...")

    def changePage(self, url, total_page):
        # Build the list of page URLs from the current page up to total_page.
        nowpage = int(re.search(r'/0-(\d+)/', url).group(1))
        pagegroup = []

        for i in range(nowpage, total_page + 1):
            link = re.sub(r'/0-(\d+)/', '/0-%s/' % i, url)
            pagegroup.append(link)

        return pagegroup

    def getsource(self, url):
        # Download one page and return its HTML as text.
        html = requests.get(url)
        return html.text

    def getclasses(self, source):
        # Cut out the <ul> block that holds the course list.
        classes = re.search('<ul class="zy_course_list">(.*?)</ul>', source, re.S).group(1)
        return classes

    def geteach(self, classes):
        # Split the course list into one HTML fragment per course.
        eachclasses = re.findall('<li>(.*?)</li>', classes, re.S)
        return eachclasses

    def getinfo(self, eachclass):
        # Extract the title and the number of learners from one course fragment.
        info = {}
        info['title'] = re.search('<a title="(.*?)"', eachclass, re.S).group(1)
        info['people'] = re.search('<p class="color99">(.*?)</p>', eachclass, re.S).group(1)
        return info

    def saveinfo(self, classinfo):
        # Append the collected course info to info.txt.
        with open('info.txt', 'a', encoding='utf-8') as f:
            for each in classinfo:
                f.write('title : ' + each['title'] + '\n')
                f.write('people : ' + each['people'] + '\n\n')


if __name__ == '__main__':

    classinfo = []
    url = 'http://www.maiziedu.com/course/list/all-all/0-1/'
    maizispider = spider()
    all_links = maizispider.changePage(url, 30)
    for link in all_links:
        htmlsources = maizispider.getsource(link)
        classes = maizispider.getclasses(htmlsources)
        eachclasses = maizispider.geteach(classes)

        for each in eachclasses:
            info = maizispider.getinfo(each)
            classinfo.append(info)

    maizispider.saveinfo(classinfo)

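The code above is not hard to follow; it is mostly an exercise in regular expressions. Run it directly and you get the content shown in the screenshot at the start of this article. Because this is a single-threaded crawler, it runs somewhat slowly; a multithreaded crawler will follow in a later update.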
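To look at the extraction step in isolation, here is a small offline sketch. The HTML fragment is hypothetical: it is shaped only after the patterns used in the code above, and the real page may differ. The helper names `first_group` and `parse_courses` are also assumptions made for illustration. Calling `.group(1)` directly on the result of `re.search()` raises `AttributeError` when nothing matches, so this version checks for `None` first.

```python
import re

# Hypothetical fragment shaped like the markup the patterns expect;
# the real page may look different.
SAMPLE_HTML = '''
<ul class="zy_course_list">
  <li><a title="Python Basics" href="/course/1/"></a><p class="color99">1200</p></li>
  <li><a title="Web Scraping" href="/course/2/"></a><p class="color99">860</p></li>
  <li><a href="/course/3/"></a></li>
</ul>
'''


def first_group(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else None


def parse_courses(source):
    """Extract title/people pairs, skipping entries that lack either field."""
    course_list = first_group(r'<ul class="zy_course_list">(.*?)</ul>', source)
    if course_list is None:
        return []
    courses = []
    for item in re.findall(r'<li>(.*?)</li>', course_list, re.S):
        title = first_group(r'<a title="(.*?)"', item)
        people = first_group(r'<p class="color99">(.*?)</p>', item)
        if title is not None and people is not None:
            courses.append({'title': title, 'people': people})
    return courses


if __name__ == '__main__':
    for course in parse_courses(SAMPLE_HTML):
        print(course['title'], course['people'])
```

Skipping incomplete entries is one possible policy; logging them instead would make layout changes on the site easier to notice.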

At readers' request, the installation steps for the requests library and a simple example follow.

First install the pip package manager by downloading get-pip.py. My machine has both Python 2 and Python 3 installed.

To install pip for Python 2:

python get-pip.py

To install it for Python 3:

python3 get-pip.py

Once pip is installed, install the requests library to start learning to write crawlers in Python.

Installing requests

pip3 install requests

I use Python 3; for Python 2, simply run pip install requests.
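With two interpreters on one machine, it is easy to install requests for one and run the script with the other. The following sketch (Python 3 only) checks whether the interpreter running it can import the library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if this interpreter can find the top-level module `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    if is_installed('requests'):
        import requests
        print('requests', requests.__version__)
    else:
        print('requests is not installed for this interpreter')
```

Run it with the same command you use for the crawler (for example `python3`) so that it checks the right interpreter.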

A first example

import requests

html=requests.get("http://gupowang.baijia.baidu.com/article/283878")
html.encoding='utf-8'
print(html.text)

The first line imports the requests library, the second fetches the page source with requests' get method, the third sets the encoding, and the fourth prints the text.
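The example above assumes the request succeeds. A slightly more defensive variant might look like the sketch below; the helper name `fetch_text`, the 10-second timeout, and the injectable `session` parameter are all assumptions added for illustration, while `timeout`, `raise_for_status()`, and `encoding` are standard parts of requests.

```python
def fetch_text(url, session=None, timeout=10, encoding='utf-8'):
    """Fetch `url` and return the decoded response body.

    `session` is any object with a requests-style `get(url, timeout=...)`
    method; when omitted, a real `requests.Session` is created.
    """
    if session is None:
        import requests  # imported lazily so the helper can be tried offline
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of silently returning an error page.
    response.raise_for_status()
    response.encoding = encoding
    return response.text
```

A call such as `fetch_text("http://gupowang.baijia.baidu.com/article/283878")` would then either return the page text or raise an exception that the caller can handle.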
To save the fetched page source to a text file:

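import requests

html = requests.get("http://gupowang.baijia.baidu.com/article/283878")
html.encoding = 'utf-8'
# Write the page source to news.txt; the with block closes the file afterwards.
with open("news.txt", "w", encoding="utf-8") as html_file:
    print(html.text, file=html_file)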
