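A Targeted Python Crawler for Campus Forum Posts

Updated: 2018-07-23 14:13:43  Author: lannooooooooooo

This article walks through a targeted Python crawler that collects post information from a campus forum. It may serve as a reference for similar projects.

Introduction

I wrote this small crawler mainly to collect internship postings from my campus forum. It fetches pages with the Requests library and parses them with lxml. The code is written for Python 2 (it uses print statements and unichr).

Source code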

URLs.py

Given an initial URL that contains a page-number parameter, this module returns the list of URLs running from the page number in that URL up to pageNum.

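import re

def getURLs(url, attr, pageNum=1):
  # Return the URLs from the page number found in url up to pageNum (inclusive)
  all_links = []
  try:
    pattern = re.escape(attr) + r'=(\d+)'
    now_page_number = int(re.search(pattern, url).group(1))
    for i in range(now_page_number, pageNum + 1):
      # No flags are needed here; the fourth positional argument of re.sub is count
      new_url = re.sub(pattern, attr + '=%s' % i, url)
      all_links.append(new_url)
  except TypeError:
    print "arguments TypeError:attr should be string."
  # Always return a list so that callers can iterate over the result
  return all_links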
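To make the behaviour of the URL generator concrete, here is a minimal Python 3 sketch of the same idea with a usage example. The name page_urls is hypothetical and not part of the article's code; unlike getURLs, it returns an empty list when the parameter is missing from the URL.

```python
import re

def page_urls(url, attr, last_page):
    """Return the URLs from the page number found in `url` up to `last_page`."""
    pattern = re.compile(re.escape(attr) + r'=(\d+)')
    match = pattern.search(url)
    if match is None:
        # The parameter is not present in the URL, so there is nothing to expand
        return []
    first_page = int(match.group(1))
    return [
        pattern.sub('%s=%d' % (attr, n), url, count=1)
        for n in range(first_page, last_page + 1)
    ]

# Same board URL as in the main crawler, expanded to pages 1 through 3
urls = page_urls('http://www.cc98.org/list.asp?boardid=459&page=1&action=', 'page', 3)
for u in urls:
    print(u)
```

Each generated URL differs from the initial one only in the value of the page parameter.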

uni_2_native.py

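The Chinese text in pages fetched from the forum arrives as numeric character references of the form &#XXXX; (XXXX being a decimal Unicode code point) rather than as literal characters, so the page content has to be converted after it is downloaded.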

import sys
import re
reload(sys)
# Python 2 only: make implicit str/unicode conversions use UTF-8
sys.setdefaultencoding('utf-8')

def get_native(raw):
  # Replace every decimal reference &#NNNN; with the character it encodes.
  # Matching digits only keeps int() from failing on other kinds of entity.
  return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), raw)
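For readers on Python 3, the conversion can be sketched as follows. This is an illustration rather than the author's code: get_native_py3 is a hypothetical name, and html.unescape is a standard-library alternative that also handles hexadecimal and named entities.

```python
import html
import re

def get_native_py3(raw):
    """Replace decimal references such as &#20013; with the characters they encode."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), raw)

sample = '&#20013;&#25991; &amp; text'

# The regex version touches only decimal references and leaves &amp; alone
converted = get_native_py3(sample)
print(converted)

# html.unescape decodes every kind of HTML entity in one call
unescaped = html.unescape(sample)
print(unescaped)
```

Whether the broader behaviour of html.unescape is wanted depends on the page: if the markup is parsed afterwards, decoding &amp; or &lt; too early could change how it parses.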

Saving to the database: saveInfo.py

Despite the class name saveSqlite, this module writes to a MySQL database through MySQLdb.

# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
  def __init__(self):
    self.infoList = []

  def saveSingle(self, author=None, title=None, date=None, url=None,reply=0, view=0):
    if author is None or title is None or date is None or url is None:
      print "No info saved!"
    else:
      singleDict = {}
      singleDict['author'] = author
      singleDict['title'] = title
      singleDict['date'] = date
      singleDict['url'] = url
      singleDict['reply'] = reply
      singleDict['view'] = view
      self.infoList.append(singleDict)

  def toMySQL(self):
    # The connection settings are placeholders; replace them with your own
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
    cursor = conn.cursor()
    # sql = "select * from info"
    # n = cursor.execute(sql)
    # for row in cursor.fetchall():
    #   for r in row:
    #     print r
    #   print '\n'
    # Remove the rows left over from the previous run
    sql = "delete from info"
    cursor.execute(sql)
    conn.commit()

    # Insert all collected posts in one batch using a parameterized statement
    sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
    params = []
    for each in self.infoList:
      params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
    cursor.executemany(sql, params)

    conn.commit()
    cursor.close()
    conn.close()


  def show(self):
    for each in self.infoList:
      print "author: "+each['author']
      print "title: "+each['title']
      print "date: "+each['date']
      print "url: "+each['url']
      print "reply: "+str(each['reply'])
      print "view: "+str(each['view'])
      print '\n'

if __name__ == '__main__':
  save = saveSqlite()
  save.saveSingle('网','aaa','2008-10-10 10:10:10','www.baidu.com',1,1)
  # save.show()
  save.toMySQL()
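The class name saveSqlite suggests SQLite, while the module itself talks to MySQL. If SQLite is what you want, the same delete-then-insert flow could look roughly like the Python 3 sketch below, which uses the standard-library sqlite3 module. The table schema, the function name save_posts, and the sample row are assumptions made for illustration; the article does not show how the info table is defined.

```python
import sqlite3

def save_posts(conn, posts):
    """Replace the contents of the info table with the given list of post dicts."""
    # Assumed schema: the article does not show the real table definition.
    # "view" is quoted because VIEW is also an SQL keyword.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS info ('
        'title TEXT, author TEXT, url TEXT, date TEXT, reply INTEGER, "view" INTEGER)'
    )
    conn.execute('DELETE FROM info')
    conn.executemany(
        'INSERT INTO info (title, author, url, date, reply, "view") '
        'VALUES (:title, :author, :url, :date, :reply, :view)',
        posts,
    )
    conn.commit()

# An in-memory database keeps the example self-contained
conn = sqlite3.connect(':memory:')
posts = [{
    'title': 'aaa', 'author': 'bob', 'url': 'www.baidu.com',
    'date': '2008-10-10 10:10:10', 'reply': 1, 'view': 1,
}]
save_posts(conn, posts)
rows = conn.execute('SELECT title, author, reply, "view" FROM info').fetchall()
print(rows)
conn.close()
```

Passing a file path instead of ':memory:' to sqlite3.connect would persist the data between runs.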

Main crawler code

import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

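# Build request headers that imitate a browser for the site you want to crawl;
# fill in the values you need and delete the entries you do not use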
headers ={
  'Accept': '',
  'Accept-Encoding': '',
  'Accept-Language': '',
  'Connection': '',
  'Cookie': '',
  'Host': '',
  'Referer': '',
  'Upgrade-Insecure-Requests': '',
  'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get information from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
  r = requests.get(url, headers=headers)
  html = uni_2_native.get_native(r.text)

  selector = etree.HTML(html)
  content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

  for each in content_tr_list:
    href = each.xpath('./td[2]/a/@href')
    title_author_time = each.xpath('./td[2]/a/@title')
    hot = each.xpath('./td[4]/text()')
    # Skip rows that lack any of the three fields, so that no value
    # is carried over from a previous row
    if not (href and title_author_time and hot):
      continue

    link = cc98 + href[0]

    # The title attribute holds three lines: title, author, and post time
    info_split = title_author_time[0].split('\n')
    title = info_split[0][1:-1]   # drop the first and last character
    author = info_split[1][3:]    # drop the 3-character prefix
    date = info_split[2][3:]

    # The fourth cell holds the reply and view counts separated by '/'
    reply_view = hot[0].strip().split('/')
    reply, view = reply_view[0], reply_view[1]
    savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"
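The string slicing in the row loop is easier to follow on sample data. The Python 3 sketch below isolates the two parsing steps; the function names and the sample strings are hypothetical, chosen only to match the offsets the crawler uses (one enclosing character on each side of the title, and a three-character prefix before the author and the time). The real forum markup may differ.

```python
def parse_title_attr(info):
    """Split a three-line title attribute into (title, author, date)."""
    lines = info.split('\n')
    title = lines[0][1:-1]   # drop the first and last character
    author = lines[1][3:]    # drop the 3-character prefix
    date = lines[2][3:]
    return title, author, date

def parse_hot(cell_text):
    """Split a 'replies/views' cell into two integers."""
    reply, view = cell_text.strip().split('/')[:2]
    return int(reply), int(view)

# Hypothetical sample values shaped like the attributes the crawler reads
sample_title = '[Summer internship]\nBy:alice\nAt:2018-07-01 10:00'
fields = parse_title_attr(sample_title)
counts = parse_hot('  12/345 \n')
print(fields)
print(counts)
```

Converting the counts to integers is a choice made here; the crawler above stores them as the strings it scraped.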
