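Crawler code implemented in Python
This article walks through an example of a web crawler implemented in Python, shared here for reference. The details are as follows:
It relies mainly on the urllib2 and BeautifulSoup modules. Because urllib2 exists only in Python 2, the listing targets Python 2.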
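The full listing needs network and database access, so the pagination step is the easiest part to try in isolation. The following is a minimal sketch of that step, assuming the same `page/N` URL shape as the crawler; `page_links` is a hypothetical standalone name used for illustration, not a function from the article.

```python
import re


def page_links(url, total_page):
    """Return links from the page number found in url up to total_page (inclusive)."""
    match = re.search(r'page/(\d+)', url)
    if match is None:
        raise ValueError('no page/N segment in %r' % url)
    now_page = int(match.group(1))
    # The fourth positional argument of re.sub is count, so no flag is passed here.
    return [re.sub(r'page/\d+', 'page/%d' % i, url)
            for i in range(now_page, total_page + 1)]


links = page_links('http://www.whowhatwear.com/section/fashion-trends/page/1', 3)
for link in links:
    print(link)
```

If the starting page is already past `total_page`, the function returns an empty list, so the caller's loop simply does nothing.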
# encoding=utf-8
import re
import urllib2
import datetime
import MySQLdb
from bs4 import BeautifulSoup
import sys

# Python 2 only: make utf-8 the default for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding("utf-8")


class Splider(object):
    def __init__(self):
        print u'Starting to crawl content...'

    # Fetch the HTML source of a page
    def getsource(self, url):
        headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2652.0 Safari/537.36'}
        req = urllib2.Request(url=url, headers=headers)
        socket = urllib2.urlopen(req)
        try:
            content = socket.read()
        finally:
            socket.close()
        return content

    # changepage builds the links for the different page numbers
    def changepage(self, url, total_page):
        now_page = int(re.search(r'page/(\d+)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            # the 4th positional argument of re.sub is count, so re.S must not be passed there
            link = re.sub(r'page/(\d+)', 'page/%d' % i, url)
            page_group.append(link)
        return page_group

    # Fetch the content of a child page
    def getchildrencon(self, child_url):
        conobj = {}
        content = self.getsource(child_url)
        soup = BeautifulSoup(content, 'html.parser', from_encoding='utf-8')
        content = soup.find('div', {'class': 'c-article_content'})
        img = re.findall('src="(.*?)"', str(content), re.S)
        conobj['con'] = content.get_text()
        conobj['img'] = ';'.join(img)
        return conobj

    # Extract the articles from a listing page
    def getcontent(self, html_doc):
        soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
        tag = soup.find_all('div', {'class': 'promo-feed-headline'})
        info = {}
        i = 0
        for link in tag:
            info[i] = {}
            title_desc = link.find('h3')
            info[i]['title'] = title_desc.get_text()
            post_date = link.find('div', {'class': 'post-date'})
            # keep only the YYYY-MM-DD part of the date attribute
            pos_d = post_date['data-date'][0:10]
            info[i]['content_time'] = pos_d
            info[i]['source'] = 'whowhatwear'
            source_link = link.find('a', href=re.compile(r"section=fashion-trends"))
            source_url = 'http://www.whowhatwear.com' + source_link['href']
            info[i]['source_url'] = source_url
            # fetch the article page itself for the body, summary and images
            in_content = self.getsource(source_url)
            in_soup = BeautifulSoup(in_content, 'html.parser', from_encoding='utf-8')
            soup_content = in_soup.find('section', {'class': 'widgets-list-content'})
            info[i]['content'] = soup_content.get_text().strip('\n')
            text_con = in_soup.find('section', {'class': 'text'})
            # find() returns None when the section is missing; fall back to an empty summary
            summary = text_con.get_text().strip('\n') if text_con is not None else ''
            info[i]['summary'] = summary[0:200] + '...'
            img_list = re.findall('src="(.*?)"', str(soup_content), re.S)
            info[i]['imgs'] = ';'.join(img_list)
            info[i]['create_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            i += 1
        return info

    # Save the crawled articles to MySQL
    def saveinfo(self, content_info):
        conn = MySQLdb.Connect(host='127.0.0.1', user='root', passwd='123456', port=3306, db='test', charset='utf8')
        cursor = conn.cursor()
        # the values are passed separately so the driver escapes every field
        sql = "insert into t_fashion_spider2(`title`,`summary`,`content`,`content_time`,`imgs`,`source`,`source_url`,`create_time`) values (%s,%s,%s,%s,%s,%s,%s,%s)"
        for each in content_info:
            for k, v in each.items():
                cursor.execute(sql, (v['title'], v['summary'], v['content'], v['content_time'], v['imgs'], v['source'], v['source_url'], v['create_time']))
        conn.commit()
        cursor.close()
        conn.close()


if __name__ == '__main__':
    classinfo = []
    p_num = 5
    url = 'http://www.whowhatwear.com/section/fashion-trends/page/1'
    jikesplider = Splider()
    all_links = jikesplider.changepage(url, p_num)
    for link in all_links:
        print u'Processing page: ' + link
        html = jikesplider.getsource(link)
        info = jikesplider.getcontent(html)
        classinfo.append(info)
    jikesplider.saveinfo(classinfo)
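Scraped titles and summaries often contain quotes, so the insert in `saveinfo` is safest when the values are handed to the driver separately from the SQL text. The sketch below illustrates that idea under clearly stated assumptions: an in-memory SQLite database stands in for the MySQL server, only three columns of `t_fashion_spider2` are reproduced, and `save_rows` is a hypothetical helper name. With MySQLdb the placeholder would be `%s` where sqlite3 uses `?`.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the MySQL table
# t_fashion_spider2; only three of its columns are reproduced here.
conn = sqlite3.connect(':memory:')
conn.execute('create table t_fashion_spider2 (title text, summary text, source text)')


def save_rows(conn, rows):
    """Insert each dict in rows, passing values separately from the SQL text."""
    # sqlite3 uses ? placeholders; MySQLdb uses %s in the same position.
    sql = 'insert into t_fashion_spider2 (title, summary, source) values (?, ?, ?)'
    conn.executemany(sql, [(r['title'], r['summary'], r['source']) for r in rows])
    conn.commit()


rows = [
    {'title': "It's a 'quoted' title",
     'summary': 'a; drop table x; --',
     'source': 'whowhatwear'},
]
save_rows(conn, rows)
stored = conn.execute('select title, summary, source from t_fashion_spider2').fetchall()
print(stored)
```

The quoted title and the SQL-looking summary are stored verbatim, because the driver treats them as data and never splices them into the statement.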
I hope this article is helpful for your Python programming.