Python Requests: A Walkthrough of Scraping Baidu Weight Query Results from seo.chinaz.com

Updated: 2019-08-13 09:42:05  Author: Leslie-x

This article walks through using Requests to scrape Baidu weight query results from seo.chinaz.com, with example code for each step. It may be a useful reference for study or work.

1: Script requirements

Use Python 3 to look up the weight of each website and store the results automatically in a local MySQL database, while also exporting the results to an Excel spreadsheet.

Database type: MySQL

Table name: website_weight

Table columns: id, main_url (the website to query), and website_weight (the website's weight)

Websites to query: read from an Excel spreadsheet
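
The table described above could be created with a statement along the following lines. This is only a sketch: the article gives just the column names, so the column types and the helper name `create_table` are assumptions.

```python
# Sketch only: the column types below are assumptions; the article
# specifies just the column names id, main_url and website_weight.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS website_weight (
    id INT AUTO_INCREMENT PRIMARY KEY,
    main_url VARCHAR(255) NOT NULL,
    website_weight INT NOT NULL DEFAULT 0
)
"""


def create_table(cursor):
    """Create the website_weight table using any DB-API cursor (hypothetical helper)."""
    cursor.execute(CREATE_TABLE_SQL)
```

Because the helper only needs an object with an `execute` method, it can be called with the cursor obtained from the database connection before the first insert.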

2: Implementation

Step 1: Parse the Excel file with the openpyxl module and read the websites to query into a list

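# Parse the Excel file and collect every URL from column B
from openpyxl import load_workbook

def get_urls(file_path):
    wb = load_workbook(file_path)
    sheet = wb.active
    urls = []
    # Column B holds the URLs; skip the header cell B1
    for cell in list(sheet.columns)[1]:
        if cell != sheet['B1']:
            urls.append(cell.value)
    # Return the workbook too, so the results can be written back later
    return wb, urls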

Step 2: Analyze how the request is sent, forge the request, and fetch the HTML page

import random
import requests

# Forge the request and fetch the HTML page
def get_html(url):
    # Define the HTTP request headers
    headers = {}
    # random.randint(1, 99) yields a random integer from 1 to 99,
    # so the User-Agent string varies between requests
    headers['User-Agent'] = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537."
        + str(random.randint(1, 99)))
    # Use the query page of the target site as the Referer
    headers['Referer'] = "http://seo.chinaz.com/" + url + "/"
    html = ''
    try:
        html = requests.get("http://seo.chinaz.com/" + url + "/", headers=headers, timeout=5).text
    except Exception:
        # On any failure, fall through and return an empty string
        pass
    return html
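
Because get_html returns an empty string on any failure, a single timeout ends up recorded as a weight of 0. One way to soften this is a small retry wrapper. The following is a minimal sketch, not part of the original script; the name `fetch_with_retry` and the default values for `attempts` and `delay` are assumptions.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) until it returns a non-empty page or attempts run out.

    fetch is any callable that returns '' on failure, like get_html.
    The name and the default values are illustrative assumptions.
    """
    for attempt in range(attempts):
        html = fetch(url)
        if html:
            return html
        # Wait before the next try, but not after the final one
        if attempt < attempts - 1:
            time.sleep(delay)
    return ''
```

It would be called as `fetch_with_retry(get_html, url)` in place of `get_html(url)`.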

Step 3: Analyze the HTML page and extract the data with the BeautifulSoup module

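import re
from bs4 import BeautifulSoup as bs

# Extract the data from the HTML page with BeautifulSoup
def get_data(html, url):
    # An empty page means the request failed; record a weight of 0
    if not html:
        return url, 0
    soup = bs(html, "lxml")
    # The weight icon sits inside <p class="ReLImgCenter">
    p_tag = soup.select("p.ReLImgCenter")[0]
    src = p_tag.img.attrs["src"]
    # The digit in the icon's file name is the weight
    regexp = re.compile(r'^http:.*?(\d)\.gif')
    br = regexp.findall(src)[0]
    return url, br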
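
get_data raises an IndexError when the icon URL does not match the pattern, and it captures only a single digit. A more defensive version of the extraction step might look like the sketch below. The helper name `extract_weight` is hypothetical, and the assumption that the file name is simply the weight followed by `.gif` comes from the regular expression above, not from any documentation of the site.

```python
import re

# Assumption: the icon file name is the weight followed by ".gif";
# \d+ also accepts multi-digit values such as 10.
WEIGHT_PATTERN = re.compile(r'(\d+)\.gif$')


def extract_weight(src):
    """Return the weight encoded in an icon URL, or 0 if none is found."""
    match = WEIGHT_PATTERN.search(src or '')
    return int(match.group(1)) if match else 0
```

With a made-up URL such as `http://example.com/icons/baidu/5.gif` this returns the integer 5, and it returns 0 rather than raising when no digit is present.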

Step 4: Configure the database connection and obtain a cursor

import pymysql

# Connect to the database
def get_connect():
    conn = pymysql.connect(
        host='127.0.0.1',
        port=3306,
        user='root',
        password='root',
        database='seotest',
        charset="utf8")
    # Get a cursor object
    cursor = conn.cursor()
    return conn, cursor
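
The connection settings above are hard-coded. If the script is shared or run on another machine, they could instead be read from environment variables. The following is a sketch under that assumption; the helper name `get_db_config` and the `SEO_DB_*` variable names are hypothetical, and the defaults simply mirror the values used in get_connect.

```python
import os


def get_db_config(env=None):
    """Build connection settings from environment variables.

    The SEO_DB_* names are hypothetical; the defaults mirror the
    values hard-coded in get_connect.
    """
    env = os.environ if env is None else env
    return {
        'host': env.get('SEO_DB_HOST', '127.0.0.1'),
        'port': int(env.get('SEO_DB_PORT', '3306')),
        'user': env.get('SEO_DB_USER', 'root'),
        'password': env.get('SEO_DB_PASSWORD', 'root'),
        'database': env.get('SEO_DB_NAME', 'seotest'),
        'charset': 'utf8',
    }
```

The resulting dictionary could then be passed on as `pymysql.connect(**get_db_config())`.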

Step 5: Write the main program logic

import sys

if __name__ == "__main__":
    # Read the Excel file path from the command line
    file_path = sys.argv[1]
    # Get the URL list and the Excel workbook
    wb, urls = get_urls(file_path)
    # Get the database connection and cursor
    conn, cursor = get_connect()
    # Get the workbook's active sheet
    sheet = wb.active
    # SQL insert statement
    sql_insert = '''insert into website_weight(main_url, website_weight) values (%s, %s)'''

    for row, url in enumerate(urls):
        if not url:
            continue
        html = get_html(url)
        data = get_data(html, url)
        # Insert the data into the database
        cursor.execute(sql_insert, data)
        # Write the weight into column C of the matching row (row 1 is the header)
        cell = sheet.cell(row=row + 2, column=3)
        cell.value = data[1]
        # Print the inserted data to the terminal
        print(data)
    conn.commit()
    conn.close()
    wb.save(file_path)
    wb.close()

# Command line: python3 F:\算法與結構\網站權重.py F:\website.xlsx
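
The main loop writes each result to `row + 2` because list index 0 corresponds to worksheet row 2 (row 1 is the header), and empty cells are skipped without shifting the rows that follow. The mapping can be isolated in a small helper, sketched here with a hypothetical function name:

```python
def rows_with_urls(urls, first_data_row=2):
    """Pair each non-empty URL with the worksheet row it was read from.

    Row 1 is the header, so list index 0 corresponds to row 2.
    The function name is hypothetical.
    """
    return [(index + first_data_row, url)
            for index, url in enumerate(urls)
            if url]
```

For example, `rows_with_urls(['a.com', None, 'b.com'])` returns `[(2, 'a.com'), (4, 'b.com')]`: the empty cell in row 3 is skipped, and the third URL still maps to row 4.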

3: Running the script and its results

Run the script from CMD