Analyzing Async Crawler Efficiency with PyCharm Profile in Detail

Updated: 2019-05-08 10:30:27  Author: 長江CJ

This article shows how to use PyCharm Profile to analyze the efficiency of an asynchronous crawler. It walks through example code in detail and should be a useful reference for anyone learning or using PyCharm.

I'm fairly busy today, so this is a quick post.

The code below comes from the one mentioned in this video; the GitHub link is github.com/mikeckenned…

The first version, shown below, is an ordinary crawler built on a plain for loop. Original source.

import requests
import bs4
from colorama import Fore


def main():
    get_title_range()
    print("Done.")


def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

    url = f'https://talkpython.fm/{episode_number}'
    resp = requests.get(url)
    resp.raise_for_status()

    return resp.text


def get_title(html: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    header = soup.select_one('h1')
    if not header:
        return "MISSING"

    return header.text.strip()


def get_title_range():
    # Please keep this range pretty small to not DDoS my site. ;)
    for n in range(185, 200):
        html = get_html(n)
        title = get_title(html, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()

This code took 37 s to run. Next, let's use PyCharm's profiler to see exactly where the time goes.

Click Profile (file name).

This produces a detailed graph of the function call relationships and timings:

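The graph shows that get_html accounts for 96.7% of the run time. In other words, about 97% of this program's time is spent on IO: while the HTML is being fetched, the program simply sits there and waits. If it could do other useful work instead of idly waiting for IO to complete, it would save a great deal of time.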
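A similar breakdown can be reproduced outside the IDE with the standard library's cProfile and pstats modules. The sketch below is only an illustration under stated assumptions: fake_get_html and fake_get_title are hypothetical stand-ins (a time.sleep instead of a real request, a string replace instead of BeautifulSoup), so the numbers it prints say nothing about talkpython.fm itself.

```python
import cProfile
import pstats
import time


def fake_get_html(n: int) -> str:
    # Hypothetical stand-in for the blocking network call in get_html.
    time.sleep(0.05)
    return f"<h1>Episode {n}</h1>"


def fake_get_title(html: str) -> str:
    # Hypothetical stand-in for the cheap parsing step in get_title.
    return html.replace("<h1>", "").replace("</h1>", "")


def crawl() -> list:
    # Same shape as get_title_range: fetch, then parse, one item at a time.
    return [fake_get_title(fake_get_html(n)) for n in range(5)]


def cumulative_seconds(stats: pstats.Stats, func_name: str) -> float:
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers).
    return sum(v[3] for k, v in stats.stats.items() if k[2] == func_name)


profiler = cProfile.Profile()
profiler.enable()
titles = crawl()
profiler.disable()

stats = pstats.Stats(profiler)
io_share = (cumulative_seconds(stats, "fake_get_html")
            / cumulative_seconds(stats, "crawl"))
print(f"fake_get_html share of crawl time: {io_share:.1%}")

# Print the five most expensive entries by cumulative time.
stats.sort_stats("cumulative").print_stats(5)
```

Sorting by cumulative time puts the waiting function at the top of the table, which is the same signal the call graph gives visually.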

A quick calculation: how much time could asynchronous fetching with asyncio save?

get_html took 36.8 s in total over 15 calls, so fetching the HTML for a single link takes 36.8 s / 15 ≈ 2.45 s. **If the requests were fully asynchronous, fetching all 15 links would still take about 2.45 s.** Adding the 0.6 s spent in get_title, we estimate that the improved program would finish in about 3 s, a speedup of roughly 12x (37 s / 3 s).
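The estimate can be written out as a few lines of arithmetic. This is a back-of-the-envelope sketch using only the figures quoted above; the variable names are made up for illustration, and the model is idealized in that it assumes all 15 requests overlap perfectly. With unrounded values the speedup comes out at roughly 12x.

```python
# Figures read off the profile of the sequential version.
get_html_total = 36.8    # seconds spent in get_html across all calls
get_html_calls = 15      # number of calls to get_html
get_title_total = 0.6    # seconds spent in get_title across all calls
sequential_total = 37.0  # wall-clock time of the sequential run

# Average latency of a single request.
per_request = get_html_total / get_html_calls

# Idealized async run: every request overlaps, so the fetch phase costs
# about one request's latency, plus the parsing time that stays sequential.
estimated_async = per_request + get_title_total

speedup = sequential_total / estimated_async

print(f"per request:     {per_request:.2f} s")
print(f"estimated async: {estimated_async:.2f} s")
print(f"speedup:         {speedup:.1f}x")
```

A real run will usually fall short of this bound, since servers and connection pools limit how many requests truly proceed in parallel.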

Now let's look at the improved code. Original source.

import asyncio
from asyncio import AbstractEventLoop

import aiohttp
import requests
import bs4
from colorama import Fore


def main():
    # Create loop
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_title_range(loop))
    print("Done.")


async def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

    # Make this async with aiohttp's ClientSession
    url = f'https://talkpython.fm/{episode_number}'
    # resp = await requests.get(url)
    # resp.raise_for_status()

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()

            html = await resp.text()
            return html


def get_title(html: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    header = soup.select_one('h1')
    if not header:
        return "MISSING"

    return header.text.strip()


async def get_title_range(loop: AbstractEventLoop):
    # Please keep this range pretty small to not DDoS my site. ;)
    tasks = []
    for n in range(190, 200):
        tasks.append((loop.create_task(get_html(n)), n))

    for task, n in tasks:
        html = await task
        title = get_title(html, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()
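The important point in this version is that every request is started before any of them is awaited, so the waits overlap. The sketch below isolates that effect with simulated latency. It is not the article's code: fake_get_html is a hypothetical stand-in that uses asyncio.sleep instead of aiohttp, and it uses asyncio.gather and asyncio.run in place of the loop.create_task list above.

```python
import asyncio
import time


async def fake_get_html(n: int, delay: float = 0.1) -> str:
    # Hypothetical stand-in for the aiohttp request. asyncio.sleep yields
    # control to the event loop, just like waiting on a socket would.
    await asyncio.sleep(delay)
    return f"<h1>Episode {n}</h1>"


async def fetch_all(numbers) -> list:
    # gather schedules all coroutines concurrently and returns their
    # results in the same order as the inputs.
    return await asyncio.gather(*(fake_get_html(n) for n in numbers))


start = time.perf_counter()
pages = asyncio.run(fetch_all(range(190, 200)))
elapsed = time.perf_counter() - start

print(f"Fetched {len(pages)} pages in {elapsed:.2f} s")
```

Ten simulated requests of 0.1 s each would take about 1 s back to back; run concurrently, the total stays close to the latency of a single request.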

Following the same steps as before, generate the profile graph for this version: