Converting Web Page Tables to Markdown with Python

Updated: 2025-05-28 15:28:09  Author: 嘆一曲當時只道是尋常

In day-to-day work we often need to copy table data from a web page and convert it to Markdown. This article builds a web-table-to-Markdown tool in Python for readers who need one.

In day-to-day work, we often need to copy table data from a web page and convert it to Markdown for use in documents, emails, or forum posts. Converting by hand is slow and error-prone. This article presents a tool that converts a web page table to Markdown in one step.

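Scenario

Imagine you need to copy a sales data table from your company's internal website and send it to your team. You want them to be able to read and edit the data easily, but copying and pasting directly often scrambles the formatting. A tool that converts the table to Markdown automatically would solve this.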

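Solution

The Python script in this article is written for exactly this problem. It uses the requests-html library to fetch the page and a custom function, table_to_markdown, to convert an HTML table to Markdown. It also handles merged cells: because Markdown tables have no merge syntax, the content of a merged cell is repeated in every position the cell spans.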
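The merged-cell handling can be illustrated with a small standalone sketch. This is not the script's own code: the function name `expand_spans` and the `(text, colspan, rowspan)` tuple format are assumptions made for illustration, standing in for parsed table cells.

```python
from collections import defaultdict


def expand_spans(rows):
    """Expand (text, colspan, rowspan) cells into a rectangular grid.

    `rows` is a hypothetical input format: one list per table row, each
    cell a (text, colspan, rowspan) tuple. Merged cells are repeated in
    every position they cover.
    """
    carried = defaultdict(dict)  # {row index: {column index: text}}
    grid = []
    for r, cells in enumerate(rows):
        pending = carried.pop(r, {})
        line = []
        for text, colspan, rowspan in cells:
            # Columns taken by a rowspan from an earlier row come first.
            while len(line) in pending:
                line.append(pending.pop(len(line)))
            start = len(line)
            for c in range(start, start + colspan):
                line.append(text)
                for below in range(1, rowspan):
                    carried[r + below][c] = text
        # Rowspan cells that sit to the right of this row's last cell.
        for c in sorted(pending):
            line.extend([""] * (c - len(line)))
            line.append(pending[c])
        grid.append(line)
    width = max((len(line) for line in grid), default=0)
    return [line + [""] * (width - len(line)) for line in grid]


rows = [
    [("Region", 1, 2), ("Sales", 2, 1)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("North", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]
for line in expand_spans(rows):
    print(line)
```

In this example "Region" spans two rows and "Sales" spans two columns, so both appear twice in the resulting grid.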

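Core features

Automatic link extraction: the script detects hyperlinks inside the table and keeps them as Markdown links.
Merged-cell support: cells that span multiple rows or columns are converted correctly.
Error handling: any exception raised while extracting the table is caught and reported as a readable message, so the script does not crash.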
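As an illustration of the link-extraction idea, here is a minimal sketch that uses only the standard library's `html.parser` instead of requests-html. The names `CellTextParser` and `cell_to_markdown` are hypothetical and do not appear in the script; unlike the script's approach, this variant keeps each link at its original position in the text.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect a cell's text, rewriting <a href="..."> as Markdown links."""

    def __init__(self):
        super().__init__()
        self.parts = []      # finished pieces of output, in document order
        self.href = None     # href of the <a> currently open, if any
        self.link_text = []  # text seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_data(self, data):
        # Inside a link, buffer the text; outside, emit it directly.
        (self.link_text if self.href else self.parts).append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            label = "".join(self.link_text).strip()
            if label:
                self.parts.append(f"[{label}]({self.href})")
            self.href = None


def cell_to_markdown(cell_html):
    parser = CellTextParser()
    parser.feed(cell_html)
    parser.close()
    # Collapse whitespace so the result fits on one table row.
    return " ".join("".join(parser.parts).split())


print(cell_to_markdown('See <a href="https://example.com/docs">the docs</a> for details'))
```

Anchors without an `href` are treated as plain text, which mirrors the script's rule that a link needs both a target and a label.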

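Installing dependencies

Make sure Python and the following libraries are installed (the sample code also uses rich for terminal output):

pip install requests-html pyperclip rich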
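After installing, a quick check like the following can confirm that the modules the sample code imports are available. It is only a convenience sketch; `missing_modules` is a hypothetical helper, and the list reflects the import names used in the sample code.

```python
from importlib.util import find_spec

# Import names used by the sample code; note the underscore in requests_html.
REQUIRED = ["requests_html", "pyperclip", "rich"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are installed.")
```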

Usage

Run the script and enter the URL of the target page.

The script fetches the first table on the page.

The converted Markdown table is printed to the terminal and copied to the clipboard.
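To show what the final output looks like, the sketch below renders an already-extracted grid of cell text as a Markdown table, treating the first row as the header as the script does. The function name `grid_to_markdown` is an assumption for illustration, and escaping the pipe character is a precaution shown here so that cell text cannot be read as a column separator.

```python
def grid_to_markdown(grid):
    """Render a rectangular list of rows as a Markdown table.

    The first row is treated as the header.
    """
    if not grid:
        return ""

    def fmt(row):
        # Escape pipes so cell text cannot be mistaken for a column break.
        cells = [str(cell).replace("|", "\\|") for cell in row]
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(grid[0]), "| " + " | ".join(["---"] * len(grid[0])) + " |"]
    lines.extend(fmt(row) for row in grid[1:])
    return "\n".join(lines)


print(grid_to_markdown([["Region", "Q1", "Q2"], ["North", "10", "12"]]))
```

For the sample grid this prints a header row, a `---` separator row, and one data row.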

Sample code

import logging
from collections import defaultdict

import pyperclip
from requests_html import HTMLSession
from rich.console import Console
from rich.logging import RichHandler

# Set up logging through rich
console = Console()
logging.basicConfig(
    level="INFO", format="%(message)s", datefmt="[%X]", handlers=[RichHandler(console=console)]
)
log = logging.getLogger("rich")


def extract_links(element):
    """Return the element's text with its <a> tags rendered as Markdown links."""
    if not element.html:
        return ""

    # Work on the element directly; no <body> wrapper is required
    text_parts = []
    processed_anchors = set()

    # Convert every <a> tag that has both a target and a label
    for a in element.find('a'):
        href = a.attrs.get('href', '')
        if href and a.text:
            text_parts.append(f"[{a.text.strip()}]({href})")
            # Remember the label so it is not emitted again as plain text
            processed_anchors.add(a.text.strip())

    # Remove the link labels from the plain text
    full_text = element.text
    for anchor_text in processed_anchors:
        full_text = full_text.replace(anchor_text, '')

    # Remaining plain text goes first, followed by the links
    if full_text.strip():
        text_parts.insert(0, full_text.strip())

    return ' '.join(text_parts).strip()

def _span(cell, name):
    """Read a colspan/rowspan attribute, falling back to 1 if it is missing or invalid."""
    try:
        return max(1, int(cell.attrs.get(name, 1)))
    except (TypeError, ValueError):
        return 1


def table_to_markdown(table):
    """Convert an HTML table to Markdown, repeating merged cells in every position they span."""
    rows = table.find('tr')
    if not rows:
        return ""

    table_grid = []
    # {row index: {column index: content}} for cells carried down by rowspan
    rowspan_tracker = defaultdict(dict)
    max_cols = 0

    for row_idx, row in enumerate(rows):
        carried = rowspan_tracker.pop(row_idx, {})
        current_row = []

        for cell in row.find('th, td'):
            # Columns occupied by a rowspan from an earlier row come first
            while len(current_row) in carried:
                current_row.append(carried.pop(len(current_row)))

            # Parse the cell, then normalize whitespace and escape pipes
            # so the content cannot break the Markdown row
            content = " ".join(extract_links(cell).split()).replace("|", "\\|")
            colspan = _span(cell, 'colspan')
            rowspan = _span(cell, 'rowspan')

            # Repeat the content across spanned columns and register it
            # for every spanned row below
            start = len(current_row)
            for col_idx in range(start, start + colspan):
                current_row.append(content)
                for r in range(1, rowspan):
                    rowspan_tracker[row_idx + r][col_idx] = content

        # Carried cells to the right of this row's last cell
        for col_idx in sorted(carried):
            current_row.extend([""] * (col_idx - len(current_row)))
            current_row.append(carried[col_idx])

        table_grid.append(current_row)
        max_cols = max(max_cols, len(current_row))

    # Pad every row to the same number of columns
    for row in table_grid:
        row.extend([""] * (max_cols - len(row)))

    # Build the Markdown
    markdown = []
    if table_grid:
        # Header row and separator
        markdown.append("| " + " | ".join(table_grid[0]) + " |")
        markdown.append("| " + " | ".join(["---"] * len(table_grid[0])) + " |")

        # Body rows
        for row in table_grid[1:]:
            markdown.append("| " + " | ".join(row) + " |")

    return "\n".join(markdown)

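def get_table_as_markdown(url, table_index=0, timeout=20):
    """Fetch a page, render its JavaScript, and return one of its tables as Markdown.

    On failure, a readable message is returned instead of raising.
    """
    session = HTMLSession()
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        # render() runs the page's JavaScript in headless Chromium
        response.html.render(timeout=timeout)

        tables = response.html.find('table')
        if not tables:
            return "No table found"

        if not 0 <= table_index < len(tables):
            return f"Table index out of range ({len(tables)} table(s) found)"

        return table_to_markdown(tables[table_index])

    except Exception as e:
        return f"Error: {e}"
    finally:
        session.close()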

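if __name__ == "__main__":
    url = input("Enter the page URL: ").strip()
    result = get_table_as_markdown(url)

    console.print("\nGenerated Markdown table:\n")
    # markup=False keeps rich from treating "[text]" in Markdown links as style tags;
    # soft_wrap=True keeps long table rows on a single line
    console.print(result, markup=False, highlight=False, soft_wrap=True)

    try:
        pyperclip.copy(result)
        console.print("\nCopied to the clipboard")
    except pyperclip.PyperclipException:
        console.print("\nCould not copy to the clipboard; please copy the table manually")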

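Conclusion

This script can save a good deal of manual work and keeps table formatting accurate and consistent. Whether for everyday office tasks or academic research, it is a handy helper. Give it a try and make your data handling simpler and more efficient.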
