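A Complete Guide to Extracting and Reading PDF Content with Python, with Example Code

Updated: 2025-10-20 08:40:41  Author: 猫头虎

This article shows how to extract everything from a PDF with Python: text, tables, images, annotations and forms, attachments, metadata, and OCR. It covers pypdf, pdfplumber, PyMuPDF, Camelot, Tika, and related tools, with runnable code and practical tips.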

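Introduction

Want to handle the text, tables, images, annotations and forms, attachments, and metadata in a PDF in one go? This tutorial builds a complete "PDF content extraction" workflow in Python, step by step. It covers the mainstream options: pypdf, pdfminer.six, pdfplumber, PyMuPDF, Camelot/tabula-py, pypdfium2, pikepdf, OCRmyPDF/Tesseract, and Apache Tika, with runnable code, practical parameters, and common pitfalls. It applies to knowledge-base construction, contract and invoice parsing, RAG/vectorization, data labeling, and automated batch processing, and to feeding AI search and question-answering systems such as ChatGPT, Claude, Gemini, Perplexity, Kimi, Tongyi Qianwen, and Copilot.

"Example code for extracting a specific region of a PDF (header, table, signature box) by exact coordinates in Python?"
"How do I turn a scanned PDF into a searchable PDF with OCRmyPDF, then extract text and tables with pdfplumber?"
"The least code needed for PyMuPDF to export images, links, and annotations (including CMYK to RGB conversion)?"
"Camelot lattice vs stream: when to choose which? Best practices for exporting to CSV/JSON?"
"How do I chunk and clean PDF text (remove headers, fix broken words and hyphens) for RAG vectorization?"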


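What can this article help you with?

PDF text extraction (preserving reading order and coordinates)
Table recognition (ruled and unruled tables, CSV/JSON export)
Image and vector export; reading links, bookmarks, annotations, forms, attachments, and metadata
OCR for scanned documents → searchable PDF / plain-text recovery
Large batches and performance: parallelism, caching, noise reduction, retries, strategies for mixed documents
Connecting to RAG/AI applications: cleaning, chunking, embedding, indexing, and retrieval evaluation

How can Python tell a born-digital PDF from a scanned PDF and route the latter to OCR automatically?
Which library is most reliable for extracting text while preserving layout?
How should Camelot be tuned to improve recall on tables without borders?
How do I read PDF form fields and attachments?
How do I handle parallelism and fault tolerance for a large number of PDFs?

The rest of the article proceeds in this order: quick selection → environment setup → options and code → OCR → tables → attachments/metadata/annotations/forms → region extraction → cleaning and performance → common pitfalls → a general-purpose scaffold.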

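I. Quick selection: your goal → which library

Need / scenario	Recommended library (primary)	Alternative / enhancement
Plain text (fast, easy)	pypdf	PyMuPDF (fast, many output formats) (pypdf.readthedocs.io)
Layout and coordinates preserved, fine-grained control	pdfminer.six / pdfplumber	PyMuPDF (`blocks`/`dict`/`html`) (pdfminersix.readthedocs.io)
Table extraction (text-based PDFs)	Camelot (lattice/stream)	tabula-py (requires Java) (camelot-py.readthedocs.io)
Images / vector graphics / links / annotations / bookmarks	PyMuPDF	pypdf (annotations, attachments, etc.), pypdfium2 (PDFium) (pymupdf.readthedocs.io)
Attachments / metadata / forms	pypdf (attachments/forms)	pikepdf (XMP/DocInfo metadata) (pypdf.readthedocs.io)
Scanned documents (mostly images), OCR	OCRmyPDF (end-to-end pipeline)	pdf2image + pytesseract (Python-side combination) (ocrmypdf.readthedocs.io)
General-purpose parsing (one interface for many formats)	Apache Tika (tika-python / client)	Suits the "any format may show up" scenario (tika.apache.org)
High-performance rendering / text search (low level)	pypdfium2 (PDFium binding)	Stronger when you need rendering or text-range/search APIs (pypdfium2.readthedocs.io)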

II. Environment setup

# Common libraries
pip install pypdf pdfminer.six pdfplumber pymupdf

# Tables
pip install "camelot-py[base]"     # 1.0+ uses PDFium by default, so Ghostscript is not required (easier to install on Linux)
pip install tabula-py               # requires a Java 8+ runtime

# OCR route 1: all-in-one
# macOS: brew install ocrmypdf ; for Linux/Windows see the OCRmyPDF documentation

# OCR route 2: Python combination
pip install pdf2image pytesseract   # the Poppler and Tesseract executables must also be installed

# Deeper, lower-level access
pip install pypdfium2 pikepdf       # PDFium binding; metadata and document structure

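III. The options in detail, with minimal code

1) pypdf: text in three lines, simple and stable

Suited to born-digital PDFs (the text can be selected; not scanned images).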

from pypdf import PdfReader

reader = PdfReader("input.pdf")
text = "\n".join((page.extract_text() or "") for page in reader.pages)
print(text)

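extract_text() accepts an orientation filter, for example page.extract_text(0) to keep only upright text.

Pros: no extra dependencies and a concise API; it can also read forms, annotations, attachments, bookmarks, and more (see below).
Cons: with complex layouts, multiple columns, or spacing-dependent text, the reading order may need post-processing.

2) pdfminer.six / pdfplumber: coordinate-level control, layout friendly

pdfminer.six: exposes characters, lines, fonts, and coordinates, giving full control.
pdfplumber: built on pdfminer.six; makes it easier to pull out tables and text blocks, crop by region, and tune extraction parameters.

pdfminer.six: iterating over page elements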

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LAParams

for page_layout in extract_pages("input.pdf", laparams=LAParams()):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text().strip())

pdfplumber: one page at a time

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for i, page in enumerate(pdf.pages, 1):
        # Text (the x/y tolerances can be tuned for smoother extraction)
        t = page.extract_text(x_tolerance=1, y_tolerance=3) or ""
        print(f"--- Page {i} ---\n{t}\n")

        # Tables (a simple first attempt)
        for table in page.extract_tables():
            for row in table:
                print(row)

pdfplumber ships with visual debugging and table extraction, and its documentation and repository examples are thorough.
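The rows that extract_tables() returns are plain lists whose cells are strings or None, so they usually need a little tidying before export. The following is a minimal sketch using only the standard library; tidy_table and to_number are hypothetical helper names introduced here, not part of pdfplumber.

```python
def tidy_table(rows):
    """Clean one table given as a list of rows (lists of str or None)."""
    # Normalise cells: None -> "", and collapse line breaks and repeated spaces.
    cleaned = [[" ".join((cell or "").split()) for cell in row] for row in rows]
    # Drop rows that are entirely empty.
    cleaned = [row for row in cleaned if any(row)]
    if not cleaned:
        return []
    # Pad ragged rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]
    # Keep only the columns that contain at least one non-empty cell.
    keep = [c for c in range(width) if any(row[c] for row in cleaned)]
    return [[row[c] for c in keep] for row in cleaned]


def to_number(cell):
    """Convert strings such as '1,200.50' to float; return other cells unchanged."""
    try:
        return float(cell.replace(",", ""))
    except ValueError:
        return cell


raw = [["Item", None, "Amount"], ["Paper\nA4", None, "1,200.50"], [None, None, None]]
table = tidy_table(raw)
print(table)
print([[to_number(cell) for cell in row] for row in table[1:]])
```

In practice each table from page.extract_tables() would be passed through tidy_table before being written to CSV. The thousands-separator handling assumes the "1,234.56" convention; other locales need a different rule.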

3) PyMuPDF (fitz): fast, with many output formats (blocks/dict/html/json)

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for page in doc:
    # "text" = plain text; "blocks" = text blocks; "dict"/"json"/"html" = structured or rich output
    print(page.get_text("blocks"))
    links = page.get_links()          # links
    annots = [a.info for a in page.annots() or []]  # annotations

Pass sort=True to get closer to reading order; HTML/JSON export is available when the layout must be preserved.
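When sort=True is not enough, for instance on two-column pages, the block tuples can be ordered by hand. The sketch below assumes the tuple layout that get_text("blocks") returns, (x0, y0, x1, y1, text, block_no, block_type), and splits columns at the page midline, which is a simplifying assumption; order_blocks is a hypothetical helper name.

```python
def order_blocks(blocks, page_width, two_columns=False):
    """Return the text of text blocks in an approximate reading order."""
    # block_type 0 is a text block, 1 is an image block.
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]

    def key(block):
        x0, y0, x1 = block[0], block[1], block[2]
        # In two-column mode, a block whose centre lies right of the page
        # midline belongs to column 1 and is read after all of column 0.
        column = 1 if two_columns and (x0 + x1) / 2 > page_width / 2 else 0
        return (column, round(y0), x0)

    return [b[4].strip() for b in sorted(text_blocks, key=key)]


# Hand-made sample in the same shape as page.get_text("blocks").
sample = [
    (310, 72, 540, 100, "right top\n", 0, 0),
    (72, 200, 300, 230, "left bottom\n", 1, 0),
    (72, 72, 300, 100, "left top\n", 2, 0),
    (72, 300, 300, 400, "<image>", 3, 1),
]
print(order_blocks(sample, page_width=612, two_columns=True))
```

With a real document the call would look like order_blocks(page.get_text("blocks"), page.rect.width, two_columns=True).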

Extracting embedded images

import fitz
doc = fitz.open("input.pdf")
for page_index, page in enumerate(doc):
    for img in page.get_images(full=True):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:   # CMYK (four colour channels): convert to RGB before saving as PNG
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"img_p{page_index}_{xref}.png")

(For image export, PyMuPDF is usually the least trouble.)

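4) Table extraction: Camelot / tabula-py

Camelot (recommended)

Two algorithms: lattice (tables drawn with ruling lines) and stream (tables aligned by whitespace).
From version 1.0 on, pypdfium2 (PDFium) is the default image-conversion backend, which makes installation lighter.

import camelot

# Lattice mode on pages 1-3; use flavor="stream" for tables without ruling lines
tables = camelot.read_pdf("tables.pdf", pages="1-3", flavor="lattice")
print(tables.n)                 # number of tables found
df = tables[0].df               # each table is available as a pandas.DataFrame
tables.export("out.csv", f="csv", compress=True)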
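Since the right flavor depends on whether the table has ruling lines, one simple approach is to try lattice first and fall back to stream when nothing is found. This is a sketch under that assumption, not Camelot's own behaviour; read_tables_with_fallback is a hypothetical helper, and the read_pdf parameter exists only so the function can be exercised without Camelot installed.

```python
def read_tables_with_fallback(path, pages="1", read_pdf=None, **stream_kwargs):
    """Try the lattice flavor first; fall back to stream if no table is found.

    Returns a (flavor_used, tables) pair.
    """
    if read_pdf is None:
        import camelot  # imported lazily; only needed for real PDFs
        read_pdf = camelot.read_pdf
    tables = read_pdf(path, pages=pages, flavor="lattice")
    if tables.n > 0:
        return "lattice", tables
    # No ruled tables detected: retry with whitespace-based detection.
    # Extra keyword arguments (for example row_tol) are passed to stream only.
    tables = read_pdf(path, pages=pages, flavor="stream", **stream_kwargs)
    return "stream", tables


# Typical call on a real file (requires Camelot):
# flavor, tables = read_tables_with_fallback("tables.pdf", pages="1-3", row_tol=10)
# print(flavor, tables.n)
```

A stricter variant could also compare the parsing report of each result instead of only checking whether any table was found.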

tabula-py (a wrapper around the Java-based Tabula)
Requires Java 8+; it also holds up well on long documents and batch jobs.

import tabula
dfs = tabula.read_pdf("tables.pdf", pages="all", multiple_tables=True)

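5) OCR for scanned documents: two routes

A. All-in-one: OCRmyPDF (strongly recommended)
The command line is all you need: automatic rotation, deskewing, parallel processing, and searchable PDF/A output.

ocrmypdf -l chi_sim+eng --rotate-pages --deskew input_scan.pdf searchable.pdf

Pages that already contain text can be skipped with --skip-text, so mixed documents are handled easily too.

B. Python combination: pdf2image + pytesseract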

from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scan.pdf", dpi=300)  # requires Poppler
full_text = []
for img in pages:
    txt = pytesseract.image_to_string(img, lang="chi_sim+eng")
    full_text.append(txt)
print("\n".join(full_text))

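pdf2image is built on Poppler's pdftoppm/pdftocairo; pytesseract is a Python wrapper around Tesseract.

Tip: if the goal is "OCR into a searchable PDF first, then extract text", generate searchable.pdf with OCRmyPDF and then extract with pypdf/pdfplumber/PyMuPDF. The quality is more consistent.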
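For mixed documents it can help to know which pages actually lack a text layer before running OCR. The sketch below works on a list of per-page strings, such as the results of page.extract_text(); the 20-character threshold and both helper names are assumptions for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is too short to trust."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def format_page_ranges(pages):
    """Collapse [1, 2, 3, 7] into '1-3,7'."""
    ranges = []
    for p in sorted(set(pages)):
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1][1] = p          # extend the current run
        else:
            ranges.append([p, p])      # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


texts = ["A" * 50, "", None, "  x  ", "B" * 20]
todo = pages_needing_ocr(texts)
print(todo, format_page_ranges(todo))
```

The range string matches the "1-3" style that Camelot's pages argument uses. Whether your OCR tool accepts the same syntax for restricting pages should be checked against its own help output.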

6) General-purpose parsing: Apache Tika

For enterprise scenarios where any format may arrive (PDF, Word, PPT, images, and so on), Tika offers a unified REST/CLI interface.
From Python, use tika-python or a more modern client.

from tika import parser
parsed = parser.from_file("input.pdf")
print(parsed["content"])     # plain text
print(parsed["metadata"])    # metadata

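7) pypdfium2: PDFium-based rendering and text search

Useful when you need lower-level text ranges, coordinate search, or rendering.

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("input.pdf")
page = pdf[0]
textpage = page.get_textpage()
# search() returns a searcher object; get_next() yields (char_index, char_count) or None
searcher = textpage.search("发票", match_case=False)
while True:
    occurrence = searcher.get_next()
    if occurrence is None:
        break
    index, count = occurrence
    # Bounding boxes of the match (usable for highlighting or cropping a region to extract)
    for i in range(textpage.count_rects(index, count)):
        left, bottom, right, top = textpage.get_rect(i)
        print(left, bottom, right, top)

For the API, see PdfPage.get_textpage(), PdfTextPage.search(), and PdfTextSearcher.get_next().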

8) Attachments, metadata, forms, annotations, bookmarks

Attachments (File Attachments): pypdf

from pypdf import PdfReader

reader = PdfReader("has_attachments.pdf")
for name, blobs in reader.attachments.items():
    for i, content in enumerate(blobs):
        with open(f"{name}-{i}", "wb") as f:
            f.write(content)

Forms (AcroForm): pypdf / PyMuPDF

# pypdf: read form fields and their values
from pypdf import PdfReader
reader = PdfReader("form.pdf")
fields = reader.get_fields()        # or reader.get_form_text_fields()
print(fields)

PyMuPDF treats form fields as Widget annotations, which can be iterated, read, and written.
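The PyMuPDF side can be sketched as follows. The function only relies on each page offering a widgets() iterator whose items expose field_name, field_type_string, and field_value; collect_form_values is a hypothetical helper name, and the PyMuPDF import is left in the usage comment so the helper itself has no dependencies.

```python
def collect_form_values(doc):
    """Map field name -> (field type, value) for every widget in the document.

    `doc` is any iterable of pages, for example an opened PyMuPDF document.
    """
    values = {}
    for page in doc:
        for widget in page.widgets():
            values[widget.field_name] = (widget.field_type_string, widget.field_value)
    return values


# Typical call on a real file (requires PyMuPDF):
# import fitz
# with fitz.open("form.pdf") as doc:
#     print(collect_form_values(doc))
```

If the same field name appears on several pages, the last occurrence wins in this sketch; collecting values into a list per name would avoid that.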

Annotations: pypdf / PyMuPDF

# pypdf: read annotation types and positions
from pypdf import PdfReader
r = PdfReader("annotated.pdf")
for page in r.pages:
    if "/Annots" in page:
        for a in page["/Annots"]:
            obj = a.get_object()
            print(obj["/Subtype"], obj["/Rect"])

(The official examples cover several annotation types: Text, Link, Highlight, and others.)

Metadata (XMP / DocumentInfo): pikepdf

import pikepdf
pdf = pikepdf.open("input.pdf")
print(pdf.docinfo)            # legacy DocumentInfo (deprecated in PDF 2.0 but still common)
meta = pdf.open_metadata()    # XMP metadata
print(meta)

pikepdf keeps the two kinds of metadata clearly apart while managing them through one interface.

IV. Region extraction (ROI): just one part of the page

pdfplumber is the most convenient here:

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    page = pdf.pages[0]
    # bbox = (x0, top, x1, bottom), in PDF points
    region = page.within_bbox((72, 72, 540, 200))
    print(region.extract_text())

(Combined with the search coordinates from pypdfium2 or PyMuPDF, you can locate a keyword first, then expand or offset the bbox and extract.)
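Combining the two libraries needs a coordinate conversion: PDFium-style rectangles are (left, bottom, right, top) with the origin at the bottom-left of the page, while a pdfplumber bbox is (x0, top, x1, bottom) measured from the top-left. The sketch below does that conversion with optional padding. It assumes an unrotated page whose origin is at (0, 0); pdf_rect_to_plumber_bbox is a hypothetical helper name.

```python
def pdf_rect_to_plumber_bbox(rect, page_width, page_height, pad=0.0):
    """Convert (left, bottom, right, top) with a bottom-left origin into
    (x0, top, x1, bottom) with a top-left origin, padded and clamped to the page."""
    left, bottom, right, top = rect
    x0 = max(0.0, left - pad)
    x1 = min(page_width, right + pad)
    # Flip the vertical axis: distance from the top edge = page height - y.
    new_top = max(0.0, page_height - top - pad)
    new_bottom = min(page_height, page_height - bottom + pad)
    return (x0, new_top, x1, new_bottom)


# A keyword found at y = 700..720 on a US Letter page (612 x 792 points),
# widened by 10 points on every side.
print(pdf_rect_to_plumber_bbox((100, 700, 200, 720), 612, 792, pad=10))
```

The result can be passed to page.within_bbox(...), using page.width and page.height from pdfplumber for the page size.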

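V. Post-processing and cleaning (crucial in practice)

Hyphenated line breaks: merge "hyphen- \n ated" → "hyphenated".
Header/footer deduplication: crop the top and bottom of each page by coordinates, or drop repeated blocks with a regular expression.
Reading order: PyMuPDF get_text("text", sort=True); with pdfplumber, adjust x_tolerance/y_tolerance.
Unicode normalization: unicodedata.normalize("NFKC", text) handles ligatures and full-width/half-width characters.
Table post-processing: align and merge cells, filter out empty columns, convert numeric types.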
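The text-related steps above can be sketched with the standard library alone. Both function names and the 60% repetition threshold are assumptions for illustration; note that the hyphen rule also joins genuine compounds that happen to break at a line end, so it is a heuristic.

```python
import re
import unicodedata
from collections import Counter


def clean_page_text(text):
    """Normalise Unicode, rejoin hyphenated line breaks, collapse spaces."""
    text = unicodedata.normalize("NFKC", text)
    # "hyphen- \n ated" -> "hyphenated"
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs but keep line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def drop_repeated_lines(pages, min_ratio=0.6):
    """Remove first/last lines that repeat on most pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        lines = page.splitlines()
        if lines:
            # A set, so a one-line page is not counted twice.
            counts.update({lines[0].strip(), lines[-1].strip()})
    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {line for line, n in counts.items() if line and n >= threshold}
    result = []
    for page in pages:
        lines = page.splitlines()
        kept = [
            line for i, line in enumerate(lines)
            if not (i in (0, len(lines) - 1) and line.strip() in repeated)
        ]
        result.append("\n".join(kept))
    return result


pages = ["ACME Report\nalpha\nConfidential",
         "ACME Report\nbeta\nConfidential",
         "ACME Report\ngamma\nConfidential"]
print(drop_repeated_lines(pages))
print(clean_page_text("hyphen- \n ated  text"))
```

Footers that contain a changing page number will not match exactly; stripping digits before counting is one way to extend the sketch.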

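VI. Performance and stability

Page-by-page streaming: read and write one page at a time rather than loading the whole PDF at once.
Parallelism: OCRmyPDF uses multiple cores out of the box; on the Python side, prefer multiple processes (concurrent calls into PDFium from multiple threads are not recommended).
Caching and retries: when batch processing over a network or shared drive, retry failed pages and keep intermediate files (such as OCR output).
Mixed documents: OCRmyPDF's --skip-text skips pages that already have text, improving both quality and speed.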
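A batch runner along these lines can be sketched with concurrent.futures. The helper names, the retry count, and the executor_cls parameter are assumptions for illustration; with the default ProcessPoolExecutor the worker must be a module-level function, and on Windows and macOS the call should sit under an if __name__ == "__main__": guard.

```python
import time
from concurrent.futures import ProcessPoolExecutor, as_completed


def call_with_retry(func, arg, retries=2, delay=0.0):
    """Call func(arg); on an exception, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return func(arg)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay)


def run_batch(paths, worker, retries=2, max_workers=None,
              executor_cls=ProcessPoolExecutor):
    """Run worker(path) for every path in parallel.

    Returns (results, errors): two dicts keyed by path, so that one broken
    file is recorded instead of stopping the whole batch.
    """
    results, errors = {}, {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(call_with_retry, worker, p, retries): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                errors[path] = repr(exc)
    return results, errors


# Typical call, where extract_one is your own module-level function:
# if __name__ == "__main__":
#     results, errors = run_batch(["a.pdf", "b.pdf"], extract_one, max_workers=4)
```

Keeping errors as a dictionary makes it straightforward to write a retry list to disk and rerun only the failed files later.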

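VII. Common pitfalls

The PDF is a scanned image: run OCR before attempting text extraction (do not analyze a raw "image-to-text" dump directly).
Multi-column or complex layouts: use blocks/html/json (PyMuPDF) or the coordinate stream from pdfplumber/pdfminer.
Table recognition fails: switch the Camelot flavor (lattice↔stream), or try tabula-py instead.
tabula-py raises an error: the Java runtime is missing.
pdf2image raises an error: Poppler is missing; on Windows it must be installed separately.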

VIII. A "general-purpose extractor" scaffold (automatic decisions + structured output)

"""
What it does:
1) Try pypdf for text first; too little or failure -> probably a scan -> run OCRmyPDF (or pdf2image + pytesseract)
2) Optional: Camelot for tables, PyMuPDF for images/links/annotations, pypdf for attachments, pikepdf for metadata
3) Output JSON: text/tables/images/links/annots/forms/attachments/metadata
"""
import json, os, subprocess, tempfile
from pypdf import PdfReader
import fitz

def is_text_pdf(path, min_chars=200):
    """Return True if the first pages already contain a usable amount of text."""
    try:
        reader = PdfReader(path)
        s = "".join((p.extract_text() or "") for p in reader.pages[:5])
        return len(s.strip()) >= min_chars
    except Exception:
        return False

def ocr_if_needed(path):
    if is_text_pdf(path):
        return path  # return unchanged
    # Try OCRmyPDF (if it is not installed, switch to pdf2image + pytesseract)
    out = os.path.join(tempfile.gettempdir(), f"ocr_{os.path.basename(path)}")
    try:
        subprocess.run(
            ["ocrmypdf", "--skip-text", "-l", "chi_sim+eng", path, out],
            check=True, capture_output=True
        )
        return out
    except Exception:
        return path  # fallback: keep using the original file (avoid aborting)

def extract_all(path):
    path = ocr_if_needed(path)
    result = {"text": "", "tables": [], "images": [], "links": [], "annots": [],
              "attachments": [], "metadata": {}, "forms": {}}

    # 1. Text (pypdf)
    r = PdfReader(path)
    result["text"] = "\n".join((p.extract_text() or "") for p in r.pages)

    # 2. Attachments and forms
    try:
        result["attachments"] = list(r.attachments.keys())
    except Exception:
        pass
    try:
        result["forms"] = r.get_fields() or {}
    except Exception:
        pass

    # 3. Metadata (pikepdf is more complete; pypdf's DocInfo is used here as a fallback)
    try:
        result["metadata"] = dict(r.metadata or {})
    except Exception:
        pass

    # 4. Images / links / annotations (PyMuPDF)
    doc = fitz.open(path)
    for i, page in enumerate(doc):
        # Images
        for img in page.get_images(full=True):
            result["images"].append({"page": i+1, "xref": img[0], "width": img[2], "height": img[3]})
        # Links
        for lk in page.get_links():
            result["links"].append({"page": i+1, **lk})
        # Annotations
        for a in page.annots() or []:
            result["annots"].append({"page": i+1, **(a.info or {})})
    doc.close()

    return result

if __name__ == "__main__":
    data = extract_all("example.pdf")
    # default=str: link rectangles and some form values are not JSON-serializable as-is
    print(json.dumps(data, ensure_ascii=False, indent=2, default=str))

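IX. Going further: a "bag of tricks" you may need

Keyword highlighting, or locate-then-extract: pypdfium2's PdfTextPage.search() gives the matched ranges and their rectangles; combine it with PyMuPDF to crop or draw a highlight layer.
HTML/JSON export: PyMuPDF get_text("html"/"json"), for front-end display or preserving styles.
Bookmarks / table of contents: pypdfium2 PdfDocument.get_toc(). (pypdfium2.readthedocs.io)
PDF/A-compliant archiving: OCRmyPDF supports --output-type pdfa by default. (GitHub)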
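Besides "html" and "json", the "dict" output is handy when style information is needed in Python itself. The sketch below walks the nested blocks → lines → spans structure and flags spans whose font size is clearly above the most common size. The 1.2 ratio and both function names are assumptions for illustration.

```python
from collections import Counter


def iter_spans(page_dict):
    """Yield (text, font_size, bbox) for every non-empty text span."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:      # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span["text"].strip():
                    yield span["text"], span["size"], tuple(span["bbox"])


def likely_headings(page_dict, ratio=1.2):
    """Return span texts whose size is at least `ratio` times the most common size."""
    spans = list(iter_spans(page_dict))
    if not spans:
        return []
    sizes = Counter(round(size, 1) for _, size, _ in spans)
    body_size = sizes.most_common(1)[0][0]
    return [text for text, size, _ in spans if size >= body_size * ratio]


# Hand-made sample in the same shape as page.get_text("dict").
sample = {"blocks": [
    {"type": 0, "lines": [{"spans": [{"text": "Title", "size": 18.0, "bbox": (72, 72, 200, 92)}]}]},
    {"type": 0, "lines": [
        {"spans": [{"text": "Body one", "size": 10.0, "bbox": (72, 100, 300, 112)}]},
        {"spans": [{"text": "Body two", "size": 10.0, "bbox": (72, 114, 300, 126)}]},
    ]},
    {"type": 1, "bbox": (72, 200, 300, 400)},
]}
print(likely_headings(sample))
```

With a real page the input would be page.get_text("dict"). Font size alone is a rough signal, so documents with large pull quotes or captions may need extra rules.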

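X. Conclusion: the right tool + the right preprocessing = half the effort

Born-digital PDFs: start with pypdf (concise) → use pdfplumber/PyMuPDF for complex layouts.
Tables: try Camelot first (experiment with both flavors), then consider tabula-py.
Scanned documents: make them searchable with OCRmyPDF first, then extract as usual.
Enterprise, many formats: use Tika as the single entry point.
High performance / low level: use pypdfium2 for rendering and text-search coordinates.

Good luck with your PDF "mining"!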
