Converting PDF to Word with Python: A Detailed Guide

Updated: 2023-02-16 11:01:23 Author: Sir 老王

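Most PDF files are effectively read-only, so when a document needs to be edited it is often easiest to convert the PDF directly into a Word file. This article collects several ways to do that; hopefully they are useful.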

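Most of the articles online about converting PDF files to Word with Python seem fairly complicated, and charts and tables often need special handling.

This article explains how to convert a PDF to Word with Python. No GUI is used this time.

Because version conflicts are possible, the versions of the third-party (non-standard-library) Python packages used during development are listed here.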

Python version: 3.6.8
PyMuPDF version: 1.18.17
pdf2docx version: 0.5.1

These third-party packages can be installed with pip.

pip install PyMuPDF==1.18.17

pip install pdf2docx==0.5.1
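To confirm that the pinned versions are the ones actually installed, something like the following sketch may help. It assumes Python 3.8 or later, because it relies on `importlib.metadata` (on the 3.6.8 interpreter listed above, `pip show` is the simpler route). The names `PINNED`, `installed_version` and `find_mismatches` are illustrative, not part of any library.

```python
# Pinned versions taken from the list above.
PINNED = {"PyMuPDF": "1.18.17", "pdf2docx": "0.5.1"}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    # Assumption: Python 3.8+, where importlib.metadata is in the standard library.
    from importlib.metadata import PackageNotFoundError, version
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup):
    """Return {package: (wanted, found)} for packages whose version differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)  # None means "not installed"
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED, installed_version).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Passing the lookup function in as an argument keeps the comparison logic separate from the environment, so it can be exercised without the packages being present.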

Once the dependencies above are installed, import pdf2docx into the code.

# Importing the Converter class from the pdf2docx module.
from pdf2docx import Converter

Next, write the function that does the work: a new pdfToWord function handles the conversion logic. It takes only a few lines, so it is fairly simple.

def pdfToWord(pdf_file_path=None, word_file_path=None):
    """
    It takes a pdf file path and a word file path as input, and converts the pdf file to a word file.

    :param pdf_file_path: The path to the PDF file you want to convert
    :param word_file_path: The path to the word file that you want to create
    """
    # Creating a Converter object.
    converter_ = Converter(pdf_file_path)
    # The `convert` method takes the path to the word file that you want to create, and the start and end pages of the PDF
    # file that you want to convert.
    converter_.convert(word_file_path, start=0, end=None)
    converter_.close()
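As a variation on the function above, the sketch below exposes the page range and closes the converter even if the conversion raises. It is only an illustration: `convert_pdf` and its `converter_cls` parameter are hypothetical names introduced here, and the reading of `start`/`end` as zero-based page indexes with `end` excluded is my understanding of pdf2docx rather than something stated in this article, so check it against the version you have installed.

```python
def convert_pdf(pdf_path, docx_path, start=0, end=None, converter_cls=None):
    """Convert pages [start, end) of pdf_path into docx_path.

    converter_cls is a hypothetical hook for swapping in a stand-in class;
    by default the real pdf2docx Converter is used.
    """
    if converter_cls is None:
        # Imported lazily so the module can be loaded without pdf2docx.
        from pdf2docx import Converter as converter_cls
    converter = converter_cls(pdf_path)
    try:
        # Assumption: start/end are zero-based and end=None means "to the last page".
        converter.convert(docx_path, start=start, end=end)
    finally:
        # Release the underlying PDF handle even when conversion fails.
        converter.close()
```

For example, `convert_pdf('in.pdf', 'out.docx', start=1, end=3)` would, under that assumption, convert only the second and third pages.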

Finally, call pdfToWord from the main block to perform the conversion.

# A special variable in Python that evaluates to `True` if the module is being run directly by the Python interpreter, and
# `False` if it has been imported by another module.
if __name__ == '__main__':
    pdfToWord('D:/test-data-work/test_pdf.pdf', 'D:/test-data-work/test_pdf.docx')

# Parsing Page 2: 2/5...Ignore Line "∑" due to overlap
# Ignore Line "∑" due to overlap
# Ignore Line "?" due to overlap
# Ignore Line "Ａ" due to overlap
# Ignore Line "ｉ?＝１" due to overlap
# Ignore Line "?" due to overlap
# Parsing Page 5: 5/5...
# Creating Page 5: 5/5...
# --------------------------------------------------
# Terminated in 3.2503201s.

Additional methods

Besides the method above, here are a few other approaches for readers who need them.

Method 1:

import os

import PySimpleGUI as sg
from pdf2docx import Converter


def pdf2word(file_path):
    # Swap the extension for .docx; splitext copes with dots elsewhere in the path
    doc_file = os.path.splitext(file_path)[0] + '.docx'
    p2w = Converter(file_path)
    p2w.convert(doc_file, start=0, end=None)
    p2w.close()
    return doc_file


def main():
    # Choose a theme
    sg.theme('DarkAmber')

    layout = [
        [sg.Text('pdfToword', font=('Microsoft YaHei', 12)),
         sg.Text('', key='filename', size=(50, 1), font=('Microsoft YaHei', 10))],
        [sg.Output(size=(80, 10), font=('Microsoft YaHei', 10))],
        [sg.FilesBrowse('Select files', key='file', target='filename'), sg.Button('Start conversion'), sg.Button('Exit')]]
    # Create the window
    window = sg.Window("張臥虎", layout, font=("Microsoft YaHei", 15), default_element_size=(50, 1))
    # Event loop
    while True:
        # window.read() returns two values: (1) the event, (2) the values
        event, values = window.read()
        if event in (None, 'Exit'):
            break
        print(event, values)

        if event == 'Start conversion':
            # Multiple selected files arrive as one string separated by ';'
            files = [f for f in values['file'].split(';') if f]
            if files and all(f.lower().endswith('.pdf') for f in files):
                print('Number of files: {}'.format(len(files)))
                for f in files:
                    filename = pdf2word(f)
                    print('\n' + 'Conversion succeeded!' + '\n')
                    print('File saved to:', filename)
            else:
                print('Please select PDF files!')

    window.close()


if __name__ == '__main__':
    main()
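The GUI above has to turn a selection string into a list of PDF paths and matching output names. The sketch below isolates that path handling using `os.path.splitext`, which keeps working when a folder or file name contains extra dots. The helper names are illustrative, and the `;` separator is assumed from how the code above splits the selection.

```python
import os


def split_selection(selection, sep=';'):
    """Split a separator-joined selection string into individual paths."""
    return [path for path in selection.split(sep) if path]


def is_pdf(path):
    """True if the path ends in a .pdf extension, ignoring case."""
    return os.path.splitext(path)[1].lower() == '.pdf'


def docx_path_for(path):
    """Return the path with its final extension replaced by .docx."""
    return os.path.splitext(path)[0] + '.docx'
```

Splitting on the first `.` in the whole path would cut `C:/my.docs/report.pdf` at `C:/my`, which is why only the final extension is examined here.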

Method 2:

Converting an encrypted PDF to Word

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Note: these imports and calls appear to follow the pdfminer3k package layout;
# pdfminer.six organizes the same classes differently.
import os

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTCurve, LTFigure, LTImage, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFDocument, PDFParser

# Set the working directory
os.chdir(r'c:/users/dicey/desktop/codes/pdf-docx')


# Parse the PDF file
def parse(pdf_path):
    # Open in binary read mode; the with block closes the file afterwards
    with open(pdf_path, 'rb') as fp:
        # Create a PDF parser from the file object
        parser = PDFParser(fp)
        # Create a PDF document
        doc = PDFDocument()
        # Connect the parser and the document object
        parser.set_document(doc)
        doc.set_parser(parser)
        # Supply the password for initialization
        # If there is no password, an empty string is used
        doc.initialize()
        # Check whether the document allows text extraction; stop if it does not
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Counters for pages, images, curves, figures and horizontal text boxes
        num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0
        # Loop over the pages, handling one page at a time
        for page in doc.get_pages():  # doc.get_pages() returns the page list
            num_page += 1  # one more page
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTImage):  # image object
                    num_image += 1
                if isinstance(x, LTCurve):  # curve object
                    num_curve += 1
                if isinstance(x, LTFigure):  # figure object
                    num_figure += 1
                if isinstance(x, LTTextBoxHorizontal):  # text content
                    num_TextBoxHorizontal += 1  # one more horizontal text box
                    # Save the text; this is the name and path of the generated doc file
                    with open(r'test2.doc', 'a', encoding='utf-8') as f:
                        results = x.get_text()
                        f.write(results)
                        f.write('\n')
        print('Object counts:\n', 'Pages: %s\n' % num_page, 'Images: %s\n' % num_image,
              'Curves: %s\n' % num_curve, 'Horizontal text boxes: %s\n' % num_TextBoxHorizontal)


if __name__ == '__main__':
    pdf_path = r'diya.pdf'  # path and name of the PDF file
    parse(pdf_path)
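The loop above keeps one hand-written counter per layout type. As a rough sketch, the same tally can be kept in a `collections.Counter` keyed by class name. `count_by_type` is a hypothetical helper, and the demo classes are stand-ins for illustration, not pdfminer's layout classes.

```python
from collections import Counter


def count_by_type(objects, types):
    """Count how many objects are instances of each class in types."""
    counts = Counter()
    for obj in objects:
        for cls in types:
            # isinstance also matches subclasses, as in the loop above
            if isinstance(obj, cls):
                counts[cls.__name__] += 1
    return counts


if __name__ == "__main__":
    # Stand-in classes used only to demonstrate the helper.
    class Image:
        pass

    class Curve:
        pass

    sample = [Image(), Curve(), Curve()]
    for name, total in count_by_type(sample, (Image, Curve)).items():
        print(name, total)
```

Adding another type to track then means extending the `types` tuple rather than adding a variable and an `if` branch.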
