Exporting Word Documents to PDF and Extracting Data from Word Documents with Python

Updated: 2025-11-26 09:30:48 Author: 諸神緘默不語

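At work you often need to convert Word documents to PDF, for example to generate reports, distribute documents, or archive them, because PDF keeps its layout and displays consistently across platforms. This article shows how to use Python to export a Word document to PDF and how to extract data from Word documents.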

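1. Exporting a Word Document to PDF

Preparing the environment

Python must be version 3.8-3.14 (a pywin32 requirement). This approach drives Microsoft Word through COM, so it only works on Windows with Word installed.
Install pywin32 with pip: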

pip install pywin32

Complete code, ready to copy and run

import os
from win32com import client as wc

def convert_word_to_pdf(word_path: str) -> str:
    """
    Convert a single Word document (doc/docx) to PDF and return the path of the converted file.

    Args:
        word_path (str): Path to the Word file

    Returns:
        str: Path to the converted PDF file

    Raises:
        FileNotFoundError: If the file does not exist
        ValueError: If the file is not a Word document or the path is invalid
        Exception: Any other error during conversion
    """
    # Validate the argument
    if not word_path or not isinstance(word_path, str):
        raise ValueError("The file path must be a non-empty string")

    # Check that the file exists
    if not os.path.exists(word_path):
        raise FileNotFoundError(f"File not found: {word_path}")

    # Check the file extension
    valid_extensions = ('.doc', '.docx')
    if not word_path.lower().endswith(valid_extensions):
        raise ValueError(f"Not a Word file (.doc/.docx): {word_path}")

    # Reject Word's temporary lock files (their names start with "~")
    if os.path.basename(word_path).startswith('~'):
        raise ValueError(f"The file looks like a temporary file: {word_path}")

    # Check the file size (avoid processing empty or corrupted files)
    file_size = os.path.getsize(word_path)
    if file_size == 0:
        raise ValueError(f"The file is empty: {word_path}")
    
    word = None
    doc = None
    
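    try:
        # Create the Word application instance
        word = wc.Dispatch("Word.Application")
        word.Visible = False

        # Open the document. Word resolves relative paths against its own
        # working directory, so pass an absolute path.
        abs_path = os.path.abspath(word_path)
        doc = word.Documents.Open(abs_path)

        # Build the new PDF file path next to the source file
        base_path = os.path.splitext(abs_path)[0]
        new_path = base_path + ".pdf"

        # Handle file name conflicts
        count = 0
        while os.path.exists(new_path):
            count += 1
            new_path = f"{base_path}({count}).pdf"

        # Save as PDF (17 means PDF format)
        doc.SaveAs(new_path, 17)

        # Verify the converted file
        if not os.path.exists(new_path):
            raise Exception("The converted PDF file was not created")

        # Check the size of the converted file
        if os.path.getsize(new_path) == 0:
            raise Exception("The converted PDF file is empty")

        return new_path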
        
    except Exception:
        # Clean up a partially created file, if any
        if 'new_path' in locals() and os.path.exists(new_path):
            try:
                os.remove(new_path)
            except OSError:
                pass
        # Re-raise with the original traceback
        raise

    finally:
        # Make sure resources are released
        try:
            if doc:
                doc.Close()
        except Exception:
            pass

        try:
            if word:
                word.Quit()
        except Exception:
            pass


def main():
    """Entry point: convert a single file and report the result."""
    # Set the path of the file to convert directly in the code
    word_file_path = r"D:\word_files\example.docx"  # change this to the Word file you want to convert

    try:
        print(f"Converting file: {word_file_path}")

        # Convert the file
        pdf_path = convert_word_to_pdf(word_file_path)

        # Report success
        print("\n" + "="*50)
        print("File converted successfully!")
        print(f"Source file: {word_file_path}")
        print(f"Converted file: {pdf_path}")
        print(f"File size: {os.path.getsize(pdf_path)} bytes")
        print("="*50)

    except FileNotFoundError as e:
        print(f"Error: {e}")
        print("Please check that the file path is correct")
    except ValueError as e:
        print(f"Error: {e}")
        print("Please make sure the file is a valid Word document (.doc/.docx)")
    except Exception as e:
        print(f"Conversion failed: {e}")
        print("Word may be unavailable or the file may be corrupted")


if __name__ == "__main__":
    main()

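Copy the code into your Python editor and change D:\word_files\example.docx to the path of the Word file you want to convert. The converted PDF is saved next to the Word file under the same name; if a PDF with that name already exists, a (count) suffix such as (1) is appended to avoid a conflict.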
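The naming rule can be tried on its own, without Word installed. The following is a minimal sketch; the helper name next_free_pdf_path is hypothetical and is not part of the script above, although it repeats the same loop.

```python
import os
import tempfile

def next_free_pdf_path(word_path: str) -> str:
    """Return "<name>.pdf", or "<name>(1).pdf", "<name>(2).pdf", ... if that name is taken.

    Hypothetical helper that mirrors the conflict-handling loop in the script.
    """
    base_path = os.path.splitext(word_path)[0]
    candidate = base_path + ".pdf"
    count = 0
    while os.path.exists(candidate):
        count += 1
        candidate = f"{base_path}({count}).pdf"
    return candidate

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "example.docx")
    first = next_free_pdf_path(source)
    open(first, "wb").close()            # pretend a PDF with that name already exists
    second = next_free_pdf_path(source)
    demo_names = (os.path.basename(first), os.path.basename(second))

print(demo_names)
```

Because the check and the save are separate steps, two conversions running at the same time could still pick the same name; for a single script run this is usually not a concern.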

Core code

The core code for exporting a Word document to PDF is:

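import os
from win32com import client as wc

# Example paths; Word expects absolute paths
word_path = os.path.abspath(r"D:\word_files\example.docx")
new_path = os.path.splitext(word_path)[0] + ".pdf"

word = wc.Dispatch("Word.Application")
word.Visible = False

doc = word.Documents.Open(word_path)
doc.SaveAs(new_path, 17)  # 17 = wdFormatPDF
doc.Close()

word.Quit()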

This converts the Word document at word_path into a PDF at new_path. Both doc and docx files are supported.

In doc.SaveAs(file_name, file_format), the file_format argument sets the format the document is saved in.

17: wdFormatPDF - PDF
0: wdFormatDocument - Microsoft Office Word 97-2003 document (.doc)
16: wdFormatDocumentDefault - Word default document format (.docx)
7: wdFormatUnicodeText - Unicode text
8: wdFormatHTML - HTML
18: wdFormatXPS - XPS
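To avoid bare numbers such as 17 in the code, the values can be given names. This is a sketch only: the WdSaveFormat class and the save_format_for helper below are hypothetical conveniences defined here, not something pywin32 provides under these names, and the class covers only a few of Word's save formats.

```python
import os
from enum import IntEnum

class WdSaveFormat(IntEnum):
    """A small, non-exhaustive subset of Word's save-format values."""
    DOCUMENT = 0           # wdFormatDocument: Word 97-2003 (.doc)
    DOCUMENT_DEFAULT = 16  # wdFormatDocumentDefault: (.docx)
    PDF = 17               # wdFormatPDF
    XPS = 18               # wdFormatXPS

_EXTENSION_TO_FORMAT = {
    ".doc": WdSaveFormat.DOCUMENT,
    ".docx": WdSaveFormat.DOCUMENT_DEFAULT,
    ".pdf": WdSaveFormat.PDF,
    ".xps": WdSaveFormat.XPS,
}

def save_format_for(path: str) -> int:
    """Pick the file_format value from the output file's extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return int(_EXTENSION_TO_FORMAT[ext])
    except KeyError:
        raise ValueError(f"Unsupported output extension: {ext!r}") from None

print(save_format_for("report.pdf"))
```

With such a helper, the save call could read doc.SaveAs(new_path, save_format_for(new_path)), so the format follows from the output file name.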

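2. Extracting Data from Word with Python: A Resume-Processing Example

The data for this simple example is 10 randomly generated resumes that all share the same format.

Resumes encountered in real work are far more varied and need to be analyzed case by case; some may even call for a degree of AI-assisted parsing. What I provide here is only a simple sample. Running the code parses every Word resume in a directory: basic information and education background are extracted from tables, with basic information read from fixed cell positions and education background read from the data row of its table; work experience, skills, and self-evaluation are located by their section headings and paragraph format. All of this follows the layout produced by the test-data generator later in this article. The parsed results are saved to a CSV file, which Excel or WPS Spreadsheets can open directly.
If you run into more complex requirements in your own work, you can also ask me through comments or private messages.
The code that generates the test data is included below. Because the data is random, it differs on every run. If you want the exact data used in this example, message me privately for the archive.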
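The heading-driven part of this approach can be sketched without any Word file: walk the paragraph texts in order, remember which section heading was seen last, and file each following paragraph under it. The function name group_by_heading and the English headings in the sample are assumptions made for illustration; the full script later in the article works on the Chinese headings of the generated resumes and uses boolean flags rather than a dictionary.

```python
def group_by_heading(paragraphs, headings):
    """Group paragraph texts under the most recent section heading.

    Text that appears before the first known heading is ignored.
    """
    sections = {}
    current = None
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue                      # skip empty paragraphs
        if text in headings:
            current = text                # a new section starts here
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(text)
    return sections

sample = [
    "Resume",
    "Work Experience",
    "Acme | Engineer | 2018-2020",
    "Built things.",
    "",
    "Skills",
    "• Python",
    "Self-Evaluation",
    "Careful and curious.",
]
sections = group_by_heading(sample, {"Work Experience", "Skills", "Self-Evaluation"})
print(sections)
```

Keeping the grouping separate from the per-section parsing makes it easier to adapt when a real resume uses different headings.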

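Preparing the environment

Install python-docx with pip:

pip install python-docx

Preparing the test data

Install faker with pip:

pip install faker

Run the code (the resume files are generated in the resumes folder under the working directory):
(If you run this part, note that the generated Word files use 宋體 (SimSun) for Chinese text and Times New Roman for Latin text. Windows machines whose default language is Chinese usually ship with both fonts; if your system uses another language, check that they are installed.)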

import os
import random
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn

from faker import Faker

def set_document_font(doc):
    """
    設置文檔的默認字體為宋體和Times New Roman
    """
    # 設置全局樣式
    style = doc.styles['Normal']
    font = style.font
    font.name = 'Times New Roman'
    font._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋體')
    font.size = Pt(12)

def set_paragraph_font(paragraph, bold=False, size=12):
    """
    設置段落的字體
    """
    for run in paragraph.runs:
        run.font.name = 'Times New Roman'
        run._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋體')
        run.font.size = Pt(size)
        run.font.bold = bold

def generate_resume_docx(filename, resume_data):
    """
    生成單個簡歷Word文檔
    
    Args:
        filename (str): 保存的文件名
        resume_data (dict): 簡歷數(shù)據(jù)
    """
    doc = Document()
    
    # 設置文檔字體
    set_document_font(doc)
    
    # 添加標題
    title = doc.add_heading('個人簡歷', 0)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    set_paragraph_font(title, bold=True, size=16)
    
    # 添加個人信息表格
    doc.add_heading('基本信息', level=1)
    table = doc.add_table(rows=6, cols=4)
    table.style = 'Table Grid'
    
    # 填充基本信息表格
    info_cells = [
        ("姓名", resume_data['name']),
        ("性別", resume_data['gender']),
        ("年齡", str(resume_data['age'])),
        ("聯(lián)系電話", resume_data['phone']),
        ("郵箱", resume_data['email']),
        ("現(xiàn)居地", resume_data['address'])
    ]
    
    for i, (label, value) in enumerate(info_cells):
        table.cell(i, 0).text = label
        table.cell(i, 1).text = value
        table.cell(i, 2).text = "求職意向" if i == 0 else ""
        table.cell(i, 3).text = resume_data['job_intention'] if i == 0 else ""
        
        # 設置表格單元格字體
        for j in range(4):
            for paragraph in table.cell(i, j).paragraphs:
                set_paragraph_font(paragraph, bold=(j == 0 or j == 2))
    
    # 教育背景
    doc.add_heading('教育背景', level=1)
    edu_table = doc.add_table(rows=2, cols=4)
    edu_table.style = 'Table Grid'
    
    # 表頭
    headers = ["時間", "學校", "專業(yè)", "學歷"]
    for i, header in enumerate(headers):
        edu_table.cell(0, i).text = header
        for paragraph in edu_table.cell(0, i).paragraphs:
            set_paragraph_font(paragraph, bold=True)
    
    # 教育信息
    edu_table.cell(1, 0).text = resume_data['education']['period']
    edu_table.cell(1, 1).text = resume_data['education']['school']
    edu_table.cell(1, 2).text = resume_data['education']['major']
    edu_table.cell(1, 3).text = resume_data['education']['degree']
    
    # 設置教育信息行的字體
    for i in range(4):
        for paragraph in edu_table.cell(1, i).paragraphs:
            set_paragraph_font(paragraph)
    
    # 工作經(jīng)歷
    doc.add_heading('工作經(jīng)歷', level=1)
    for exp in resume_data['work_experience']:
        p = doc.add_paragraph()
        company_run = p.add_run(f"{exp['company']} | ")
        company_run.bold = True
        position_run = p.add_run(f"{exp['position']} | ")
        position_run.bold = True
        period_run = p.add_run(exp['period'])
        period_run.bold = True
        
        set_paragraph_font(p, bold=True)
        
        desc_para = doc.add_paragraph(exp['description'])
        set_paragraph_font(desc_para)
    
    # 技能特長
    doc.add_heading('技能特長', level=1)
    skills_para = doc.add_paragraph()
    for skill in resume_data['skills']:
        skills_para.add_run(f"? {skill}\n")
    set_paragraph_font(skills_para)
    
    # 自我評價
    doc.add_heading('自我評價', level=1)
    self_eval_para = doc.add_paragraph(resume_data['self_evaluation'])
    set_paragraph_font(self_eval_para)
    
    # 保存文檔
    doc.save(filename)
    print(f"已生成簡歷: {filename}")

def generate_sample_resumes(num=10):
    """
    生成指定數(shù)量的模擬簡歷
    
    Args:
        num (int): 簡歷數(shù)量，默認為10
    """
    fake = Faker('zh_CN')
    
    # 創(chuàng)建保存目錄 - 使用os.path.join處理路徑
    resume_dir = os.path.join('resumes')
    os.makedirs(resume_dir, exist_ok=True)
    
    # 職位列表
    jobs = ['軟件工程師', '數(shù)據(jù)分析師', '產(chǎn)品經(jīng)理', 'UI設計師', '市場營銷', '人力資源', '財務專員', '運營專員']
    
    # 技能列表
    skill_sets = {
        '軟件工程師': ['Python', 'Java', 'SQL', 'Linux', 'Git', 'Docker'],
        '數(shù)據(jù)分析師': ['Python', 'SQL', 'Excel', 'Tableau', '統(tǒng)計學', '機器學習'],
        '產(chǎn)品經(jīng)理': ['Axure', 'Visio', '項目管理', '需求分析', '用戶研究', 'PRD編寫'],
        'UI設計師': ['Photoshop', 'Sketch', 'Figma', 'Illustrator', '用戶體驗設計', '交互設計'],
        '市場營銷': ['市場分析', '營銷策劃', '社交媒體運營', '內容創(chuàng)作', '數(shù)據(jù)分析', '品牌管理'],
        '人力資源': ['招聘', '培訓', '績效管理', '員工關系', 'HR系統(tǒng)', '勞動法'],
        '財務專員': ['會計', '財務報表', '稅務', '成本控制', '財務分析', 'ERP系統(tǒng)'],
        '運營專員': ['內容運營', '用戶運營', '活動策劃', '數(shù)據(jù)分析', '社交媒體', 'SEO/SEM']
    }
    
    # 生成簡歷
    for i in range(num):
        # 隨機選擇一個職位
        job = random.choice(jobs)
        
        # 生成簡歷數(shù)據(jù)
        resume_data = {
            'name': fake.name(),
            'gender': random.choice(['男', '女']),
            'age': random.randint(22, 35),
            'phone': fake.phone_number(),
            'email': fake.email(),
            'address': fake.city(),
            'job_intention': job,
            'education': {
                'period': "-".join(sorted([fake.year(), fake.year()])),  # start year never after end year
                'school': fake.company() + "大學",
                'major': fake.job() + "專業(yè)",
                'degree': random.choice(['本科', '碩士', '博士'])
            },
            'work_experience': [],
            'skills': random.sample(skill_sets[job], random.randint(4, 6)),
            'self_evaluation': fake.paragraph(nb_sentences=3)
        }
        
        # 生成工作經(jīng)歷
        num_experiences = random.randint(1, 3)
        for j in range(num_experiences):
            start_year = 2020 - j * 2
            resume_data['work_experience'].append({
                'company': fake.company(),
                'position': job,
                'period': f"{start_year}-{start_year+2}",
                'description': fake.paragraph(nb_sentences=2)
            })
        
        # 生成文檔 - 使用os.path.join處理路徑
        safe_name = resume_data['name'].replace(' ', '_')
        filename = os.path.join(resume_dir, f"簡歷_{safe_name}_{job}.docx")
        generate_resume_docx(filename, resume_data)

if __name__ == "__main__":
    print("開始生成模擬簡歷...")
    generate_sample_resumes(10)
    print("簡歷生成完成！")

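Complete code, ready to copy and run

The resume Word documents are read from the resumes folder, and the generated resumes_data.csv is written directly to the working directory: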

import os
import re
import csv
from docx import Document

def extract_resume_info(filepath):
    """
    從Word簡歷中提取信息
    
    Args:
        filepath (str): Word文檔路徑
        
    Returns:
        dict: 提取的簡歷信息
    """
    try:
        doc = Document(filepath)
        resume_data = {
            'filename': os.path.basename(filepath),
            'name': '',
            'gender': '',
            'age': '',
            'phone': '',
            'email': '',
            'address': '',
            'job_intention': '',
            'education_period': '',
            'education_school': '',
            'education_major': '',
            'education_degree': '',
            'work_experience': '',
            'skills': '',
            'self_evaluation': ''
        }
        
        # 提取基本信息
        basic_info_extracted = False
        education_extracted = False
        in_work_experience = False
        in_skills = False
        in_self_evaluation = False
        
        work_experiences = []
        skills_list = []
        
        for i, paragraph in enumerate(doc.paragraphs):
            text = paragraph.text.strip()
            
            # 跳過空段落
            if not text:
                continue
                
            # 提取基本信息表格
            if text == '基本信息' and not basic_info_extracted:
                # 查找所有表格
                for table in doc.tables:
                    # 檢查表格是否包含基本信息
                    try:
                        if table.cell(0, 0).text == '姓名':
                            resume_data['name'] = table.cell(0, 1).text
                            resume_data['gender'] = table.cell(1, 1).text
                            resume_data['age'] = table.cell(2, 1).text
                            resume_data['phone'] = table.cell(3, 1).text
                            resume_data['email'] = table.cell(4, 1).text
                            resume_data['address'] = table.cell(5, 1).text
                            resume_data['job_intention'] = table.cell(0, 3).text
                            basic_info_extracted = True
                            break
                    except:
                        continue
                
                if not basic_info_extracted:
                    # 如果沒有找到表格，嘗試從段落中提取
                    extract_basic_info_from_text(doc, resume_data)
                    basic_info_extracted = True
            
            # 提取教育背景
            elif text == '教育背景' and not education_extracted:
                # 查找教育背景表格
                for table in doc.tables:
                    try:
                        if table.cell(0, 0).text == '時間' and '學校' in table.cell(0, 1).text:
                            resume_data['education_period'] = table.cell(1, 0).text
                            resume_data['education_school'] = table.cell(1, 1).text
                            resume_data['education_major'] = table.cell(1, 2).text
                            resume_data['education_degree'] = table.cell(1, 3).text
                            education_extracted = True
                            break
                    except:
                        continue
            
            # 提取工作經(jīng)歷
            elif text == '工作經(jīng)歷':
                in_work_experience = True
                continue
            elif in_work_experience and text and not text.startswith('技能特長') and not text.startswith('自我評價'):
                # 檢查是否是工作經(jīng)歷標題（包含公司、職位、時間，用 | 分隔）
                if ' | ' in text:
                    parts = text.split(' | ')
                    if len(parts) >= 3:
                        # 提取公司、職位和時間
                        company = parts[0].strip()
                        position = parts[1].strip()
                        period = parts[2].strip()
                        
                        work_experiences.append({
                            'company': company,
                            'position': position,
                            'period': period
                        })
                # 否則，可能是工作描述，但我們主要關注標題信息
            
            # 提取技能特長
            elif text == '技能特長':
                in_skills = True
                in_work_experience = False
                continue
            elif in_skills and text and not text.startswith('自我評價'):
                # Each skill is one bullet line; accept "•" as well as the "?" marker the generator writes
                lines = text.split('\n')
                for line in lines:
                    line = line.strip()
                    if line.startswith(('•', '?')):
                        skill = line[1:].strip()  # drop the bullet marker
                        if skill:
                            skills_list.append(skill)
            
            # 提取自我評價
            elif text == '自我評價':
                in_self_evaluation = True
                in_skills = False
                continue
            elif in_self_evaluation and text:
                resume_data['self_evaluation'] = text
                in_self_evaluation = False
        
        # 如果沒有提取到基本信息，嘗試其他方法
        if not basic_info_extracted:
            extract_basic_info_from_text(doc, resume_data)
        
        # 將工作經(jīng)歷和技能列表轉換為字符串
        if work_experiences:
            work_exp_str = " | ".join([
                f"{exp['company']} ({exp['position']}, {exp['period']})" 
                for exp in work_experiences
            ])
            resume_data['work_experience'] = work_exp_str
        
        if skills_list:
            resume_data['skills'] = "; ".join(skills_list)
        
        # 清理數(shù)據(jù)
        clean_resume_data(resume_data)
        
        return resume_data
        
    except Exception as e:
        print(f"提取簡歷信息時出錯 {filepath}: {e}")
        return None

def extract_basic_info_from_text(doc, resume_data):
    """
    從文檔文本中提取基本信息
    """
    for paragraph in doc.paragraphs:
        text = paragraph.text
        
        # 提取姓名
        if not resume_data['name'] and len(text) <= 4 and not any(keyword in text for keyword in ['基本信息', '教育背景', '工作經(jīng)歷']):
            resume_data['name'] = text
        
        # 提取電話
        phone_match = re.search(r'1[3-9]\d{9}', text)
        if phone_match and not resume_data['phone']:
            resume_data['phone'] = phone_match.group()
        
        # Extract the email address ("|" inside a character class is a literal pipe, so use [A-Za-z])
        email_match = re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
        if email_match and not resume_data['email']:
            resume_data['email'] = email_match.group()

def clean_resume_data(resume_data):
    """
    清理提取的簡歷數(shù)據(jù)
    """
    # 清理年齡，只保留數(shù)字
    if resume_data['age']:
        age_match = re.search(r'\d+', resume_data['age'])
        if age_match:
            resume_data['age'] = age_match.group()

def save_to_csv(resumes, csv_filename='resumes_data.csv'):
    """
    將簡歷數(shù)據(jù)保存為CSV文件
    
    Args:
        resumes (list): 簡歷數(shù)據(jù)列表
        csv_filename (str): CSV文件名
    """
    if not resumes:
        print("沒有簡歷數(shù)據(jù)可保存")
        return
    
    # 定義CSV列名
    fieldnames = [
        'filename', 'name', 'gender', 'age', 'phone', 'email', 'address', 
        'job_intention', 'education_period', 'education_school', 'education_major', 
        'education_degree', 'work_experience', 'skills', 'self_evaluation'
    ]
    
    with open(csv_filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        
        for resume in resumes:
            writer.writerow(resume)
    
    print(f"簡歷數(shù)據(jù)已保存到: {csv_filename}")

def analyze_resumes(directory='resumes'):
    """
    分析目錄中的所有簡歷
    
    Args:
        directory (str): 簡歷目錄路徑
    """
    if not os.path.exists(directory):
        print(f"目錄不存在: {directory}")
        return
    
    resumes = []
    
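    # Walk through all Word documents in the directory
    for filename in os.listdir(directory):
        # Match the extension case-insensitively and skip Word lock files ("~$...")
        if filename.lower().endswith('.docx') and not filename.startswith('~$'):
            filepath = os.path.join(directory, filename)
            print(f"正在處理: {filename}")
            
            resume_data = extract_resume_info(filepath)
            if resume_data:
                resumes.append(resume_data)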
    
    # 輸出分析結果
    print("\n" + "="*50)
    print("簡歷分析結果")
    print("="*50)
    
    # 統(tǒng)計基本信息
    print(f"共處理簡歷: {len(resumes)} 份")
    
    if not resumes:
        return
    
    # 按職位意向統(tǒng)計
    job_counts = {}
    for resume in resumes:
        job = resume.get('job_intention') or '未知'  # the key always exists, so also cover an empty value
        job_counts[job] = job_counts.get(job, 0) + 1
    
    print("\n職位意向分布:")
    for job, count in job_counts.items():
        print(f"  {job}: {count} 人")
    
    # 平均年齡
    ages = []
    for resume in resumes:
        if resume['age'] and resume['age'].isdigit():
            ages.append(int(resume['age']))
    if ages:
        avg_age = sum(ages) / len(ages)
        print(f"\n平均年齡: {avg_age:.1f} 歲")
    
    # 學歷分布
    degree_counts = {}
    for resume in resumes:
        degree = resume.get('education_degree') or '未知'  # the key always exists, so also cover an empty value
        degree_counts[degree] = degree_counts.get(degree, 0) + 1
    
    print("\n學歷分布:")
    for degree, count in degree_counts.items():
        print(f"  {degree}: {count} 人")
    
    # 工作經(jīng)歷統(tǒng)計
    work_exp_counts = {}
    for resume in resumes:
        work_exp = resume.get('work_experience', '')
        if work_exp:
            # 計算工作經(jīng)歷數(shù)量（通過 | 分隔符）
            count = work_exp.count('|') + 1
            work_exp_counts[count] = work_exp_counts.get(count, 0) + 1
    
    print("\n工作經(jīng)歷數(shù)量分布:")
    for count, freq in sorted(work_exp_counts.items()):
        print(f"  {count} 段經(jīng)歷: {freq} 人")
    
    # 技能統(tǒng)計
    skill_counts = {}
    for resume in resumes:
        skills_str = resume.get('skills', '')
        if skills_str:
            # 按分號分隔技能
            skills = [skill.strip() for skill in skills_str.split(';') if skill.strip()]
            for skill in skills:
                skill_counts[skill] = skill_counts.get(skill, 0) + 1
    
    # 取前10個最常用技能
    top_skills = sorted(skill_counts.items(), key=lambda x: x[1], reverse=True)[:10]
    print("\n最常用技能(前10):")
    for skill, count in top_skills:
        print(f"  {skill}: {count} 次")
    
    # 保存詳細數(shù)據(jù)到CSV文件
    save_to_csv(resumes, 'resumes_data.csv')
    
    print(f"\n詳細數(shù)據(jù)已保存到: resumes_data.csv")

if __name__ == "__main__":
    print("開始提取簡歷信息...")
    analyze_resumes()
    print("簡歷分析完成！")

The extracted data is saved to resumes_data.csv.

While running, the script also prints the following to the terminal:

開始提取簡歷信息...
正在處理: 簡歷_劉桂花_數(shù)據(jù)分析師.docx
正在處理: 簡歷_吳婷_數(shù)據(jù)分析師.docx
正在處理: 簡歷_吳建平_UI設計師.docx
正在處理: 簡歷_孫婷_軟件工程師.docx
正在處理: 簡歷_張麗麗_市場營銷.docx
正在處理: 簡歷_李建軍_運營專員.docx
正在處理: 簡歷_歐寧_數(shù)據(jù)分析師.docx
正在處理: 簡歷_王利_運營專員.docx
正在處理: 簡歷_羅歡_運營專員.docx
正在處理: 簡歷_韓巖_財務專員.docx

==================================================
簡歷分析結果
==================================================
共處理簡歷: 10 份

職位意向分布:
  數(shù)據(jù)分析師: 3 人
  UI設計師: 1 人
  軟件工程師: 1 人
  市場營銷: 1 人
  運營專員: 3 人
  財務專員: 1 人

平均年齡: 28.3 歲

學歷分布:
  博士: 4 人
  碩士: 3 人
  本科: 3 人

工作經(jīng)歷數(shù)量分布:
  1 段經(jīng)歷: 2 人
  2 段經(jīng)歷: 4 人
  3 段經(jīng)歷: 4 人

最常用技能(前10):
  數(shù)據(jù)分析: 4 次
  Excel: 3 次
  機器學習: 3 次
  SQL: 3 次
  Python: 3 次
  用戶運營: 3 次
  活動策劃: 3 次
  SEO/SEM: 3 次
  Tableau: 2 次
  統(tǒng)計學: 2 次
簡歷數(shù)據(jù)已保存到: resumes_data.csv

詳細數(shù)據(jù)已保存到: resumes_data.csv
簡歷分析完成！

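Core code

The python-docx package

Import: from docx import Document
Create a document object: doc = Document(filepath)
A Document object represents one Word document. Its paragraphs attribute is a list of Paragraph objects, and each paragraph's text attribute holds its text as a string.
Tables are reached separately: the Document object's tables attribute returns a list of Table objects. Use cell(row, column) (both indices start at 0) to get a cell; each cell's text attribute holds its text as a string.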

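The csv package

Create the writer object:

writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()

Here csvfile is the open CSV file object and fieldnames is the list of column names.
Write one row: writer.writerow(row), where row is a dict keyed by column name; its values are written to the CSV.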
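A small round trip shows these calls together. This is only a sketch with made-up rows and a temporary file; the column names are illustrative, while the utf-8-sig encoding and newline='' arguments are the same ones the script passes to open().

```python
import csv
import os
import tempfile

# Made-up rows; every dict uses the column names as keys
rows = [
    {"name": "Alice", "age": "28", "skills": "Python; SQL"},
    {"name": "Bob", "age": "31", "skills": "Excel"},
]
fieldnames = ["name", "age", "skills"]

csv_path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# utf-8-sig writes a byte order mark so that Excel detects UTF-8;
# newline="" lets the csv module control line endings itself
with open(csv_path, "w", newline="", encoding="utf-8-sig") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to check what was written
with open(csv_path, newline="", encoding="utf-8-sig") as csvfile:
    loaded = list(csv.DictReader(csvfile))

print(loaded)
```

writer.writerows(rows) is the batch form of calling writer.writerow(row) once per dict.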

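The os package

Get the file name from a full path: os.path.basename(filepath)
List the names of all entries in a folder: os.listdir(directory)
Combine a folder path and a file name into a file path: os.path.join(directory, filename) (this function accepts any number of parts, so you can join a parent folder, a subfolder, and a file name into one complete path)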
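The three functions can be exercised together in a temporary directory. The sketch below uses made-up file names; skipping names that start with "~$" (the lock files Word creates while a document is open) is an extra precaution added here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # os.path.join accepts several parts at once
    resume_dir = os.path.join(root, "resumes", "2025")
    os.makedirs(resume_dir, exist_ok=True)

    # Create a few empty files to list
    for name in ("a.docx", "b.DOCX", "~$a.docx", "notes.txt"):
        open(os.path.join(resume_dir, name), "w").close()

    # Keep only Word documents, ignoring case and Word lock files
    docx_files = sorted(
        name for name in os.listdir(resume_dir)
        if name.lower().endswith(".docx") and not name.startswith("~$")
    )

    first_path = os.path.join(resume_dir, docx_files[0])
    first_name = os.path.basename(first_path)

print(docx_files, first_name)
```

os.listdir returns names in arbitrary order, which is why the list is sorted before use.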

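The re package (regular expressions)

A detailed introduction to regular expressions is beyond the scope of this column.

re.search(pattern, text): searches text for the first substring that matches the regular expression pattern and returns a match object, or None if nothing matches; the match object's group() method returns the matched string

phone_match = re.search(r'1[3-9]\d{9}', text) matches the mobile phone number format of mainland China
Total length: 11 digits
First digit: must be 1
Second digit: must be 3-9 (this excludes special numbers that start with 12)
Remaining 9 digits: any digits
email_match = re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text) matches the common email format: username@domain.tld
Username: letters, digits, and the special characters . _ % + -
Domain: letters, digits, dots, and hyphens
Top-level domain: at least 2 letters (written as [A-Za-z], because a | inside a character class would match a literal pipe character)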
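Both patterns can be checked on a plain string. In the sketch below the function name find_contacts and the sample phone number and address are made up for illustration; the email pattern uses [A-Za-z] for the top-level domain.

```python
import re

PHONE_RE = re.compile(r'1[3-9]\d{9}')
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def find_contacts(text):
    """Return (phone, email) found in text, using '' for anything not found."""
    phone = PHONE_RE.search(text)
    email = EMAIL_RE.search(text)
    return (phone.group() if phone else '', email.group() if email else '')

contacts = find_contacts("Tel: 13812345678, mail: someone@example.com")
no_match = find_contacts("Tel: 12012345678")   # second digit 2 is rejected

print(contacts)
print(no_match)
```

The phone pattern has no boundary checks, so it can also match 11 digits inside a longer digit string; stricter input may need extra conditions around it.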

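Working with Python objects

String handling

strip(): removes whitespace such as spaces and newlines from the start and end of the text
startswith(str): whether the text starts with the given string
split(separator): splits the text on the separator. If no separator is given, it splits on runs of whitespace such as spaces and newlines
join(list): joins a list of strings into one string, using the string it is called on as the connector
isdigit(): returns True if the string is not empty and every character in it is a digit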
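As a short illustration of these methods working together, the sketch below splits a work-experience heading of the form "company | position | period", the format the resume script looks for. The function name parse_experience_header and the sample text are hypothetical.

```python
def parse_experience_header(line):
    """Split "company | position | period" into a dict, or return None if it does not fit."""
    parts = [part.strip() for part in line.strip().split(' | ')]
    if len(parts) < 3:
        return None
    company, position, period = parts[:3]
    return {'company': company, 'position': position, 'period': period}

header = parse_experience_header("  Acme Ltd | Data Analyst | 2018-2020\n")
not_a_header = parse_experience_header("Just a description.")

# join() is called on the connector string
summary = "; ".join(["Python", "SQL", "Excel"])

# isdigit() is True only when every character is a digit
age_checks = ("28".isdigit(), "28 years".isdigit())

print(header, not_a_header, summary, age_checks)
```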

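Container handling

Lists
1. Get an element directly by index ([index])
2. append(obj): add an object to the end of the list
Dictionaries

A dictionary can be created directly from key-value pairs:

{
    'company': company,
    'position': position,
    'period': period
}

Indexing by key reads or assigns a value, as in resume_data['self_evaluation']. Assigning to a key that does not exist creates the key-value pair, while reading a missing key raises KeyError
get(key, default_value) reads a value and returns the default (default_value) when the key does not exist
any(obj) returns True if any element of obj is true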
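The counting pattern used in the analysis step combines these pieces. The sketch below uses made-up records and an English 'Unknown' label; note that get() only falls back to its default when the key is missing, so an empty string needs the extra "or".

```python
# Made-up records for illustration
resumes = [
    {'name': 'A', 'degree': 'Master', 'skills': ['Python', 'SQL']},
    {'name': 'B', 'degree': '', 'skills': ['Excel']},
    {'name': 'C', 'degree': 'Master', 'skills': []},
]

degree_counts = {}
for resume in resumes:
    # Covers both a missing key and an empty value
    degree = resume.get('degree') or 'Unknown'
    # get(key, 0) starts the count at 0 the first time a degree is seen
    degree_counts[degree] = degree_counts.get(degree, 0) + 1

# any() stops at the first resume whose skills include Python
knows_python = any('Python' in resume['skills'] for resume in resumes)

print(degree_counts, knows_python)
```

The standard library's collections.Counter would do the same counting in one step; the explicit loop is shown because it matches the style of the script.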
