Automating Data Integration from Multiple Sources (CSV, Excel, SQL) in Python
1. Introduction: The Challenge and Value of Data Integration
1.1 The Data Silo Problem in Modern Analytics
In today's data-driven business environment, organizations routinely struggle with scattered data. Industry surveys suggest that a typical enterprise relies on 12-15 distinct data sources to support day-to-day operations and decision making. These sources live in different systems and different formats, creating so-called "data silos".
The main problems data silos cause are:
- Inconsistent information: the same metric may be computed differently in different systems
- Delayed decisions: manually consolidating data consumes large amounts of time and erodes decision timeliness
- Wasted resources: data engineers and analysts spend much of their time on extraction and transformation
- Opportunity cost: the organization cannot react quickly to market changes and misses business opportunities
1.2 The Business Value of Automated Data Integration
Automating the data integration pipeline can create significant business value.

Through automated data integration, an organization can achieve:
- Efficiency gains: a reported 70-80% reduction in manual data-preparation time
- Quality improvements: standardized processing reduces human error
- Cost savings: less reliance on specialized data engineering effort
- Scalability: new data sources and business requirements are easy to absorb
2. Data Source Characteristics
2.1 Comparing the Data Sources
Before building an automated integration system, we need a clear picture of each source type's characteristics:
| Source type | Strengths | Limitations | Typical use cases |
|---|---|---|---|
| CSV files | Simple and universal, human-readable, cross-platform | No type constraints, poor performance, no indexes | Data exchange, small datasets, ad-hoc analysis |
| Excel files | User-friendly, formula support, rich charting | File size limits, performance issues, version compatibility | Business reports, financial data, small-to-medium datasets |
| SQL databases | Transactions, data integrity, fast queries | Requires database expertise, higher deployment complexity | Business systems, large datasets, real-time applications |
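As a minimal illustration of the table above, pandas exposes a one-line reader for each source type. The sketch below is self-contained: it builds a tiny CSV in memory and an in-memory SQLite table instead of touching real files, and the Excel path in the comment is hypothetical.

```python
import io
import sqlite3
import pandas as pd

# CSV: plain text with no type constraints -- pandas infers dtypes on read
csv_text = "id,amount\n1,10.5\n2,20.0\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# SQL: typed, indexed storage queried with SQL (in-memory SQLite here)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.5), (2, 20.0)])
df_sql = pd.read_sql_query("SELECT * FROM orders", conn)

# Excel reads look the same:  pd.read_excel("report.xlsx", sheet_name=0)
# (needs the openpyxl engine; the path shown is hypothetical)

print(df_csv.equals(df_sql))  # both sources yield the same DataFrame
```

Despite the very different storage formats, all three readers converge on the same in-memory structure, which is what makes a uniform integration layer possible.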
2.2 Common Data Quality Problems
Each source type comes with its own characteristic quality problems:
# Data quality check framework
import pandas as pd

class DataQualityFramework:
    """Data quality assessment framework."""

    @staticmethod
    def identify_common_issues(data_source_type):
        """List common quality issues for each data source type."""
        issues = {
            'csv': [
                'Encoding problems (especially with Chinese text)',
                'Inconsistent date formats',
                'Many different representations of missing values',
                'Inconsistent column delimiters',
                'Header row not always on the first line'
            ],
            'excel': [
                'Merged cells',
                'Multi-sheet structures',
                'Formula results instead of raw values',
                'Hidden rows and columns',
                'Data validation rules'
            ],
            'sql': [
                'Character set mismatches',
                'NULL handling',
                'Foreign key constraint violations',
                'Data type conversion errors',
                'Time zone issues'
            ]
        }
        return issues.get(data_source_type, [])

    @staticmethod
    def check_consistency(df, data_source_type):
        """Placeholder consistency check; replace with domain-specific rules."""
        return 1.0

    @staticmethod
    def check_accuracy(df, data_source_type):
        """Placeholder accuracy check; replace with domain-specific rules."""
        return 1.0

    @staticmethod
    def calculate_data_quality_score(df, data_source_type):
        """Compute a weighted data quality score."""
        quality_metrics = {
            'completeness': 1 - (df.isnull().sum().sum() / (df.shape[0] * df.shape[1])),
            'consistency': DataQualityFramework.check_consistency(df, data_source_type),
            'accuracy': DataQualityFramework.check_accuracy(df, data_source_type),
            'uniqueness': 1 - (df.duplicated().sum() / len(df))
        }
        # Weighted average for the overall score
        weights = {'completeness': 0.3, 'consistency': 0.3, 'accuracy': 0.2, 'uniqueness': 0.2}
        total_score = sum(quality_metrics[metric] * weights[metric] for metric in quality_metrics)
        return total_score, quality_metrics

# Usage example
quality_checker = DataQualityFramework()
csv_issues = quality_checker.identify_common_issues('csv')
print("Common CSV issues:", csv_issues)
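The scoring idea above can also be seen standalone with plain pandas. The sketch below computes only the two metrics that need no domain knowledge, completeness and uniqueness; the weights are illustrative, not the framework's.

```python
import pandas as pd

# Toy frame: 1 missing cell out of 8, 1 duplicated row out of 4
df = pd.DataFrame({
    "id":     [1, 2, 2, 3],
    "amount": [10.0, 20.0, 20.0, None],
})

completeness = 1 - df.isnull().sum().sum() / df.size   # 7/8 non-null cells
uniqueness   = 1 - df.duplicated().sum() / len(df)     # 3/4 distinct rows

# Illustrative weights over the two generic metrics
weights = {"completeness": 0.6, "uniqueness": 0.4}
score = completeness * weights["completeness"] + uniqueness * weights["uniqueness"]
print(round(score, 3))  # -> 0.825
```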
3. Environment Setup and Dependency Management
3.1 Dependency Configuration
A robust integration system starts with carefully chosen, version-pinned dependencies:
# requirements.txt
"""
pandas>=1.5.0
numpy>=1.21.0
openpyxl>=3.0.0
xlrd>=2.0.0
sqlalchemy>=1.4.0
psycopg2-binary>=2.9.0
mysql-connector-python>=8.0.0
pyodbc>=4.0.0
python-dotenv>=0.19.0
loguru>=0.6.0
pydantic>=1.9.0
chardet>=4.0.0
"""
# Environment setup and validation script
import importlib
from typing import Dict, Tuple

class EnvironmentValidator:
    """Environment validator: checks that all required packages are installed."""

    # Keys are import names (what importlib sees), not PyPI package names
    REQUIRED_PACKAGES = {
        'pandas': ('data manipulation', '1.5.0'),
        'numpy': ('numerical computing', '1.21.0'),
        'openpyxl': ('Excel file handling', '3.0.0'),
        'sqlalchemy': ('database ORM', '1.4.0'),
        'psycopg2': ('PostgreSQL driver', '2.9.0'),
        'mysql.connector': ('MySQL driver', '8.0.0'),
        'pyodbc': ('ODBC driver', '4.0.0'),
        'dotenv': ('environment variable management', '0.19.0'),
        'loguru': ('logging', '0.6.0')
    }
    OPTIONAL_PACKAGES = {
        'pydantic': ('data validation', '1.9.0'),
        'requests': ('HTTP requests', '2.27.0'),
        'boto3': ('AWS services', '1.24.0')
    }
    # Import name -> PyPI name, for the pip install hint
    PIP_NAMES = {
        'dotenv': 'python-dotenv',
        'mysql.connector': 'mysql-connector-python',
        'psycopg2': 'psycopg2-binary'
    }

    @classmethod
    def validate_environment(cls) -> Tuple[bool, Dict[str, Tuple[bool, str]]]:
        """Validate the environment dependencies."""
        results = {}
        all_passed = True
        print("=" * 60)
        print("Dependency check")
        print("=" * 60)
        # Required packages
        print("\nRequired packages:")
        for package, (description, min_version) in cls.REQUIRED_PACKAGES.items():
            installed, version = cls._check_package(package, min_version)
            status = "OK " if installed else "ERR"
            results[package] = (installed, version)
            if not installed:
                all_passed = False
            print(f"{status} {package:20} {description:30} required: {min_version:8} installed: {version}")
        # Optional packages
        print("\nOptional packages:")
        for package, (description, min_version) in cls.OPTIONAL_PACKAGES.items():
            installed, version = cls._check_package(package, min_version)
            status = "OK " if installed else "-- "
            results[package] = (installed, version)
            print(f"{status} {package:20} {description:30} required: {min_version:8} installed: {version}")
        print(f"\nResult: {'all dependencies satisfied' if all_passed else 'required dependencies missing'}")
        return all_passed, results

    @staticmethod
    def _check_package(package_name: str, min_version: str) -> Tuple[bool, str]:
        """Check whether a single package is installed and new enough."""
        try:
            module = importlib.import_module(package_name)
            installed_version = getattr(module, '__version__', 'unknown')
            # Naive version comparison (fine for plain X.Y.Z version strings)
            if installed_version != 'unknown':
                installed_parts = list(map(int, installed_version.split('.')[:3]))
                min_parts = list(map(int, min_version.split('.')[:3]))
                is_compatible = installed_parts >= min_parts
            else:
                is_compatible = True
            return is_compatible, installed_version
        except ImportError:
            return False, "not installed"

# Run the environment check
if __name__ == "__main__":
    env_ok, package_status = EnvironmentValidator.validate_environment()
    if not env_ok:
        print("\nPlease install the missing packages:")
        for package, (installed, version) in package_status.items():
            if not installed and package in EnvironmentValidator.REQUIRED_PACKAGES:
                pip_name = EnvironmentValidator.PIP_NAMES.get(package, package)
                print(f"  pip install {pip_name}>={EnvironmentValidator.REQUIRED_PACKAGES[package][1]}")
3.2 Configuration Management
import json
import os
from typing import Any, Dict

from dotenv import load_dotenv

class ConfigManager:
    """Configuration manager: a single place for all data source settings."""

    def __init__(self, config_path: str = None):
        self.config_path = config_path or 'config.json'
        self._load_configuration()

    def _load_configuration(self) -> None:
        """Load the config file and environment variables."""
        # Environment variables (from a .env file, if present)
        load_dotenv()
        # JSON config file
        if os.path.exists(self.config_path):
            with open(self.config_path, 'r', encoding='utf-8') as f:
                self.config = json.load(f)
        else:
            self.config = self._create_default_config()

    def _create_default_config(self) -> Dict[str, Any]:
        """Create and persist a default configuration."""
        default_config = {
            "data_sources": {
                "csv": {
                    "default_encoding": "utf-8",
                    "fallback_encodings": ["gbk", "latin1"],
                    "chunk_size": 10000
                },
                "excel": {
                    "engine": "openpyxl",
                    "na_values": ["", "NULL", "N/A", "null"],
                    "keep_default_na": True
                },
                "database": {
                    "timeout": 30,
                    "pool_size": 5,
                    "max_overflow": 10
                }
            },
            "processing": {
                "max_workers": 4,
                "chunk_size": 10000,
                "temp_directory": "./temp"
            },
            "logging": {
                "level": "INFO",
                "format": "{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}",
                "rotation": "10 MB"
            }
        }
        # Persist the defaults
        with open(self.config_path, 'w', encoding='utf-8') as f:
            json.dump(default_config, f, indent=2, ensure_ascii=False)
        print(f"Created default config file: {self.config_path}")
        return default_config

    def get_database_config(self, db_alias: str) -> Dict[str, str]:
        """Fetch database credentials, preferring environment variables."""
        db_config = {
            'host': os.getenv(f'{db_alias.upper()}_HOST'),
            'port': os.getenv(f'{db_alias.upper()}_PORT'),
            'database': os.getenv(f'{db_alias.upper()}_DATABASE'),
            'username': os.getenv(f'{db_alias.upper()}_USERNAME'),
            'password': os.getenv(f'{db_alias.upper()}_PASSWORD')
        }
        # Fail fast on incomplete configuration
        missing = [key for key, value in db_config.items() if not value]
        if missing:
            raise ValueError(f"Incomplete config for database {db_alias}; missing: {missing}")
        return db_config

    def get_file_config(self, file_type: str) -> Dict[str, Any]:
        """Fetch file-handling settings for a source type."""
        return self.config['data_sources'].get(file_type, {})

    def update_config(self, section: str, updates: Dict[str, Any]) -> None:
        """Update a config section and persist the change."""
        if section in self.config:
            self.config[section].update(updates)
        else:
            self.config[section] = updates
        # Save the update
        with open(self.config_path, 'w', encoding='utf-8') as f:
            json.dump(self.config, f, indent=2, ensure_ascii=False)

# Initialize the configuration manager
config_manager = ConfigManager()
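To show how get_database_config expects its environment variables to be named, here is a minimal standalone sketch. The alias `sales` and all credential values are made up for illustration; a real deployment would put them in a .env file rather than setting them in code.

```python
import os

# Simulate a .env load by setting the variables directly (illustrative values)
os.environ.update({
    "SALES_HOST": "localhost",
    "SALES_PORT": "5432",
    "SALES_DATABASE": "sales",
    "SALES_USERNAME": "etl_user",
    "SALES_PASSWORD": "secret",
})

def read_db_config(db_alias: str) -> dict:
    """Mirror ConfigManager.get_database_config: read <ALIAS>_<FIELD> variables."""
    fields = ["host", "port", "database", "username", "password"]
    config = {f: os.getenv(f"{db_alias.upper()}_{f.upper()}") for f in fields}
    missing = [k for k, v in config.items() if not v]
    if missing:
        raise ValueError(f"Incomplete config for {db_alias}: missing {missing}")
    return config

print(read_db_config("sales")["host"])  # -> localhost
```

An alias with no matching variables fails fast with a ValueError listing the missing fields, which is much easier to debug than a connection timeout later.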
4. Core Data Readers
The Base Reader Class
import pandas as pd
import numpy as np
from typing import Union, List, Dict, Any, Optional
from pathlib import Path

import chardet
from loguru import logger

class BaseDataReader:
    """Base data reader: shared functionality for all source types."""

    def __init__(self, config: ConfigManager):
        self.config = config
        self._setup_logging()

    def _setup_logging(self) -> None:
        """Configure logging."""
        # "logging" is a top-level config section, not a data source entry
        logging_config = self.config.config.get('logging', {})
        logger.remove()  # remove the default handler
        logger.add(
            "logs/data_integration_{time:YYYY-MM-DD}.log",
            level=logging_config.get('level', 'INFO'),
            format=logging_config.get('format', '{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}'),
            rotation=logging_config.get('rotation', '10 MB'),
            retention=logging_config.get('retention', '30 days')
        )

    def detect_encoding(self, file_path: str) -> str:
        """Detect a file's character encoding."""
        try:
            with open(file_path, 'rb') as f:
                raw_data = f.read(10000)  # sample the first 10,000 bytes
            result = chardet.detect(raw_data)
            encoding = result['encoding']
            confidence = result['confidence']
            logger.info(f"Detected encoding: {encoding} (confidence: {confidence:.2f})")
            return encoding if confidence > 0.7 else 'utf-8'
        except Exception as e:
            logger.warning(f"Encoding detection failed: {e}; falling back to utf-8")
            return 'utf-8'

    def validate_dataframe(self, df: pd.DataFrame, source_name: str) -> bool:
        """Validate the basic integrity of a DataFrame."""
        checks = {
            'has rows': len(df) > 0,
            'not empty': not df.empty,
            'unique column names': len(df.columns) == len(set(df.columns)),
            'no all-null columns': not df.isnull().all().any()
        }
        failed_checks = [check for check, passed in checks.items() if not passed]
        if failed_checks:
            logger.warning(f"Validation failed [{source_name}]: {failed_checks}")
            return False
        logger.info(f"Validation passed [{source_name}]: shape {df.shape}")
        return True

    def handle_read_error(self, error: Exception, source: str) -> pd.DataFrame:
        """Unified error handling."""
        logger.error(f"Failed to read data source [{source}]: {error}")
        # Return an empty DataFrame instead of raising
        return pd.DataFrame()
class CSVDataReader(BaseDataReader):
    """CSV file reader."""

    def read(self, file_path: str, **kwargs) -> pd.DataFrame:
        """Read a CSV file, trying several encodings if necessary."""
        try:
            csv_config = self.config.get_file_config('csv')
            # Detect the encoding unless the caller supplied one
            encoding = kwargs.pop('encoding', None)
            if not encoding:
                encoding = self.detect_encoding(file_path)
            # Try the detected encoding first, then the configured fallbacks
            encodings_to_try = [encoding] + csv_config.get('fallback_encodings', [])
            for enc in encodings_to_try:
                try:
                    logger.info(f"Trying encoding {enc} for CSV file: {file_path}")
                    df = pd.read_csv(
                        file_path,
                        encoding=enc,
                        na_values=csv_config.get('na_values', ['', 'NULL']),
                        keep_default_na=True,
                        **kwargs
                    )
                    if self.validate_dataframe(df, f"CSV: {file_path}"):
                        return df
                except UnicodeDecodeError:
                    logger.warning(f"Encoding {enc} failed; trying the next one")
                    continue
                except Exception as e:
                    logger.error(f"Failed to read CSV file: {e}")
                    break
            return self.handle_read_error(Exception("all encodings failed"), f"CSV: {file_path}")
        except Exception as e:
            return self.handle_read_error(e, f"CSV: {file_path}")

    def read_chunked(self, file_path: str, chunk_size: int = None, **kwargs) -> pd.DataFrame:
        """Read a large CSV file in chunks."""
        csv_config = self.config.get_file_config('csv')
        chunk_size = chunk_size or csv_config.get('chunk_size', 10000)
        chunks = []
        try:
            for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size, **kwargs)):
                chunks.append(chunk)
                logger.info(f"Read chunk {i + 1}, shape: {chunk.shape}")
                # Report progress every 10 chunks
                if (i + 1) % 10 == 0:
                    total_rows = sum(len(c) for c in chunks)
                    logger.info(f"{total_rows} rows read so far")
            if chunks:
                result_df = pd.concat(chunks, ignore_index=True)
                logger.info(f"CSV read complete; total rows: {len(result_df)}")
                return result_df
            else:
                return pd.DataFrame()
        except Exception as e:
            return self.handle_read_error(e, f"CSV(chunked): {file_path}")
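The chunked-reading pattern used here boils down to pandas' `chunksize` parameter. A minimal self-contained sketch, using a synthetic in-memory CSV and an illustrative chunk size:

```python
import io
import pandas as pd

# 25 rows of synthetic CSV data
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(25))

# Stream the data in chunks of 10 rows, then stitch the pieces together
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    chunks.append(chunk)          # each chunk is an ordinary DataFrame

result = pd.concat(chunks, ignore_index=True)
print(len(chunks), len(result))   # -> 3 25
```

Because each chunk is processed independently, peak memory is bounded by the chunk size rather than the file size, which is the whole point for multi-gigabyte CSVs.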
class ExcelDataReader(BaseDataReader):
    """Excel file reader."""

    def read(self, file_path: str, sheet_name: Union[str, int, List, None] = 0,
             **kwargs) -> Union[pd.DataFrame, Dict[str, pd.DataFrame]]:
        """Read an Excel file; pass sheet_name=None to read every sheet."""
        try:
            excel_config = self.config.get_file_config('excel')
            # Enumerate the workbook's sheets
            excel_file = pd.ExcelFile(file_path)
            sheet_names = excel_file.sheet_names
            logger.info(f"Workbook contains sheets: {sheet_names}")
            if sheet_name is None:
                # Read every sheet into a dict of DataFrames
                dfs = {}
                for sheet in sheet_names:
                    df = self._read_single_sheet(file_path, sheet, excel_config, **kwargs)
                    if not df.empty:
                        dfs[sheet] = df
                return dfs
            else:
                # Read the requested sheet
                return self._read_single_sheet(file_path, sheet_name, excel_config, **kwargs)
        except Exception as e:
            return self.handle_read_error(e, f"Excel: {file_path}")

    def _read_single_sheet(self, file_path: str, sheet_name: str, config: Dict, **kwargs) -> pd.DataFrame:
        """Read a single worksheet."""
        try:
            df = pd.read_excel(
                file_path,
                sheet_name=sheet_name,
                engine=config.get('engine', 'openpyxl'),
                na_values=config.get('na_values', ['', 'NULL', 'N/A']),
                keep_default_na=config.get('keep_default_na', True),
                **kwargs
            )
            if self.validate_dataframe(df, f"Excel[{sheet_name}]: {file_path}"):
                # Clean up Excel-specific artifacts
                df = self._clean_excel_data(df)
                return df
            else:
                return pd.DataFrame()
        except Exception as e:
            logger.error(f"Failed to read worksheet [{sheet_name}]: {e}")
            return pd.DataFrame()

    def _clean_excel_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Clean up problems specific to Excel data."""
        # Drop rows and columns that are entirely empty
        df = df.dropna(how='all').dropna(axis=1, how='all')
        # Reset the index after dropping rows
        df = df.reset_index(drop=True)
        # Strip whitespace and newlines from column names
        df.columns = [str(col).strip().replace('\n', ' ') for col in df.columns]
        return df
class DatabaseReader(BaseDataReader):
    """Database reader."""

    def __init__(self, config: ConfigManager):
        super().__init__(config)
        self.connections = {}

    def get_connection(self, db_alias: str):
        """Get (or create and cache) a database connection."""
        if db_alias in self.connections:
            return self.connections[db_alias]
        try:
            db_config = self.config.get_database_config(db_alias)
            db_type = db_alias.lower()
            if db_type in ['postgresql', 'postgres']:
                import psycopg2
                conn = psycopg2.connect(
                    host=db_config['host'],
                    port=db_config['port'],
                    database=db_config['database'],
                    user=db_config['username'],
                    password=db_config['password']
                )
            elif db_type == 'mysql':
                import mysql.connector
                conn = mysql.connector.connect(
                    host=db_config['host'],
                    port=db_config['port'],
                    database=db_config['database'],
                    user=db_config['username'],
                    password=db_config['password']
                )
            else:
                # Fall back to SQLAlchemy as a generic connector
                from sqlalchemy import create_engine
                connection_string = self._build_connection_string(db_alias, db_config)
                engine = create_engine(connection_string)
                conn = engine.connect()
            self.connections[db_alias] = conn
            logger.info(f"Database connection established: {db_alias}")
            return conn
        except Exception as e:
            logger.error(f"Database connection failed [{db_alias}]: {e}")
            raise

    def _build_connection_string(self, db_alias: str, db_config: Dict) -> str:
        """Build a SQLAlchemy connection string."""
        db_type = db_alias.lower()
        if db_type in ['postgresql', 'postgres']:
            return f"postgresql://{db_config['username']}:{db_config['password']}@{db_config['host']}:{db_config['port']}/{db_config['database']}"
        elif db_type == 'mysql':
            return f"mysql+mysqlconnector://{db_config['username']}:{db_config['password']}@{db_config['host']}:{db_config['port']}/{db_config['database']}"
        else:
            raise ValueError(f"Unsupported database type: {db_type}")

    def read_query(self, db_alias: str, query: str, params: Dict = None) -> pd.DataFrame:
        """Run a SQL query and return the result as a DataFrame."""
        try:
            conn = self.get_connection(db_alias)
            logger.info(f"Running SQL query [{db_alias}]: {query[:100]}...")
            df = pd.read_sql_query(query, conn, params=params)
            if self.validate_dataframe(df, f"DB[{db_alias}]"):
                return df
            else:
                return pd.DataFrame()
        except Exception as e:
            return self.handle_read_error(e, f"DB[{db_alias}]")

    def read_table(self, db_alias: str, table_name: str, columns: List[str] = None,
                   where_clause: str = None, limit: int = None) -> pd.DataFrame:
        """Read a whole table, or a subset of its columns and rows."""
        try:
            # Build the query (identifiers are interpolated directly, so only
            # pass trusted table and column names here)
            column_list = '*' if not columns else ', '.join(columns)
            query = f"SELECT {column_list} FROM {table_name}"
            if where_clause:
                query += f" WHERE {where_clause}"
            if limit:
                query += f" LIMIT {limit}"
            return self.read_query(db_alias, query)
        except Exception as e:
            return self.handle_read_error(e, f"DB[{db_alias}.{table_name}]")

    def close_connections(self):
        """Close all open database connections."""
        for alias, conn in self.connections.items():
            try:
                conn.close()
                logger.info(f"Database connection closed: {alias}")
            except Exception as e:
                logger.error(f"Failed to close connection [{alias}]: {e}")
        self.connections.clear()
5. The Integration and Transformation Engine
A Source-Agnostic Integration Engine
import hashlib
import json
from datetime import datetime
from typing import Dict, List, Any, Callable

class DataIntegrationEngine:
    """Data integration engine: uniform handling of multi-source data."""

    def __init__(self, config: ConfigManager):
        self.config = config
        self.readers = {
            'csv': CSVDataReader(config),
            'excel': ExcelDataReader(config),
            'database': DatabaseReader(config)
        }
        # In-memory data cache and schema registry
        self.data_cache = {}
        self.schema_registry = {}

    def load_data(self, data_sources: List[Dict[str, Any]]) -> Dict[str, pd.DataFrame]:
        """Load several data sources."""
        loaded_data = {}
        for source in data_sources:
            source_type = source.get('type')
            source_name = source.get('name', f"source_{len(loaded_data)}")
            if source_type not in self.readers:
                logger.warning(f"Unsupported source type: {source_type}; skipping {source_name}")
                continue
            try:
                # Generate a cache key from the source definition
                cache_key = self._generate_cache_key(source)
                # Serve from cache if possible
                if cache_key in self.data_cache:
                    logger.info(f"Using cached data: {source_name}")
                    loaded_data[source_name] = self.data_cache[cache_key]
                    continue
                # Read the data
                reader = self.readers[source_type]
                df = self._read_with_reader(reader, source)
                if not df.empty:
                    # Preprocess
                    df = self._preprocess_data(df, source)
                    # Register the schema
                    self._register_schema(source_name, df)
                    # Cache the result
                    self.data_cache[cache_key] = df
                    loaded_data[source_name] = df
                    logger.info(f"Loaded source: {source_name}, shape: {df.shape}")
                else:
                    logger.warning(f"Source is empty: {source_name}")
            except Exception as e:
                logger.error(f"Failed to load source [{source_name}]: {e}")
                continue
        return loaded_data

    def _read_with_reader(self, reader: BaseDataReader, source: Dict) -> pd.DataFrame:
        """Dispatch to the reader that matches the source type."""
        source_type = source.get('type')
        parameters = source.get('parameters', {})
        if source_type == 'csv':
            file_path = source['path']
            return reader.read(file_path, **parameters)
        elif source_type == 'excel':
            file_path = source['path']
            sheet_name = parameters.pop('sheet_name', 0)
            return reader.read(file_path, sheet_name=sheet_name, **parameters)
        elif source_type == 'database':
            db_alias = source['connection']
            query = source.get('query')
            table_name = source.get('table')
            if query:
                return reader.read_query(db_alias, query, parameters.get('params'))
            elif table_name:
                return reader.read_table(
                    db_alias,
                    table_name,
                    columns=parameters.get('columns'),
                    where_clause=parameters.get('where'),
                    limit=parameters.get('limit')
                )
            else:
                raise ValueError("A database source must provide either 'query' or 'table'")
        else:
            raise ValueError(f"Unknown source type: {source_type}")

    def _generate_cache_key(self, source: Dict) -> str:
        """Generate a cache key from the source definition."""
        # default=str keeps non-JSON values (e.g. a preprocess callable) from breaking serialization
        source_str = json.dumps(source, sort_keys=True, default=str)
        return hashlib.md5(source_str.encode()).hexdigest()
    def _preprocess_data(self, df: pd.DataFrame, source: Dict) -> pd.DataFrame:
        """Preprocess loaded data."""
        # Apply a custom preprocessing function, if one was supplied
        preprocess_func = source.get('preprocess')
        if preprocess_func and callable(preprocess_func):
            df = preprocess_func(df)
        # Standard preprocessing steps
        df = self._standard_preprocessing(df, source)
        return df

    def _standard_preprocessing(self, df: pd.DataFrame, source: Dict) -> pd.DataFrame:
        """Standard preprocessing pipeline."""
        # 1. Clean column names
        df.columns = [self._clean_column_name(col) for col in df.columns]
        # 2. Apply declared data types
        type_mapping = source.get('type_mapping', {})
        for col, dtype in type_mapping.items():
            if col in df.columns:
                try:
                    df[col] = self._convert_column_type(df[col], dtype)
                except Exception as e:
                    logger.warning(f"Type conversion failed [{col} -> {dtype}]: {e}")
        # 3. Tag each row with its source and load time
        df['_data_source'] = source.get('name', 'unknown')
        df['_load_timestamp'] = datetime.now()
        return df

    def _clean_column_name(self, column_name: str) -> str:
        """Normalize a column name."""
        import re
        # Keep only letters, digits, and underscores
        cleaned = re.sub(r'[^\w]', '_', str(column_name))
        # Collapse runs of underscores
        cleaned = re.sub(r'_+', '_', cleaned)
        # Trim leading and trailing underscores
        cleaned = cleaned.strip('_')
        # Lowercase for uniformity
        return cleaned.lower()

    def _convert_column_type(self, series: pd.Series, target_type: str) -> pd.Series:
        """Convert a column to a declared type."""
        type_handlers = {
            'string': lambda s: s.astype(str),
            'integer': lambda s: pd.to_numeric(s, errors='coerce').fillna(0).astype(int),
            'float': lambda s: pd.to_numeric(s, errors='coerce'),
            'datetime': lambda s: pd.to_datetime(s, errors='coerce'),
            'boolean': lambda s: s.astype(str).str.lower().isin(['true', '1', 'yes', 'y']).astype(bool)
        }
        handler = type_handlers.get(target_type.lower())
        if handler:
            return handler(series)
        else:
            return series

    def _register_schema(self, source_name: str, df: pd.DataFrame):
        """Register a source's schema."""
        schema = {
            'columns': list(df.columns),
            'dtypes': {col: str(dtype) for col, dtype in df.dtypes.items()},
            'row_count': len(df),
            'null_counts': df.isnull().sum().to_dict(),
            'sample_data': df.head(3).to_dict('records')
        }
        self.schema_registry[source_name] = schema
        logger.info(f"Schema registered: {source_name}")
    def merge_data(self, data_dict: Dict[str, pd.DataFrame],
                   merge_strategy: Dict[str, Any]) -> pd.DataFrame:
        """Merge several data sources."""
        if not data_dict:
            return pd.DataFrame()
        data_frames = list(data_dict.values())
        merge_type = merge_strategy.get('type', 'concat')
        try:
            if merge_type == 'concat':
                result = self._concat_dataframes(data_frames, merge_strategy)
            elif merge_type == 'join':
                result = self._join_dataframes(data_dict, merge_strategy)
            elif merge_type == 'union':
                result = self._union_dataframes(data_frames, merge_strategy)
            else:
                raise ValueError(f"Unsupported merge type: {merge_type}")
            logger.info(f"Merge complete; final shape: {result.shape}")
            return result
        except Exception as e:
            logger.error(f"Merge failed: {e}")
            return pd.DataFrame()

    def _concat_dataframes(self, data_frames: List[pd.DataFrame], strategy: Dict) -> pd.DataFrame:
        """Stack DataFrames vertically."""
        # Align columns across all frames
        if strategy.get('align_columns', True):
            all_columns = set()
            for df in data_frames:
                all_columns.update(df.columns)
            # Sort for a deterministic column order
            ordered_columns = sorted(all_columns)
            aligned_dfs = []
            for df in data_frames:
                df = df.copy()  # avoid mutating the caller's frames
                # Add any missing columns
                for col in ordered_columns:
                    if col not in df.columns:
                        df[col] = None
                # Reorder columns consistently
                aligned_dfs.append(df[ordered_columns])
            data_frames = aligned_dfs
        return pd.concat(data_frames, ignore_index=True, sort=False)

    def _join_dataframes(self, data_dict: Dict[str, pd.DataFrame], strategy: Dict) -> pd.DataFrame:
        """Join DataFrames on key columns."""
        join_keys = strategy.get('keys', [])
        join_type = strategy.get('join_type', 'inner')
        if not data_dict:
            return pd.DataFrame()
        data_frames = list(data_dict.values())
        result = data_frames[0]
        for i in range(1, len(data_frames)):
            result = result.merge(
                data_frames[i],
                on=join_keys,
                how=join_type,
                suffixes=(f'_{i-1}', f'_{i}')
            )
        return result

    def _union_dataframes(self, data_frames: List[pd.DataFrame], strategy: Dict) -> pd.DataFrame:
        """Union DataFrames over their common columns."""
        common_columns = set.intersection(*[set(df.columns) for df in data_frames])
        if not common_columns:
            logger.warning("No common columns; cannot perform a union")
            return pd.DataFrame()
        # Keep only the shared columns, then de-duplicate
        union_dfs = [df[sorted(common_columns)] for df in data_frames]
        return pd.concat(union_dfs, ignore_index=True).drop_duplicates()

    def clear_cache(self):
        """Clear the data cache."""
        self.data_cache.clear()
        logger.info("Data cache cleared")
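The three merge strategies can be seen in miniature with plain pandas. The toy frames and column names below are purely illustrative:

```python
import pandas as pd

sales = pd.DataFrame({"order_id": [1, 2], "amount": [100, 250]})
customers = pd.DataFrame({"order_id": [1, 2], "customer": ["Ann", "Bo"]})
extra = pd.DataFrame({"order_id": [3], "region": ["EU"]})

# concat: stack rows, aligning on the union of columns (missing cells -> NaN)
stacked = pd.concat([sales, extra], ignore_index=True, sort=False)

# join: merge on shared key columns
joined = sales.merge(customers, on="order_id", how="inner")

# union: keep only the common columns, then stack and de-duplicate
common = list(set(sales.columns) & set(extra.columns))
union = pd.concat([sales[common], extra[common]], ignore_index=True).drop_duplicates()

print(stacked.shape, joined.shape, union.shape)  # -> (3, 3) (2, 3) (3, 1)
```

concat widens the schema, join enriches rows with new columns, and union narrows to the shared schema; picking the right one depends on whether the sources describe the same entities or disjoint ones.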
6. Data Quality Monitoring and Validation
A Comprehensive Quality Framework
class DataQualityMonitor:
    """Data quality monitor."""

    def __init__(self):
        self.quality_metrics = {}
        self.validation_rules = {}

    def add_validation_rule(self, rule_name: str, rule_func: Callable,
                            description: str = "") -> None:
        """Register a data validation rule."""
        self.validation_rules[rule_name] = {
            'function': rule_func,
            'description': description
        }
        logger.info(f"Validation rule added: {rule_name}")

    def validate_dataset(self, df: pd.DataFrame, dataset_name: str) -> Dict[str, Any]:
        """Assess a dataset's quality."""
        validation_results = {
            'dataset_name': dataset_name,
            'timestamp': datetime.now(),
            'basic_stats': self._get_basic_stats(df),
            'quality_metrics': self._calculate_quality_metrics(df),
            'rule_violations': self._check_validation_rules(df),
            'data_issues': self._detect_data_issues(df)
        }
        # Overall quality score
        validation_results['quality_score'] = self._calculate_overall_score(
            validation_results['quality_metrics'],
            validation_results['rule_violations']
        )
        self.quality_metrics[dataset_name] = validation_results
        return validation_results

    def _get_basic_stats(self, df: pd.DataFrame) -> Dict[str, Any]:
        """Collect basic statistics."""
        return {
            'row_count': len(df),
            'column_count': len(df.columns),
            'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024**2,
            'data_types': {col: str(dtype) for col, dtype in df.dtypes.items()}
        }

    def _calculate_quality_metrics(self, df: pd.DataFrame) -> Dict[str, float]:
        """Compute the quality metrics."""
        total_cells = df.shape[0] * df.shape[1]
        if total_cells == 0:
            return {
                'completeness': 0.0,
                'consistency': 0.0,
                'accuracy': 0.0,
                'uniqueness': 0.0,
                'timeliness': 1.0
            }
        # Completeness: share of non-null cells
        completeness = 1 - (df.isnull().sum().sum() / total_cells)
        # Consistency: data type consistency
        type_consistency = self._check_type_consistency(df)
        # Accuracy: heuristic estimate (ideally replaced by business rules)
        accuracy = self._estimate_accuracy(df)
        # Uniqueness: share of non-duplicated rows
        uniqueness = 1 - (df.duplicated().sum() / len(df))
        # Timeliness: based on timestamp columns, if any exist
        timeliness = self._check_timeliness(df)
        return {
            'completeness': completeness,
            'consistency': type_consistency,
            'accuracy': accuracy,
            'uniqueness': uniqueness,
            'timeliness': timeliness
        }
    def _check_type_consistency(self, df: pd.DataFrame) -> float:
        """Check how consistently typed each column is."""
        if len(df.columns) == 0:
            return 1.0
        consistent_columns = 0
        for col in df.columns:
            try:
                # Column parses uniformly as numeric
                pd.to_numeric(df[col], errors='raise')
                consistent_columns += 1
            except (ValueError, TypeError):
                try:
                    # Column parses uniformly as datetime
                    pd.to_datetime(df[col], errors='raise')
                    consistent_columns += 1
                except (ValueError, TypeError):
                    # Otherwise: consistent only if the non-null values share one Python type
                    if df[col].dropna().map(type).nunique() <= 1:
                        consistent_columns += 1
        return consistent_columns / len(df.columns)

    def _estimate_accuracy(self, df: pd.DataFrame) -> float:
        """Estimate data accuracy with simple heuristics."""
        accuracy_indicators = []
        # Sanity-check numeric columns
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            std = df[col].std()
            if df[col].notna().any() and std and std > 0:
                # Flag outliers beyond 3 standard deviations
                z_scores = np.abs((df[col] - df[col].mean()) / std)
                outlier_ratio = (z_scores > 3).sum() / len(df)
                accuracy_indicators.append(1 - outlier_ratio)
        # Sanity-check categorical columns (values within a plausible range)
        categorical_cols = df.select_dtypes(include=['object']).columns
        for col in categorical_cols[:5]:  # only the first 5 categorical columns
            if df[col].notna().any():
                # Rough check: the most common value should carry some weight
                value_counts = df[col].value_counts()
                if len(value_counts) > 0:
                    top_ratio = value_counts.iloc[0] / len(df)
                    accuracy_indicators.append(min(top_ratio * 2, 1.0))
        return np.mean(accuracy_indicators) if accuracy_indicators else 0.8

    def _check_timeliness(self, df: pd.DataFrame) -> float:
        """Check data freshness."""
        # Find candidate timestamp columns by name
        time_columns = []
        for col in df.columns:
            col_lower = col.lower()
            if any(time_keyword in col_lower for time_keyword in
                   ['time', 'date', 'timestamp', 'created', 'updated']):
                time_columns.append(col)
        if not time_columns:
            return 1.0  # no time columns: assume fresh
        # Check the most recent timestamp in each candidate column
        latest_times = []
        for col in time_columns:
            try:
                time_series = pd.to_datetime(df[col], errors='coerce')
                if time_series.notna().any():
                    latest_time = time_series.max()
                    if pd.notna(latest_time):
                        days_ago = (datetime.now() - latest_time).days
                        # Data within 30 days counts as fresh
                        timeliness = max(0, 1 - (days_ago / 30))
                        latest_times.append(timeliness)
            except (ValueError, TypeError):
                continue
        return np.mean(latest_times) if latest_times else 0.5
    def _check_validation_rules(self, df: pd.DataFrame) -> Dict[str, List[str]]:
        """Run the registered validation rules."""
        violations = {}
        for rule_name, rule_info in self.validation_rules.items():
            try:
                rule_func = rule_info['function']
                result = rule_func(df)
                if result is not True and result is not None:
                    violations[rule_name] = result
            except Exception as e:
                violations[rule_name] = f"rule execution error: {e}"
        return violations

    def _detect_data_issues(self, df: pd.DataFrame) -> List[str]:
        """Detect structural data problems."""
        issues = []
        # Entirely empty columns
        empty_cols = df.columns[df.isnull().all()].tolist()
        if empty_cols:
            issues.append(f"Entirely empty columns: {empty_cols}")
        # Constant columns
        constant_cols = []
        for col in df.columns:
            if df[col].nunique() == 1:
                constant_cols.append(col)
        if constant_cols:
            issues.append(f"Constant columns: {constant_cols}")
        # High-cardinality categorical columns
        object_cols = df.select_dtypes(include=['object']).columns
        high_cardinality_cols = []
        for col in object_cols:
            if df[col].nunique() > len(df) * 0.5:  # unique values exceed 50% of rows
                high_cardinality_cols.append(col)
        if high_cardinality_cols:
            issues.append(f"High-cardinality categorical columns: {high_cardinality_cols}")
        return issues

    def _calculate_overall_score(self, quality_metrics: Dict,
                                 rule_violations: Dict) -> float:
        """Compute the overall quality score."""
        # Weights for the base quality metrics
        weights = {
            'completeness': 0.3,
            'consistency': 0.2,
            'accuracy': 0.25,
            'uniqueness': 0.15,
            'timeliness': 0.1
        }
        # Weighted average
        base_score = sum(quality_metrics[metric] * weight
                         for metric, weight in weights.items())
        # Penalty for rule violations
        violation_penalty = min(len(rule_violations) * 0.1, 0.3)
        return max(0, base_score - violation_penalty)

    def generate_quality_report(self, dataset_name: str) -> str:
        """Generate a human-readable quality report."""
        if dataset_name not in self.quality_metrics:
            return f"Dataset not found: {dataset_name}"
        metrics = self.quality_metrics[dataset_name]
        report = []
        report.append("=" * 60)
        report.append(f"Data Quality Report - {dataset_name}")
        report.append("=" * 60)
        report.append(f"Generated at: {metrics['timestamp']}")
        report.append(f"Overall quality score: {metrics['quality_score']:.2%}")
        report.append("")
        # Basic statistics
        report.append("Basic statistics:")
        stats = metrics['basic_stats']
        report.append(f"  Rows: {stats['row_count']:,}")
        report.append(f"  Columns: {stats['column_count']}")
        report.append(f"  Memory usage: {stats['memory_usage_mb']:.2f} MB")
        report.append("")
        # Quality metrics
        report.append("Quality metrics:")
        for metric, value in metrics['quality_metrics'].items():
            report.append(f"  {metric:15}: {value:.2%}")
        report.append("")
        # Rule violations
        if metrics['rule_violations']:
            report.append("Rule violations:")
            for rule, violation in metrics['rule_violations'].items():
                report.append(f"  {rule}: {violation}")
            report.append("")
        # Detected issues
        if metrics['data_issues']:
            report.append("Detected issues:")
            for issue in metrics['data_issues']:
                report.append(f"  - {issue}")
        return "\n".join(report)
7. 完整實(shí)戰(zhàn)示例
7.1 端到端數(shù)據(jù)整合流程
def complete_data_integration_example():
    """Complete data integration example."""
    # 1. Initialize configuration
    config = ConfigManager()
    # 2. Initialize the integration engine and quality monitor
    integration_engine = DataIntegrationEngine(config)
    quality_monitor = DataQualityMonitor()
    # 3. Define the data source configurations
    data_sources = [
        {
            'name': 'sales_data_csv',
            'type': 'csv',
            'path': 'data/sales_data.csv',
            'parameters': {
                'encoding': 'utf-8',
                'sep': ','
            }
        },
        {
            'name': 'customer_data_excel',
            'type': 'excel',
            'path': 'data/customer_data.xlsx',
            'parameters': {
                'sheet_name': 'Customers',
                'header': 0
            }
        },
        {
            'name': 'product_data_db',
            'type': 'database',
            'connection': 'postgresql',
            'table': 'products',
            'parameters': {
                'columns': ['product_id', 'product_name', 'category', 'price'],
                'where': 'active = true'
            }
        }
    ]
    # 4. Add data validation rules
    def validate_sales_data(df):
        """Validate basic business rules for sales data."""
        issues = []
        # Sales amounts must be non-negative
        if 'sales_amount' in df.columns:
            negative_sales = (df['sales_amount'] < 0).sum()
            if negative_sales > 0:
                issues.append(f"Found {negative_sales} records with negative sales amounts")
        # Sale dates must not lie in the future
        if 'sale_date' in df.columns:
            try:
                sale_dates = pd.to_datetime(df['sale_date'])
                future_sales = (sale_dates > datetime.now()).sum()
                if future_sales > 0:
                    issues.append(f"Found {future_sales} sales records dated in the future")
            except (ValueError, TypeError):
                issues.append("Malformed sale date format")
        return issues if issues else True
    quality_monitor.add_validation_rule(
        'sales_data_validation',
        validate_sales_data,
        'Validate basic business rules for sales data'
    )
    # 5. Load the data
    print("Loading data sources...")
    loaded_data = integration_engine.load_data(data_sources)
    # 6. Run data quality checks
    print("\nRunning data quality checks...")
    quality_reports = {}
    for name, df in loaded_data.items():
        quality_report = quality_monitor.validate_dataset(df, name)
        quality_reports[name] = quality_report
        # Print the quality report
        print(quality_monitor.generate_quality_report(name))
        print()
    # 7. Merge the data
    print("Merging data...")
    merge_strategy = {
        'type': 'concat',
        'align_columns': True
    }
    integrated_data = integration_engine.merge_data(loaded_data, merge_strategy)
    # 8. Final quality check on the merged dataset
    print("Running final quality check on the integrated data...")
    final_quality = quality_monitor.validate_dataset(integrated_data, 'integrated_dataset')
    print(quality_monitor.generate_quality_report('integrated_dataset'))
    # 9. Persist the integrated result
    output_path = 'output/integrated_data.csv'
    integrated_data.to_csv(output_path, index=False, encoding='utf-8')
    print(f"\nIntegrated data saved to: {output_path}")
    print(f"Final dataset shape: {integrated_data.shape}")
    # 10. Clean up resources
    integration_engine.clear_cache()
    if hasattr(integration_engine.readers['database'], 'close_connections'):
        integration_engine.readers['database'].close_connections()
    return integrated_data, quality_reports

# Run the complete example
if __name__ == "__main__":
    try:
        final_data, reports = complete_data_integration_example()
        print("Data integration pipeline completed!")
    except Exception as e:
        print(f"Data integration pipeline failed: {e}")
        import traceback
        traceback.print_exc()
7.2 Advanced Feature: Incremental Data Loading
class IncrementalDataLoader:
    """Incremental data loader."""
    def __init__(self, integration_engine: DataIntegrationEngine):
        self.engine = integration_engine
        self.last_load_info = self._load_last_run_info()

    def _load_last_run_info(self) -> Dict:
        """Load metadata about the previous run."""
        info_file = 'metadata/last_run_info.json'
        if os.path.exists(info_file):
            with open(info_file, 'r') as f:
                return json.load(f)
        return {}

    def _save_last_run_info(self, info: Dict) -> None:
        """Persist metadata about this run."""
        os.makedirs('metadata', exist_ok=True)
        with open('metadata/last_run_info.json', 'w') as f:
            json.dump(info, f, indent=2)

    def incremental_load(self, data_sources: List[Dict]) -> Dict[str, pd.DataFrame]:
        """Load only data that changed since the previous run."""
        incremental_data = {}
        current_run_info = {'timestamp': datetime.now().isoformat()}
        for source in data_sources:
            source_name = source['name']
            source_type = source['type']
            # Work out the incremental condition for this source
            incremental_condition = self._get_incremental_condition(
                source_name, source_type, source
            )
            if incremental_condition:
                # Rewrite the source configuration to include the condition
                modified_source = self._apply_incremental_filter(
                    source, incremental_condition
                )
                logger.info(f"Incremental load: {source_name}, condition: {incremental_condition}")
                data = self.engine.load_data([modified_source])
                if data and source_name in data:
                    incremental_data[source_name] = data[source_name]
            # Record this run so the next one knows where to resume.
            # The watermark assumes the incremental field is a timestamp.
            current_run_info[source_name] = {
                'loaded_at': datetime.now().isoformat(),
                'row_count': len(incremental_data.get(source_name, pd.DataFrame())),
                'last_value': datetime.now().isoformat()
            }
            if source_type in ['csv', 'excel'] and os.path.exists(source.get('path', '')):
                current_run_info[source_name]['file_mtime'] = os.path.getmtime(source['path'])
        # Persist the run metadata
        self._save_last_run_info(current_run_info)
        self.last_load_info = current_run_info
        return incremental_data

    def _get_incremental_condition(self, source_name: str, source_type: str,
                                   source: Dict) -> Optional[str]:
        """Build the incremental-load condition for a source."""
        if source_type == 'database':
            # Database sources filter on an incremental (watermark) column
            incremental_field = source.get('incremental_field', 'updated_at')
            last_value = self.last_load_info.get(source_name, {}).get('last_value')
            if last_value:
                return f"{incremental_field} > '{last_value}'"
            # First run: load everything
            return None
        elif source_type in ['csv', 'excel']:
            # File sources compare modification times
            file_path = source['path']
            if os.path.exists(file_path):
                current_mtime = os.path.getmtime(file_path)
                last_mtime = self.last_load_info.get(source_name, {}).get('file_mtime')
                if last_mtime and current_mtime > last_mtime:
                    # The file changed since the last run: reload it
                    return "file_modified"
                elif not last_mtime:
                    # First run
                    return None
        return None

    def _apply_incremental_filter(self, source: Dict, condition: str) -> Dict:
        """Apply the incremental filter condition to a source configuration."""
        source_type = source['type']
        modified_source = source.copy()
        # Copy the nested parameters dict so the caller's configuration is not mutated
        modified_source['parameters'] = dict(source.get('parameters', {}))
        if source_type == 'database':
            if 'where' in modified_source['parameters']:
                # Combine the existing WHERE clause with the incremental condition
                modified_source['parameters']['where'] = \
                    f"({modified_source['parameters']['where']}) AND ({condition})"
            else:
                modified_source['parameters']['where'] = condition
        return modified_source
# Example of incremental loading
def incremental_loading_example():
    """Incremental loading example."""
    config = ConfigManager()
    engine = DataIntegrationEngine(config)
    incremental_loader = IncrementalDataLoader(engine)
    data_sources = [
        {
            'name': 'daily_sales',
            'type': 'database',
            'connection': 'postgresql',
            'table': 'sales',
            'incremental_field': 'sale_date',
            'parameters': {
                'columns': ['sale_id', 'sale_date', 'amount', 'customer_id']
            }
        }
    ]
    print("Running incremental data load...")
    new_data = incremental_loader.incremental_load(data_sources)
    for name, df in new_data.items():
        print(f"Incrementally loaded {name}: {len(df)} new rows")
    return new_data
8. Code Self-Review Checklist
8.1 Code Quality Checks
Before deploying the data integration system, run a thorough code quality review:
Functional completeness
- All data source types (CSV, Excel, SQL) are fully supported
- Error handling covers every anticipated exception
- Data validation and quality monitoring work as intended
- Incremental loading and caching behave correctly
Performance
- Chunked processing of large datasets is implemented correctly
- The database connection pool is sized appropriately
- Memory usage stays within acceptable bounds
- The caching strategy measurably reduces repeated reads
Code style
- Function and class names follow PEP 8
- Type hints are complete and accurate
- Docstrings cover every public interface
- Logging is detailed and uses sensible levels
Security
- Database passwords and other secrets are managed via environment variables
- File paths are validated to prevent directory traversal attacks
- SQL queries are parameterized to prevent injection attacks
- Error messages do not leak sensitive information
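The parameterization and secrets items above can be sketched as follows. This uses the standard library's sqlite3 as a stand-in for the production database, and the DB_PASSWORD variable name is illustrative, not a fixed convention:

```python
import os
import sqlite3

# Secrets come from the environment, never from source code
# (DB_PASSWORD is an example name)
db_password = os.environ.get("DB_PASSWORD", "")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# Parameterized query: user input is bound as a value,
# never concatenated into the SQL text
user_input = "alice' OR '1'='1"
rows = conn.execute(
    "SELECT id FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] - the injection attempt matches nothing
```

Had the input been concatenated into the SQL string, the `OR '1'='1'` clause would have matched every row; with a bound parameter it is just an oddly spelled name.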
8.2 Pre-Deployment Validation
def pre_deployment_validation():
    """Run all pre-deployment validation steps."""
    validation_steps = [
        ("Environment dependencies", validate_environment_dependencies),
        ("Configuration files", validate_configuration_files),
        ("Data source connectivity", validate_data_source_connectivity),
        ("Functional completeness", validate_functional_completeness),
        ("Performance benchmarks", run_performance_benchmarks)
    ]
    print("Starting pre-deployment validation...")
    results = {}
    for step_name, validation_func in validation_steps:
        print(f"\nValidating: {step_name}...")
        try:
            result = validation_func()
            results[step_name] = ("? passed", result)
            print("? passed")
        except Exception as e:
            results[step_name] = ("? failed", str(e))
            print(f"? failed: {e}")
    # Produce the validation report
    print("\n" + "=" * 60)
    print("Pre-Deployment Validation Report")
    print("=" * 60)
    all_passed = all("passed" in result[0] for result in results.values())
    for step_name, (status, details) in results.items():
        print(f"{step_name:20} {status}")
        if details and "failed" in status:
            print(f"  Details: {details}")
    print(f"\nOverall result: {'? all checks passed, ready to deploy' if all_passed else '? some checks did not pass'}")
    return all_passed, results

def validate_environment_dependencies():
    """Validate environment dependencies."""
    return EnvironmentValidator.validate_environment()[0]

def validate_configuration_files():
    """Validate configuration files."""
    config = ConfigManager()
    required_sections = ['data_sources', 'processing', 'logging']
    for section in required_sections:
        if section not in config.config:
            raise ValueError(f"Missing configuration section: {section}")
    return True

def validate_data_source_connectivity():
    """Validate data source connectivity."""
    # Implement concrete connectivity tests for each data source here
    return True

def validate_functional_completeness():
    """Validate functional completeness."""
    # Implement integration tests for the core features here
    return True

def run_performance_benchmarks():
    """Run performance benchmarks."""
    # Implement the performance benchmarks here
    return "Performance benchmarks passed"

# Run pre-deployment validation
if __name__ == "__main__":
    deployment_ready, validation_report = pre_deployment_validation()
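The connectivity stub above could be filled in along these lines. This is a minimal sketch using only the standard library, with sqlite3 standing in for the production database; the file path is illustrative:

```python
import os
import sqlite3

def check_file_source(path: str) -> bool:
    """A file source is reachable if the file exists and is readable."""
    return os.path.isfile(path) and os.access(path, os.R_OK)

def check_sqlite_source(dsn: str) -> bool:
    """Probe a database connection with a trivial query; any driver error means unreachable."""
    try:
        conn = sqlite3.connect(dsn)
        try:
            conn.execute("SELECT 1")
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False

print(check_sqlite_source(":memory:"))  # True
```

For PostgreSQL the probe would be the same shape with the appropriate driver: open a connection, run `SELECT 1`, and treat any exception as a failed check rather than letting it crash the validation run.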
9. Summary and Best Practices
9.1 Core Architectural Value
The automated data integration system built in this article addresses the key challenges of enterprise data integration:
- Unified interface: a consistent access interface across heterogeneous data sources
- Quality assurance: built-in data quality monitoring and validation
- Performance: support for large datasets and incremental loading
- Maintainability: a modular design that is easy to extend and maintain
- Observability: thorough logging and monitoring capabilities
9.2 Production Best Practices
9.2.1 Configuration Management
# Example production configuration
PRODUCTION_CONFIG = {
    "data_sources": {
        "database": {
            "timeout": 60,
            "pool_size": 10,
            "max_overflow": 20,
            "pool_recycle": 3600
        }
    },
    "processing": {
        "max_workers": 8,
        "chunk_size": 50000,
        "temp_directory": "/data/temp"
    },
    "monitoring": {
        "enable_metrics": True,
        "alert_threshold": 0.85,  # quality-score alerting threshold
        "slack_webhook": "https://hooks.slack.com/..."  # alert notifications
    }
}
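A common companion pattern is to let environment variables override individual values from a static config like this at deploy time; a minimal sketch, in which the PROD_* variable names are illustrative:

```python
import os

# Static defaults, mirroring the production config above
PRODUCTION_CONFIG = {
    "processing": {"max_workers": 8, "chunk_size": 50000}
}

def apply_env_overrides(config: dict) -> dict:
    """Return a copy of the config with values overridden from the environment."""
    result = {section: dict(values) for section, values in config.items()}
    for key in ("max_workers", "chunk_size"):
        env_value = os.environ.get(f"PROD_{key.upper()}")  # e.g. PROD_MAX_WORKERS
        if env_value is not None:
            result["processing"][key] = int(env_value)
    return result

os.environ["PROD_MAX_WORKERS"] = "4"
effective = apply_env_overrides(PRODUCTION_CONFIG)
print(effective["processing"]["max_workers"])  # 4
```

The original dict is left untouched, so the same defaults can be reused across runs with different environments.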
9.2.2 Error Handling Strategy
- Retries: use exponential backoff for transient errors
- Circuit breaking: temporarily disable data sources that keep failing
- Fallbacks: switch to a backup data source when the primary one fails
- Alerting: notify the right people of critical errors through multiple channels
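The retry item above can be sketched with a small helper; the delays are shortened here for illustration, and the flaky loader is a simulated data source:

```python
import random
import time

def retry_with_backoff(func, max_attempts=4, base_delay=0.01):
    """Call func, retrying on exception with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            # delay doubles each attempt, with a little jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.1))

# Simulated flaky data source: fails twice, then succeeds
calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "data"

print(retry_with_backoff(flaky_load))  # data
```

A production version would typically retry only on specific exception types (timeouts, connection resets) and log each attempt, so that permanent errors surface immediately instead of being retried.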
9.2.3 Performance Tuning Recommendations
- Database layer: use connection pooling, appropriate indexes, and query optimization
- Memory management: process large data in chunks and release resources promptly
- Parallelism: spread independent tasks across CPU cores
- Caching: use in-memory and on-disk caches judiciously
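The memory-management item above is what pandas' `chunksize` parameter provides: a minimal sketch that aggregates a CSV chunk by chunk, so only one chunk is resident at a time (the `amount` column name is illustrative):

```python
import io
import pandas as pd

def load_csv_in_chunks(source, chunk_size=100_000):
    """Aggregate a CSV incrementally instead of loading it whole."""
    total_rows = 0
    total_amount = 0.0
    for chunk in pd.read_csv(source, chunksize=chunk_size):
        total_rows += len(chunk)
        total_amount += chunk["amount"].sum()  # hypothetical column
    return total_rows, total_amount

# A tiny in-memory CSV stands in for a large file on disk
csv_data = io.StringIO("amount\n10\n20\n30\n40\n50\n")
rows, amount = load_csv_in_chunks(csv_data, chunk_size=2)
print(rows, amount)  # 5 150.0
```

The same shape works for any aggregation that can be computed incrementally (counts, sums, min/max); operations that need the whole dataset at once, such as a global sort, need a different strategy.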
9.3 Directions for Extension
The current architecture can be extended in several directions:
- Streaming data: integrate message queues such as Kafka or RabbitMQ
- Cloud data sources: support services such as AWS S3 and Google BigQuery
- API data sources: wrap REST, GraphQL, and similar data interfaces
- Data lineage: track where data comes from and how it is transformed
- Auto-discovery: automatically discover and register new data sources
9.4 Closing Remarks
Automated data integration is a core component of modern data architecture. With the solution presented in this article, organizations can build robust, scalable data integration pipelines that markedly improve how quickly and reliably they extract value from their data.
Remember that a good data integration system is not just a technical artifact; it is a balance between business value and technical feasibility. Start from business needs, tackle the most critical integration challenges first, and build out your data capabilities step by step.
This concludes the article on automatically integrating data from multiple sources (CSV, Excel, SQL) with Python. For more on Python data integration, see the other articles on 腳本之家.