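Pandas Data Cleaning: Handling Missing Values and Duplicates in Detail

Updated: 2025-12-17 09:00:43  Author: 用户6854537597769

This article walks through handling missing values and duplicate values when cleaning data with Pandas. The example code is explained in detail, so interested readers can follow along step by step.

Before you touch anything, use these moves to get the full picture of the data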

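import pandas as pd
import numpy as np

# Build some "dirty data" to simulate a real-world scenario
data = {
    'name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhang San', 'Zhao Liu', np.nan, 'Qian Qi'],
    'age': [25, 30, np.nan, 25, '-', 35, 28],
    'salary': [8000, np.nan, 12000, 8000, 15000, 'N/A', 9000],
    'department': ['Engineering', 'Sales', 'Engineering', 'Engineering', 'HR', 'Finance', 'Engineering']
}
df = pd.DataFrame(data)

# What most people do (not enough on its own)
# info() prints dtypes and non-null counts, but nothing about duplicates or
# placeholder strings. It returns None, so there is no need to wrap it in print().
df.info()

# What a veteran does (the right way)
print("=" * 50)
print("Missing values per column:")
print(df.isnull().sum())
print("\nMissing value ratio (%):")
print(df.isnull().sum() / len(df) * 100)

# Going further: see every problem at once
print("\nData quality report:")
print(f"Total rows: {len(df)}")
print(f"Fully duplicated rows: {df.duplicated().sum()}")
print(f"Rows with at least one missing value: {df.isnull().any(axis=1).sum()}")
print(f"Rows where every value is missing: {df.isnull().all(axis=1).sum()}")

Output (after the df.info() summary):

==================================================
Missing values per column:
name          1
age           1
salary        1
department    0
dtype: int64

Missing value ratio (%):
name          14.285714
age           14.285714
salary        14.285714
department     0.000000
dtype: float64

Data quality report:
Total rows: 7
Fully duplicated rows: 1
Rows with at least one missing value: 3
Rows where every value is missing: 0

See that? One glance shows that the Zhang San row is an exact duplicate and that three rows have at least one missing value. Note what the report does not show: the placeholders '-' in age and 'N/A' in salary are ordinary strings, so isnull() does not count them, and they also force both columns to object dtype. They have to be converted to NaN before the counts can be trusted.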
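Because isnull() only sees real NaN values, it can help to count the disguised ones separately. The following is a minimal sketch of that idea; the helper name count_placeholders and the PLACEHOLDERS list are assumptions made for illustration, not part of the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder list; extend it to match your own data source.
PLACEHOLDERS = ['-', 'N/A', 'null', 'na', '']

def count_placeholders(df, placeholders=PLACEHOLDERS):
    """Count, per column, the cells that hold a placeholder string."""
    return df.isin(placeholders).sum()

sample = pd.DataFrame({
    'age': [25, np.nan, '-', 35],
    'salary': [8000, 'N/A', 12000, np.nan],
})

true_missing = sample.isnull().sum()           # real NaN values only
disguised_missing = count_placeholders(sample)  # strings that mean "missing"

# Side-by-side view of both kinds of missing data
print(pd.DataFrame({'NaN': true_missing, 'placeholder': disguised_missing}))
```

Each column here has one real NaN and one placeholder, so the true amount of missing data is double what isnull() alone reports.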

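Missing values: fillna(0) is not the only tool

Many people handle missing values the blunt way:

# What many people do (shown on a copy so df itself stays untouched)
df_bad = df.copy()
df_bad['age'] = df_bad['age'].fillna(0)        # Age 0? Was the user just born?
df_bad['salary'] = df_bad['salary'].fillna(0)  # Salary 0? Asking for a mutiny?

Friend, do this and your data analysts will never forgive you!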
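To see why filling with 0 is harmful, compare what it does to a simple statistic. This small sketch uses made-up ages (an assumption for illustration) and compares a zero fill with a median fill.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry
ages = pd.Series([25, 30, np.nan, 35, 28], name='age')

filled_zero = ages.fillna(0)
filled_median = ages.fillna(ages.median())

# mean() skips NaN by default, so this is the mean of the 4 observed values
print(f"mean of observed values: {ages.mean():.1f}")           # 29.5
print(f"mean after fillna(0):    {filled_zero.mean():.1f}")    # 23.6
print(f"mean after median fill:  {filled_median.mean():.1f}")  # 29.4
```

A single zero drags the average age down by almost six years, while the median fill leaves it close to the observed mean.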

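Here come the real tricks

1. Recognize every odd way a missing value can be written

# In real data, missing values come in many disguises: "-", "N/A", "暂无" (Chinese for "none yet"), "null", "..."
# When loading from a file, declare them up front (messy_data.xlsx stands in for your own file):
# df = pd.read_excel('messy_data.xlsx', na_values=['-', 'N/A', '暂无数据', 'null', 'na', 'NaN'])

# Or repair the DataFrame after the fact
df = df.replace(['-', 'N/A', '暂无数据', 'null', 'na', 'NaN'], np.nan)

# replace() can leave these columns as object dtype, so convert the numeric ones explicitly
df[['age', 'salary']] = df[['age', 'salary']].apply(pd.to_numeric, errors='coerce')

print("Missing values after cleanup:")
print(df.isnull().sum())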
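The load-time variant can be tried without any file on disk. The sketch below applies the same na_values idea to pd.read_csv with an inline CSV; the CSV contents are invented for illustration.

```python
import io
import pandas as pd

# Inline CSV standing in for a real file (hypothetical contents)
raw = io.StringIO(
    "name,age,salary\n"
    "Zhang San,25,8000\n"
    "Li Si,-,N/A\n"
    "Wang Wu,null,12000\n"
)

# 'N/A' and 'null' are already in pandas' default NA list; '-' is not,
# so it has to be declared explicitly.
parsed = pd.read_csv(raw, na_values=['-', 'null'])

print(parsed.dtypes)
print(parsed.isnull().sum())
```

Because the placeholders are turned into NaN during parsing, age and salary come out as numeric columns straight away and no later type conversion is needed.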

2. Handle each column by its type: precision strikes

# Fill each column according to its data type and the business logic behind it
def smart_fillna(df):
    df_filled = df.copy()

    for col in df.columns:
        if df[col].dtype == 'object':  # string columns
            # Categorical variables: fill with the mode (most frequent value)
            mode_value = df[col].mode()
            if len(mode_value) > 0:
                df_filled[col] = df[col].fillna(mode_value[0])

        elif pd.api.types.is_numeric_dtype(df[col]):  # numeric columns
            # Numeric variables: pick the fill strategy case by case
            if 'age' in col.lower():
                # Age: fill with the median (robust to extreme values)
                df_filled[col] = df[col].fillna(df[col].median())
            elif 'salary' in col.lower() and 'department' in df.columns:
                # Salary: fill with the mean salary of the same department
                for dept in df['department'].dropna().unique():
                    dept_avg = df.loc[df['department'] == dept, col].mean()
                    mask = df[col].isnull() & (df['department'] == dept)
                    df_filled.loc[mask, col] = dept_avg
                # Departments with no known salary fall back to the overall median
                df_filled[col] = df_filled[col].fillna(df[col].median())
            else:
                # Any other numeric column: fill with the median
                df_filled[col] = df[col].fillna(df[col].median())

    return df_filled

df_smart_filled = smart_fillna(df)
print("Smart fill result:")
print(df_smart_filled)
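The per-department loop can also be written without an explicit loop. As a sketch of the same idea, the example below uses groupby(...).transform('mean') to build a per-row department mean and falls back to the overall median when a whole department has no salary. The staff table and the salary_filled column are invented for illustration.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    'department': ['Engineering', 'Engineering', 'Engineering', 'Sales', 'Sales', 'HR'],
    'salary': [8000.0, 12000.0, np.nan, 9000.0, np.nan, np.nan],
})

# One value per row: the mean salary of that row's department
# (NaN for HR, where no salary is known at all)
dept_mean = staff.groupby('department')['salary'].transform('mean')

# First try the department mean, then fall back to the overall median
staff['salary_filled'] = (
    staff['salary']
    .fillna(dept_mean)
    .fillna(staff['salary'].median())
)
print(staff)
```

The Engineering gap is filled with 10000, the Sales gap with 9000, and the HR gap with the overall median of 9000.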

3. Forward and backward fill: the go-to for time series

# Simulated time-series data
time_series_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'temperature': [20, 21, np.nan, 23, np.nan, np.nan, 26, 25, np.nan, 24],
    'sales': [100, 120, np.nan, 140, np.nan, 160, 170, np.nan, 190, 200]
})

# Note: fillna(method='ffill') is deprecated in recent pandas; use ffill() / bfill() instead

# Forward fill (use the previous value)
print("Forward fill:")
print(time_series_data.ffill())

# Backward fill (use the next value)
print("\nBackward fill:")
print(time_series_data.bfill())

# Combination: forward first, then backward for any leading gaps
print("\nCombined fill:")
print(time_series_data.ffill().bfill())

# Limit the fill range (so long runs of missing values are not filled blindly)
print("\nLimited fill (at most 1 consecutive value):")
print(time_series_data.ffill(limit=1))
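One thing to watch with forward fill: when a table holds several series stacked on top of each other, a plain ffill() carries the last value of one series into the start of the next. The sketch below, with an invented store column, shows the difference between a plain fill and a per-group fill.

```python
import numpy as np
import pandas as pd

# Two stores stacked in one table (invented data)
readings = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': list(pd.date_range('2024-01-01', periods=3)) * 2,
    'sales': [100.0, np.nan, np.nan, np.nan, 200.0, np.nan],
})
readings = readings.sort_values(['store', 'date']).reset_index(drop=True)

# Plain forward fill: store A's last value leaks into store B's first row
naive = readings['sales'].ffill()

# Per-store forward fill: each store is filled only from its own history
per_store = readings.groupby('store')['sales'].ffill()

print(pd.DataFrame({'store': readings['store'], 'naive': naive, 'per_store': per_store}))
```

In the per-store version the first row of store B stays missing, which is the honest answer: there is no earlier value for that store.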

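4. Interpolation: a graceful way to fill numeric data

# Interpolate only the numeric columns; the date column has nothing to fill
numeric_cols = ['temperature', 'sales']

# Linear interpolation
print("Linear interpolation:")
print(time_series_data[numeric_cols].interpolate(method='linear'))

# Polynomial interpolation (smoother; requires SciPy)
print("\nQuadratic polynomial interpolation:")
print(time_series_data[numeric_cols].interpolate(method='polynomial', order=2))

# Time-based interpolation (accounts for the spacing between timestamps; needs a DatetimeIndex)
print("\nTime interpolation:")
print(time_series_data.set_index('date').interpolate(method='time'))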
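On evenly spaced daily data, linear and time interpolation give the same result, so the difference is easy to miss. The sketch below uses an invented, irregularly spaced series to show where the two diverge.

```python
import numpy as np
import pandas as pd

# Irregular spacing: 1 day between the first two points, 3 days between the last two
irregular = pd.Series(
    [10.0, np.nan, 50.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05']),
)

# 'linear' ignores the index and treats the points as equally spaced
by_position = irregular.interpolate(method='linear')

# 'time' weights by the actual elapsed time between timestamps
by_time = irregular.interpolate(method='time')

print(pd.DataFrame({'linear': by_position, 'time': by_time}))
```

The missing point sits one quarter of the way through the four-day interval, so time interpolation gives 20, while positional linear interpolation puts it halfway at 30.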

Duplicates: more than a single drop_duplicates call

# Create a DataFrame that contains duplicates
duplicate_data = pd.DataFrame({
    'name': ['Zhang San', 'Li Si', 'Zhang San', 'Zhang San', 'Wang Wu'],
    'age': [25, 30, 25, 26, 35],  # note that Zhang San's age is inconsistent
    'salary': [8000, 9000, 8000, 8000, 12000],
    'department': ['Engineering', 'Sales', 'Engineering', 'Engineering', 'Engineering']
})

# What many people stop at
print("Simple deduplication (keeps the first occurrence by default):")
print(duplicate_data.drop_duplicates())

# Real cases are more complicated
print("\nDeduplicate on key columns (for example name + department):")
print(duplicate_data.drop_duplicates(subset=['name', 'department']))

print("\nKeep the last occurrence:")
print(duplicate_data.drop_duplicates(subset=['name'], keep='last'))

print("\nShow every row involved in a duplicate (keep=False marks all of them):")
print(duplicate_data[duplicate_data.duplicated(subset=['name'], keep=False)])
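Before choosing which duplicate to keep, it is worth knowing whether the duplicates actually agree with each other. As a rough sketch, the snippet below counts distinct values per key to surface conflicting records like Zhang San's two ages; the variable names are assumptions for illustration.

```python
import pandas as pd

records = pd.DataFrame({
    'name': ['Zhang San', 'Li Si', 'Zhang San', 'Zhang San', 'Wang Wu'],
    'age': [25, 30, 25, 26, 35],
    'salary': [8000, 9000, 8000, 8000, 12000],
})

# For each name, how many distinct values does every other column take?
distinct_per_key = records.groupby('name').nunique()

# Keep only the keys where at least one column disagrees across rows
conflicts = distinct_per_key[(distinct_per_key > 1).any(axis=1)]
print(conflicts)
```

Only Zhang San is reported, with two distinct ages and one salary, which signals that a blind keep='first' would silently discard one of the ages.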

# Going further: handle duplicates with a configurable strategy
def smart_handle_duplicates(df, key_columns, strategy='first'):
    """
    Handle duplicate rows with a configurable strategy.

    Args:
        df: DataFrame
        key_columns: list of columns that decide whether rows are duplicates
        strategy: 'first', 'last', 'mean', 'max', 'min'
    """
    if strategy in ['first', 'last']:
        return df.drop_duplicates(subset=key_columns, keep=strategy)

    # Numeric columns can be aggregated; other columns keep their first value
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns

    # Group by the key columns and aggregate everything else
    result = df.groupby(key_columns, as_index=False).agg({
        **{col: 'first' for col in non_numeric_cols if col not in key_columns},
        **{col: strategy for col in numeric_cols if col not in key_columns}
    })

    return result

print("\nSmart aggregation (first value kept for text columns, numeric columns averaged):")
print(smart_handle_duplicates(duplicate_data, ['name'], 'mean'))
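A single strategy for all numeric columns is not always what the business needs: averaging an age rarely makes sense, while averaging a salary might. One possible variation, sketched here with pandas named aggregation, picks a rule per column. The specific rules chosen (latest age, mean salary) are assumptions for illustration.

```python
import pandas as pd

records = pd.DataFrame({
    'name': ['Zhang San', 'Li Si', 'Zhang San', 'Zhang San', 'Wang Wu'],
    'age': [25, 30, 25, 26, 35],
    'salary': [8000, 9000, 8000, 8000, 12000],
    'department': ['Engineering', 'Sales', 'Engineering', 'Engineering', 'Engineering'],
})

# One aggregation rule per output column: (source column, function)
merged = records.groupby('name', as_index=False).agg(
    department=('department', 'first'),  # text: keep the first value
    age=('age', 'max'),                  # assume the largest age is the most recent
    salary=('salary', 'mean'),           # average the reported salaries
)
print(merged)
```

The three Zhang San rows collapse into one row with age 26 and salary 8000, and the rule applied to each column is visible at a glance.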

Advanced tricks: one call for common problems

1. One call to detect every data problem

def get_data_quality_report(df):
    """Generate a per-column data quality report."""
    report = pd.DataFrame({
        'dtype': df.dtypes,
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2),
        'unique_count': df.nunique(),
        # Table-wide count of fully duplicated rows, repeated on every line of the report
        'duplicate_rows': df.duplicated().sum()
    })

    # Add a quality score based on the share of missing values
    quality_score = []
    for col in df.columns:
        missing_ratio = df[col].isnull().sum() / len(df)
        if missing_ratio == 0:
            score = 100
        elif missing_ratio < 0.1:
            score = 80
        elif missing_ratio < 0.3:
            score = 60
        else:
            score = 30
        quality_score.append(score)

    report['quality_score'] = quality_score
    return report

print("Data quality report:")
print(get_data_quality_report(df))
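The scoring loop can also be expressed without iterating over columns. The sketch below reproduces the same thresholds with np.select; the function name score_missing_ratio and the sample columns are invented for illustration.

```python
import numpy as np
import pandas as pd

def score_missing_ratio(df):
    """Score each column by its share of missing values (same thresholds as above)."""
    ratio = df.isnull().mean().to_numpy()
    # Conditions are checked in order; the first match wins
    scores = np.select(
        [ratio == 0, ratio < 0.1, ratio < 0.3],
        [100, 80, 60],
        default=30,
    )
    return pd.Series(scores, index=df.columns, name='quality_score')

sample = pd.DataFrame({
    'complete': np.arange(20, dtype=float),
    'few_missing': [np.nan] * 1 + [1.0] * 19,      # 5% missing
    'some_missing': [np.nan] * 4 + [1.0] * 16,     # 20% missing
    'mostly_missing': [np.nan] * 10 + [1.0] * 10,  # 50% missing
})

scores = score_missing_ratio(sample)
print(scores)
```

The four columns score 100, 80, 60 and 30 respectively, matching what the loop version would assign.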

2. One call that handles 90% of data cleaning

def auto_clean_data(df,
                   remove_duplicates=True,
                   duplicate_subset=None,
                   fill_missing=True,
                   fill_strategy='smart'):
    """
    Automatic data cleaning.

    Args:
        df: the DataFrame to clean
        remove_duplicates: whether to drop duplicate rows
        duplicate_subset: columns used to decide what counts as a duplicate
        fill_missing: whether to fill missing values
        fill_strategy: 'smart', 'mean', 'median', 'mode', 'ffill', 'bfill'
    """
    df_clean = df.copy()

    # 1. Drop duplicates (subset=None compares whole rows)
    if remove_duplicates:
        df_clean = df_clean.drop_duplicates(subset=duplicate_subset or None, keep='first')

    # 2. Fill missing values
    if fill_missing:
        if fill_strategy == 'smart':
            df_clean = smart_fillna(df_clean)
        elif fill_strategy == 'mean':
            # numeric_only=True: text columns have no mean and stay unfilled
            df_clean = df_clean.fillna(df_clean.mean(numeric_only=True))
        elif fill_strategy == 'median':
            df_clean = df_clean.fillna(df_clean.median(numeric_only=True))
        elif fill_strategy == 'mode':
            df_clean = df_clean.fillna(df_clean.mode().iloc[0])
        elif fill_strategy == 'ffill':
            df_clean = df_clean.ffill()
        elif fill_strategy == 'bfill':
            df_clean = df_clean.bfill()
        else:
            raise ValueError(f"Unknown fill_strategy: {fill_strategy!r}")

    # 3. Reset the index
    df_clean = df_clean.reset_index(drop=True)

    return df_clean

# One call does it all
df_auto_cleaned = auto_clean_data(df, remove_duplicates=True, fill_strategy='smart')
print("Automatic cleaning result:")
print(df_auto_cleaned)
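An automatic cleaner is easier to trust when its result is checked. As a small sketch, the hypothetical helper check_clean below verifies the three things the function above is meant to guarantee: no missing values, no duplicates, and a fresh sequential index. The sample data is invented.

```python
import numpy as np
import pandas as pd

def check_clean(df, key_columns=None):
    """Return simple post-cleaning checks as a dict (hypothetical helper)."""
    return {
        'no_missing_values': not df.isnull().any().any(),
        'no_duplicate_rows': not df.duplicated(subset=key_columns).any(),
        'index_is_sequential': df.index.equals(pd.RangeIndex(len(df))),
    }

dirty = pd.DataFrame({
    'name': ['A', 'B', 'A'],
    'score': [1.0, np.nan, 1.0],
})

# A stand-in for the cleaning step: dedupe, fill with the median, reset the index
cleaned = (
    dirty.drop_duplicates()
    .fillna({'score': dirty['score'].median()})
    .reset_index(drop=True)
)

print("before:", check_clean(dirty))
print("after: ", check_clean(cleaned))
```

Running the checks before and after makes it obvious which guarantees the cleaning step actually delivered.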

Case study: 100,000 rows, done in 3 minutes

# Simulate a realistic large-data scenario
import time

def create_large_dirty_data(rows=100000):
    """Create a large dirty dataset."""
    np.random.seed(42)

    departments = ['Engineering', 'Sales', 'HR', 'Finance', 'Marketing']
    names = [f'Employee{i}' for i in range(1000)]

    data = {
        'employee_id': range(1, rows + 1),
        'name': np.random.choice(names, rows, replace=True),
        'department': np.random.choice(departments, rows),
        'age': np.random.normal(35, 8, rows),
        'salary': np.random.normal(8000, 2000, rows),
        'performance_score': np.random.uniform(60, 100, rows)
    }

    df = pd.DataFrame(data)

    # Inject missing values (15% of each numeric column)
    for col in ['age', 'salary', 'performance_score']:
        missing_indices = np.random.choice(rows, int(rows * 0.15), replace=False)
        df.loc[missing_indices, col] = np.nan

    # Inject duplicate rows (5% of the rows are appended again)
    duplicate_indices = np.random.choice(rows, int(rows * 0.05), replace=False)
    df = pd.concat([df, df.iloc[duplicate_indices]], ignore_index=True)

    # Inject outliers (ages multiplied by 10); auto_clean_data does not treat these,
    # they are left for a separate outlier-handling step
    outlier_indices = np.random.choice(rows, int(rows * 0.02), replace=False)
    df.loc[outlier_indices, 'age'] = df.loc[outlier_indices, 'age'] * 10

    return df

# Performance test
large_dirty_df = create_large_dirty_data(100000)
print(f"Raw data: {len(large_dirty_df)} rows")

start_time = time.perf_counter()

# Use our automatic cleaning function
cleaned_large_df = auto_clean_data(
    large_dirty_df,
    remove_duplicates=True,
    duplicate_subset=['employee_id'],  # deduplicate on employee ID
    fill_strategy='smart'
)

end_time = time.perf_counter()

print(f"Cleaned data: {len(cleaned_large_df)} rows")
print(f"Processing time: {end_time - start_time:.2f} seconds")
# Share of rows dropped as duplicates (a row-count reduction, not a quality score)
print(f"Rows removed as duplicates: {(len(large_dirty_df) - len(cleaned_large_df)) / len(large_dirty_df) * 100:.1f}%")
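The case study injects outliers (ages multiplied by 10), but the cleaning function never looks at them. One common way to at least flag such values is the interquartile-range (IQR) rule. The sketch below is one possible approach under that assumption, not the article's method; the helper name flag_iqr_outliers, the factor k=1.5 and the sample ages are all illustrative choices.

```python
import numpy as np
import pandas as pd

def flag_iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Invented ages; 350 mimics an age that was multiplied by 10
ages = pd.Series([28, 31, 35, 38, 42, 350], name='age')
mask = flag_iqr_outliers(ages)

print(ages[mask])                       # the flagged values
print(ages.where(~mask).median())       # median with the outlier set aside
```

Flagging first and deciding later (drop, cap, or correct) keeps the business in control of what happens to suspicious values.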

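Summary: keep these principles in mind

Diagnose before you prescribe: always check the distribution of missing values with isnull().sum() first
Treat types differently: handle numeric and categorical data separately
Business logic first: age cannot be filled with 0, and salary cannot be filled arbitrarily
Time series are special: for time-series data, consider forward and backward fill first
Delete duplicates with care: decide which fields define a duplicate before removing anything

Remember one thing: data cleaning is not a technical job, it is a business job.

Next time you run into dirty data, don't panic. Follow these moves and 100,000 rows take about as long as a cup of coffee.

What odd data cleaning problems have you run into in your projects? Share them in the comments and let's see whose data is the dirtiest.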
