使用Python算法實(shí)現(xiàn)從字符串中提取重復(fù)子串

更新時(shí)間：2025年10月22日 09:37:04 作者：東方佑

在文本處理和數(shù)據(jù)分析中,經(jīng)常需要從字符串中提取重復(fù)出現(xiàn)的子串,本文將解析一個(gè)高效的Python算法,用于從給定字符串中提取長(zhǎng)度超過(guò)3的重復(fù)子串,需要的朋友可以參考下

算法概述

該算法包含兩個(gè)核心函數(shù)：

replace_text() - 預(yù)處理字符串，替換低頻字符
compute_sub_string_list() - 提取重復(fù)子串

1. 預(yù)處理函數(shù)：replace_text()

def replace_text(copy_text):
    replace_char = ""
    # 統(tǒng)計(jì)字符頻率
    for char, count in Counter(list(copy_text)).items():
        if count == 1:  # 只出現(xiàn)一次的字符
            if replace_char == "":
                replace_char = char  # 選擇第一個(gè)低頻字符作為替換字符
            else:
                # 用替換字符替換其他低頻字符
                copy_text = copy_text.replace(char, replace_char)
    
    if replace_char != "":
        # 分割字符串并篩選長(zhǎng)度>3的子串
        return [sub for sub, _ in Counter(copy_text.split(replace_char)).items() 
                if len(sub) > 3]
    else:
        return [copy_text]  # 沒(méi)有低頻字符時(shí)返回整個(gè)字符串

功能說(shuō)明：

找出所有只出現(xiàn)一次的字符
使用第一個(gè)低頻字符替換其他低頻字符
用替換字符分割字符串
返回長(zhǎng)度超過(guò)3的子串列表

2. 主處理函數(shù)：compute_sub_string_list()

def compute_sub_string_list(text1):
    text_list = replace_text(text1)  # 預(yù)處理
    new_text_list = []

    for one_text in text_list:
        if len(one_text) == 4:
            # 處理長(zhǎng)度為4的子串
            if one_text in new_text_list:
                continue
            if text1.count(one_text) > 1:
                new_text_list.append(one_text)
        else:
            # 處理長(zhǎng)度>4的子串
            max_count = 0
            max_str = ""
            while len(one_text) > 4:
                sub_str = one_text[:3]
                up_str_count = 0
                up_str = ""
                
                # 擴(kuò)展子串并檢查重復(fù)性
                for char in one_text[3:]:
                    sub_str += char
                    if sub_str in new_text_list:
                        continue
                    str_count = text1.count(sub_str)
                    if str_count > 1:
                        if up_str_count <= str_count:
                            up_str_count = str_count
                            up_str = sub_str
                    else:
                        break  # 停止擴(kuò)展
                
                # 更新最佳子串
                if up_str:
                    if up_str_count > max_count:
                        max_count = up_str_count
                        max_str = up_str
                
                # 滑動(dòng)窗口
                one_text = one_text[1:]
            
            if max_str:
                new_text_list.append(max_str)
    
    return new_text_list

功能說(shuō)明：

對(duì)預(yù)處理后的每個(gè)子串進(jìn)行處理
對(duì)于長(zhǎng)度為4的子串直接檢查重復(fù)性
對(duì)于更長(zhǎng)子串使用滑動(dòng)窗口技術(shù)：
1. 從3字符前綴開(kāi)始擴(kuò)展
2. 記錄出現(xiàn)次數(shù)最多的有效子串
3. 滑動(dòng)窗口繼續(xù)查找
返回所有符合條件的重復(fù)子串

算法優(yōu)勢(shì)

高效預(yù)處理：通過(guò)替換低頻字符優(yōu)化后續(xù)處理
智能子串?dāng)U展：動(dòng)態(tài)擴(kuò)展子串直到不再重復(fù)
滑動(dòng)窗口技術(shù)：高效遍歷所有可能子串
頻率優(yōu)先：優(yōu)先選擇出現(xiàn)次數(shù)最多的子串

使用示例

if __name__ == '__main__':
    text = "abracadabraabracadabra"
    result = compute_sub_string_list(text)
    print("重復(fù)子串:", result)
    # 輸出: ['abra', 'racad', 'acada', 'cadab', 'adabr']

應(yīng)用場(chǎng)景

文本模式識(shí)別
DNA序列分析
代碼重復(fù)檢測(cè)
自然語(yǔ)言處理中的短語(yǔ)提取
數(shù)據(jù)壓縮算法

這個(gè)算法通過(guò)巧妙的預(yù)處理和滑動(dòng)窗口技術(shù)，高效地從字符串中提取有意義的重復(fù)模式，特別適合處理包含重復(fù)模式的長(zhǎng)文本數(shù)據(jù)。

def replace_text(copy_text):
    replace_text = ""
    for i in Counter(list(copy_text)).items():
        if i[1] == 1:
            if replace_text == "":
                replace_text = i[0]
            else:
                copy_text = copy_text.replace(i[0], replace_text)
    if replace_text != "":

        return [i[0] for i in Counter(copy_text.split(replace_text)).items() if len(i[0]) > 3]
    else:
        return [copy_text]


def compute_sub_string_list(text1):
    copy_text = text1
    text_list = replace_text(copy_text)

    new_text_list = []

    for one_text in text_list:
        if len(one_text) == 4:
            if one_text in new_text_list:
                continue
            if text1.count(one_text) > 1:
                new_text_list.append(one_text)
        else:
            max_count = 0
            max_str = ""
            while True:

                sub_str = one_text[:3]
                up_str_count = 0
                up_str = ""

                for s in one_text[3:]:
                    sub_str += s
                    if sub_str in new_text_list:
                        continue
                    str_count = text1.count(sub_str)
                    if str_count > 1:
                        if up_str_count <= str_count:
                            up_str_count = str_count
                            up_str = sub_str
                    else:
                        break
                if up_str:
                    if up_str_count > max_count:
                        max_count = up_str_count
                        max_str = up_str
                if len(one_text) > 4:

                    one_text = one_text[1:]
                else:
                    break
            new_text_list.append(max_str)
    return new_text_list


if __name__ == '__main__':
    print()

以上就是使用Python算法實(shí)現(xiàn)從字符串中提取重復(fù)子串的詳細(xì)內(nèi)容，更多關(guān)于Python算法字符串提取重復(fù)子串的資料請(qǐng)關(guān)注腳本之家其它相關(guān)文章！

您可能感興趣的文章: