A Simple Implementation of Chinese Error Correction in Python

Updated: 2021-07-06 10:33:11  Author: 王大呀呀

Introduction

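This article uses Python to implement simple homophone-based correction of Chinese words. The current example only handles words in which a single character is wrong; interested readers can keep optimizing it. The steps are as follows: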

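First, prepare a file with one Chinese word per line. In the code below this file is /Users/wys/Desktop/token.txt; change the path to your own file before running the code.
Build a prefix tree (trie) class with an insert method and insert all the standard words into it. Also implement a search method for looking up words.
For each character of the misspelled input word, find up to 10 homophones and substitute each of them for that character. This yields at most n*10 candidate words, where n is the length of the word; there may be fewer because some pronunciations have fewer than 10 homophones.
Look up each candidate in the prefix tree. If a candidate is found, return it as the correction.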
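
The steps above can be sketched end to end without any third-party packages. In the minimal sketch below, the `HOMOPHONES` table is a tiny hand-written stand-in for a real pinyin lookup, and the names `build_trie`, `in_trie`, `candidates`, and `correct` are hypothetical helpers chosen for illustration; none of them come from the code in this article.

```python
# Hypothetical stand-in for a pinyin-based homophone lookup (assumption:
# a real implementation would derive these from each character's pinyin).
HOMOPHONES = {
    "专": ["转", "砖"],
    "姐": ["街", "接"],
    "到": ["道", "倒"],
}

END = "$"  # marker key meaning "a complete word ends here"


def build_trie(words):
    """Insert every dictionary word into a nested-dict prefix tree."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def in_trie(root, word):
    """Return True only if `word` was inserted as a complete word."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


def candidates(phrase):
    """Replace one position at a time with each homophone of that character."""
    result = []
    for i, ch in enumerate(phrase):
        for alt in HOMOPHONES.get(ch, []):
            result.append(phrase[:i] + alt + phrase[i + 1:])
    return result


def correct(root, phrase):
    """Return the first candidate found in the trie, or None."""
    for cand in candidates(phrase):
        if in_trie(root, cand):
            return cand
    return None


trie = build_trie(["转塘街道"])
print(correct(trie, "专塘街道"))
print(correct(trie, "转塘姐道"))
print(correct(trie, "转塘街到"))
```

Because each candidate differs from the input at exactly one position, the number of trie lookups is bounded by the total number of homophones across all positions, which is the n*10 limit described in the steps.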

Code

import re

import pinyin
from Pinyin2Hanzi import DefaultDagParams, dag

class corrector():
    def __init__(self):
        self.re_compile = re.compile(r'[\u4e00-\u9fff]')
        self.DAG = DefaultDagParams()

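    # Read the words from the dictionary file
    def getData(self):
        words = []
        with open("/Users/wys/Desktop/token.txt", encoding="utf-8") as f:
            for line in f:
                # Strip the trailing newline so it is not counted as part of the word
                word = line.strip().split(" ")[0]
                if word and len(word) > 2:
                    res = self.re_compile.findall(word)
                    if len(res) == len(word):  # keep only words made up entirely of Chinese characters
                        words.append(word)
        return words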

    # Convert a pinyin list into up to 10 candidate characters with the same pronunciation
    def pinyin_2_hanzi(self, pinyinList):
        result = []
        words = dag(self.DAG, pinyinList, path_num=10)
        for item in words:
            res = item.path  # conversion result
            result.append(res[0])
        return result

    # Build the candidate words obtained by homophone substitution
    def getCandidates(self, phrase):
        replaces = []
        for i, c in enumerate(phrase):
            # Homophones of the character at position i
            homophones = self.pinyin_2_hanzi(pinyin.get(c, format='strip', delimiter=',').split(','))
            for x in homophones:
                # Substitute only position i; str.replace would change every occurrence of c
                replaces.append(phrase[:i] + x + phrase[i + 1:])
        return set(replaces)

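    # Return the corrected result for each input word
    def getCorrection(self, words):
        result = []
        for word in words:
            for candidate in self.getCandidates(word):
                # Tree is the module-level Trie built in the __main__ block
                if Tree.search(candidate):
                    result.append(candidate)
                    break
        return result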

class Node:
    def __init__(self):
        self.word = False
        self.child = {}


class Trie(object):
    def __init__(self):
        self.root = Node()

    def insert(self, words):
        for word in words:
            cur = self.root
            for w in word:
                if w not in cur.child:
                    cur.child[w] = Node()
                cur = cur.child[w]

            cur.word = True

    def search(self, word):
        cur = self.root
        for w in word:
            if w not in cur.child:
                return False
            cur = cur.child[w]

        # True only if a complete word ends at this node
        return cur.word

if __name__ == '__main__':
    # Initialize the corrector
    c = corrector()
    # Load the words
    words = c.getData()
    # Initialize the prefix tree
    Tree = Trie()
    # Insert all the words into the prefix tree
    Tree.insert(words)
    # Test
    print(c.getCorrection(['专塘街道', '转塘姐道', '转塘街到']))
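
To try the script without an existing dictionary, a small token file can be generated first. The sketch below is only an illustration under assumptions: the helper `load_words`, the temporary file, and the sample entries are all hypothetical, and the filter mirrors the rule used above of keeping only all-Chinese words longer than two characters. Unlike the code above, it splits each line on any whitespace.

```python
import os
import re
import tempfile

HANZI = re.compile(r'[\u4e00-\u9fff]')


def load_words(path):
    """Read one word per line; keep all-Hanzi words longer than two characters."""
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            word = parts[0]  # ignore any extra fields after the word
            if len(word) > 2 and len(HANZI.findall(word)) == len(word):
                words.append(word)
    return words


# Write a throwaway sample file (hypothetical entries, for illustration only).
sample = "转塘街道\n西湖区 12\nabc街道\n杭州\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

words = load_words(tmp.name)
os.remove(tmp.name)
print(words)
```

Here the mixed entry and the two-character entry are filtered out, so only the first two sample words are kept.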

Result