4 Ways to Implement Sensitive-Word Filtering in Python

Updated: 2020-09-12 15:46:00  Author: 我被狗咬了

This article introduces four ways to implement sensitive-word filtering in Python, to help you deal with inappropriate language. Read on if you are interested.

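In many everyday settings, words turn up that should not appear there, and we usually mask them with *, for example: 尼瑪 -> **. Profanity and politically sensitive terms should not appear in public venues, so we need some way to block them. Below I introduce a few simple approaches to sensitive-word masking.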

(I have turned the swear words into images wherever possible; otherwise the article could not be published.)

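Method 1: replace filtering

replace is the simplest form of string substitution: when a string may contain a sensitive word, we call the string's replace method to substitute * for that word.

Drawback:

It works when the text and the word list are small, but every word needs its own pass over the text, so it becomes inefficient as either grows.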

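import datetime
now = datetime.datetime.now()
speak = "you are a darn fool"  # sample text (placeholder)
filter_sentence = speak.replace("darn", "*")  # mask one sensitive word
print(filter_sentence, " | ", now)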

If there are several sensitive words, put them in a list and replace them one by one:

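speak = "you are a darn fool"  # sample text (placeholder)
dirty = ["darn", "fool"]  # sample word list (placeholder)
for i in dirty:
    speak = speak.replace(i, '*')
print(speak, " | ", now)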
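The intro example masks a word with one asterisk per character (尼瑪 -> **), while the snippets above substitute a single *. A minimal sketch of the length-preserving variant follows; the function name mask_words and the sample words are assumptions made for illustration, not part of the original article.

```python
def mask_words(text, words, mask_char="*"):
    """Replace every occurrence of each word with mask characters of equal length."""
    # Longer words first, so a short word cannot break up a longer one
    for word in sorted(words, key=len, reverse=True):
        if word:  # skip empty strings; replacing "" would insert masks everywhere
            text = text.replace(word, mask_char * len(word))
    return text


sample_words = ["darn", "heck"]  # hypothetical word list
print(mask_words("well darn, what the heck", sample_words))
```

Each word still costs one full pass over the text, so the drawback noted above applies here as well.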

Method 2: regular-expression filtering

Regular expressions are a solid matching tool. They are used in almost all everyday searching, and web crawlers rely on them heavily too. Here we mainly use "|" for matching; "|" means matching any one of several alternative target strings. A simple example:

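import re

def sentence_filter(keywords, text):
    # Escape each keyword so regex metacharacters are matched literally
    pattern = "|".join(re.escape(k) for k in keywords)
    return re.sub(pattern, "***", text)

# dirty and speak are the word list and the text from Method 1
print(sentence_filter(dirty, speak))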
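When the same word list is applied to many texts, it may help to compile the alternation once and to mask each hit with as many characters as it is long. The sketch below is one way to do that under those assumptions; build_pattern, regex_filter, and the keywords are hypothetical names chosen for illustration.

```python
import re


def build_pattern(keywords):
    """Compile one alternation pattern from a list of literal keywords."""
    # Longest first: alternation takes the first branch that matches at a position
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    if not ordered:
        return None
    return re.compile("|".join(re.escape(k) for k in ordered))


def regex_filter(pattern, text, mask_char="*"):
    """Mask each match with as many mask characters as the match is long."""
    if pattern is None:  # no keywords, nothing to mask
        return text
    return pattern.sub(lambda m: mask_char * len(m.group()), text)


pattern = build_pattern(["bad", "badword", "a.b"])  # hypothetical keywords
print(regex_filter(pattern, "a badword, a.b and aXb"))
```

Because of re.escape, the keyword "a.b" matches only the literal text "a.b" and leaves "aXb" alone.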

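Method 3: the DFA filtering algorithm

DFA stands for Deterministic Finite Automaton. Its basic idea is to look up sensitive words through state transitions, so that one scan of the text under inspection is enough to check for all sensitive words. (See the comments in the DFAUtils code for the implementation details.)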
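To make the state-transition idea concrete, here is a minimal sketch in which every state is a nested dict keyed by the next character. It is a simplified illustration only: build_states, first_match, and the sample words are assumptions, and the handling of skipped special characters is left out.

```python
END = "is_end"  # marker key; cannot clash with a single-character key


def build_states(words):
    """Build nested dicts: each key is a character, each value the next state."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {END: False})
        if word:
            node[END] = True  # a word ends in this state
    return root


def first_match(states, text, start):
    """Return the length of the shortest word starting at `start`, or 0."""
    node = states
    for offset, ch in enumerate(text[start:], 1):
        node = node.get(ch)
        if node is None:  # no transition for this character
            return 0
        if node[END]:
            return offset
    return 0


states = build_states(["ab", "abc", "xy"])  # hypothetical words
print(states)
print(first_match(states, "zzabc", 2))
```

Here "ab" and "abc" share the states for 'a' and 'b', which is why one walk from a start position can test both words at once.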

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Time: 2020/4/15 11:40
# @Software: PyCharm
# article_add: https://www.cnblogs.com/JentZhang/p/12718092.html
__author__ = "JentZhang"
import json

MinMatchType = 1  # shortest-match rule
MaxMatchType = 2  # longest-match rule


class DFAUtils(object):
    """
    DFA-based sensitive-word filter
    """

    def __init__(self, word_warehouse):
        """
        Initialize the filter
        :param word_warehouse: the word list
        """
        # Word trie
        self.root = dict()
        # Meaningless characters to skip during detection (these should eventually be
        # maintained in one dedicated place, such as a database or other storage)
        self.skip_root = [' ', '&', '!', '！', '@', '#', '$', '￥', '*', '^', '%', '?', '？', '<', '>', "《", '》']
        # Build the trie from the word list
        for word in word_warehouse:
            self.add_word(word)

    def add_word(self, word):
        """
        Add a word to the trie
        :param word: the word to add
        :return: None
        """
        now_node = self.root
        word_count = len(word)
        for i in range(word_count):
            char_str = word[i]
            if char_str in now_node:
                # The key exists: descend into it for the next iteration
                now_node = now_node[char_str]
            else:
                # The key is missing: create a new node
                new_node = {'is_end': False}
                now_node[char_str] = new_node
                now_node = new_node
            if i == word_count - 1:
                # Last character: mark the end of a word. Existing flags are never
                # reset, so adding "abc" does not erase an earlier "ab".
                now_node['is_end'] = True

    def check_match_word(self, txt, begin_index, match_type=MinMatchType):
        """
        Check whether a sensitive word starts at the given position
        :param txt: the text to check
        :param begin_index: index in txt at which to start matching
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: length of the match (skipped characters included), or 0 if there is none
        """
        now_map = self.root
        tmp_flag = 0  # characters scanned so far, including skipped special characters
        matched_length = 0  # scanned length at the most recent complete word

        for i in range(begin_index, len(txt)):
            word = txt[i]

            # Skip special characters, but only once the start of a word has matched
            if word in self.skip_root and now_map is not self.root:
                tmp_flag += 1
                continue

            # Follow the transition for this character
            now_map = now_map.get(word)
            if not now_map:  # no transition: stop scanning
                break
            tmp_flag += 1
            if now_map.get("is_end"):
                # A complete word ends here
                matched_length = tmp_flag
                # Shortest match returns at once; longest match keeps looking
                if match_type == MinMatchType:
                    break

        if matched_length < 2:  # a match must span at least 2 characters
            matched_length = 0
        return matched_length

    def get_match_word(self, txt, match_type=MinMatchType):
        """
        Get the matched words
        :param txt: the text to check
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: list of the matching substrings found in the text
        """
        matched_word_list = list()
        for i in range(len(txt)):
            length = self.check_match_word(txt, i, match_type)
            if length > 0:
                word = txt[i:i + length]
                matched_word_list.append(word)
        return matched_word_list

    def is_contain(self, txt, match_type=MinMatchType):
        """
        Check whether the text contains a sensitive word
        :param txt: the text to check
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: True if a sensitive word is found, otherwise False
        """
        for i in range(len(txt)):
            if self.check_match_word(txt, i, match_type) > 0:
                return True  # stop at the first hit
        return False

    def replace_match_word(self, txt, replace_char='*', match_type=MinMatchType):
        """
        Replace the matched words
        :param txt: the text to check
        :param replace_char: replacement character; each matched word is replaced character
            by character, e.g. text "你是大王八" with sensitive word "王八" and replacement
            character * gives "你是大**"
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: the text with sensitive words replaced (unchanged if none are found)
        """
        result_txt = txt
        for word in self.get_match_word(txt, match_type):
            result_txt = result_txt.replace(word, len(word) * replace_char)
        return result_txt


if __name__ == '__main__':
    # Sample word list and text, taken from the docstring example above
    word_warehouse = ['王八']
    dfa = DFAUtils(word_warehouse=word_warehouse)
    print('Trie structure:', json.dumps(dfa.root, ensure_ascii=False))
    # The text to check
    msg = '你是大王八'
    print('Contains a sensitive word:', dfa.is_contain(msg))
    print('Matched words:', dfa.get_match_word(msg))
    print('After replacement:', dfa.replace_match_word(msg))

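Method 4: the AC automaton

The AC (Aho-Corasick) automaton requires some background knowledge: the Trie. (Briefly: also called a prefix tree or dictionary tree, it is a structure for fast string processing that lets you quickly look up information stored on strings.)

For details, see: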

https://www.luogu.com.cn/blog/juruohyfhaha/trie-xue-xi-zong-jie
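As a quick illustration of the Trie idea, here is a minimal sketch with insert, exact lookup, and prefix lookup. The class and method names are assumptions chosen for this example and do not come from the linked article.

```python
class Trie:
    """A minimal prefix tree over characters."""

    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from this node; return the node reached, or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["bad", "badge"]:  # hypothetical words
    trie.insert(w)
print(trie.contains("bad"), trie.contains("ba"), trie.starts_with("badg"))
```

Words that share a prefix share nodes, so a lookup costs time proportional to the length of the query rather than to the number of stored words.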

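An AC automaton adds a fail pointer to each node of a Trie. When matching fails at the current node, the pointer moves to wherever the fail pointer points, so matching can carry straight on through the text without backtracking.

I won't go into the detailed matching mechanism here; for more on the AC automaton, see this article: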

http://www.dhdzp.com/article/128711.htm
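To show what the fail pointers look like, the sketch below builds a small Trie with integer states and computes the fail links breadth-first. It is an illustration under assumed names (build_fail_links, goto, fail) and covers only the link construction, not matching.

```python
from collections import deque


def build_fail_links(words):
    """Return (goto, fail) for a trie whose states are integers, root = 0."""
    goto = [{}]  # goto[state][char] -> next state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]

    fail = [0] * len(goto)  # the root and its direct children fail to the root
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            # Follow the parent's fail chain until some state has a `ch` edge
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
    return goto, fail


goto, fail = build_fail_links(["he", "she"])  # hypothetical words
print(fail)
```

With these two words, the state reached by "she" (state 5) fails to the state reached by "he" (state 2), because "he" is the longest suffix of "she" that is also a path from the root.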

In Python, the ahocorasick module (installed as pyahocorasick) gives a quick implementation:

# python3 -m pip install pyahocorasick
import ahocorasick

def build_actree(wordlist):
    actree = ahocorasick.Automaton()
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))
    actree.make_automaton()
    return actree

if __name__ == '__main__':
    # Sample word list and text (placeholders; substitute your own)
    wordlist = ['foo', 'bar']
    sent = 'barfoothefoobarman'
    actree = build_actree(wordlist=wordlist)
    sent_cp = sent
    # iter() yields (end_index, value); value is the (index, word) tuple stored above
    for i in actree.iter(sent):
        sent_cp = sent_cp.replace(i[1][1], "**")
        print("Blocked word:", i[1][1])
    print("Filtered result:", sent_cp)

Of course, we can also write an AC automaton by hand. For reference:

class TrieNode(object):
    __slots__ = ['value', 'next', 'fail', 'emit']

    def __init__(self, value):
        self.value = value  # character on the edge leading to this node
        self.next = dict()  # child nodes keyed by character
        self.fail = None    # node to fall back to on a mismatch
        self.emit = None    # set of words that end at this node


class AhoCorasic(object):
    __slots__ = ['_root']

    def __init__(self, words):
        self._root = AhoCorasic._build_trie(words)

    @staticmethod
    def _build_trie(words):
        assert isinstance(words, list) and words
        # Step 1: build the trie
        root = TrieNode('root')
        for word in words:
            node = root
            for c in word:
                if c not in node.next:
                    node.next[c] = TrieNode(c)
                node = node.next[c]
            if not node.emit:
                node.emit = {word}
            else:
                node.emit.add(word)
        # Step 2: set the fail pointers breadth-first (the list is used as a FIFO queue)
        queue = []
        queue.insert(0, (root, None))
        while len(queue) > 0:
            curr, parent = queue.pop()
            for sub in curr.next.values():  # itervalues() exists only in Python 2
                queue.insert(0, (sub, curr))
            if parent is None:
                continue
            elif parent is root:
                curr.fail = root
            else:
                fail = parent.fail
                while fail and curr.value not in fail.next:
                    fail = fail.fail
                if fail:
                    curr.fail = fail.next[curr.value]
                else:
                    curr.fail = root
            if curr.fail.emit:
                # Inherit words ending at the fail target, so a word that is a
                # suffix of the current path (e.g. "bc" inside "abc") is reported
                curr.emit = (curr.emit or set()) | curr.fail.emit
        return root

    def search(self, s):
        seq_list = []
        node = self._root
        for i, c in enumerate(s):
            matched = True
            # Follow fail pointers until a node with an edge for c is found
            while c not in node.next:
                if not node.fail:
                    matched = False
                    node = self._root
                    break
                node = node.fail
            if not matched:
                continue
            node = node.next[c]
            if node.emit:
                for word in node.emit:
                    from_index = i + 1 - len(word)
                    match_info = (from_index, word)
                    seq_list.append(match_info)
                # Restart from the root after a match
                node = self._root
        return seq_list


if __name__ == '__main__':
    aho = AhoCorasic(['foo', 'bar'])
    print(aho.search('barfoothefoobarman'))
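The search method returns (from_index, word) tuples rather than a filtered string. One way to turn such results into masked text is sketched below; mask_matches is a hypothetical helper, and the matches list is written out by hand in the shape search returns for the sample text, so the snippet runs on its own.

```python
def mask_matches(text, matches, mask_char="*"):
    """Mask each (from_index, word) span in text, position by position."""
    chars = list(text)
    for from_index, word in matches:
        for i in range(from_index, from_index + len(word)):
            chars[i] = mask_char
    return "".join(chars)


# Matches in the (from_index, word) shape that search() returns for this text
matches = [(0, "bar"), (3, "foo"), (9, "foo"), (12, "bar")]
print(mask_matches("barfoothefoobarman", matches))
```

Masking by position avoids a second scan of the text per word, which is what a str.replace call on each matched word would cost.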

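Those are four ways to implement sensitive-word filtering in Python. The first two are simple; the last two lean on algorithms, and once you understand how the algorithms work, the code is easy to follow. (DFA is a commonly used filtering technique, so it is worth learning.)

Finally, here is a sensitive-word list: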

https://github.com/qloog/sensitive_words
