Python Machine Learning: NLP Basics, Classifying JD.com Reviews

Updated: 2021-10-18 15:18:26 Author: 我是小白呀

Natural Language Processing (NLP) is an important field within computer science and artificial intelligence. It studies the theories and methods that allow people and computers to communicate effectively in natural language.

Overview

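Starting today, we set out on a journey through natural language processing (NLP). NLP lets machines process, understand, and use human language, building a bridge between machine language and human language.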

RNN

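RNN stands for Recurrent Neural Network. Compared with a CNN, an RNN is better at handling sequential information and at capturing the relationship between what comes earlier and what comes later. In NLP tasks, the probability of a word depends heavily on its context. For example, “明天天气真好” ("the weather will be really nice tomorrow") is far more probable than “明天天气篮球” ("tomorrow weather basketball").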
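To make "the probability depends on context" concrete, here is a small sketch. It is a counting-based bigram model, not an RNN, and the three-sentence corpus and the names `train_bigram` and `sentence_probability` are made up for illustration. It only shows that a model which looks at the preceding word scores the natural sentence higher than the odd one.

```python
from collections import Counter


def train_bigram(corpus):
    """Count how often each word appears as a context and as part of a word pair."""
    contexts, bigrams = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        words = sentence.split()
        vocab.update(words)
        contexts.update(words[:-1])            # words that have a successor
        bigrams.update(zip(words, words[1:]))  # (previous word, current word) pairs
    return contexts, bigrams, len(vocab)


def sentence_probability(sentence, contexts, bigrams, vocab_size):
    """Product of P(current | previous), with add-one smoothing."""
    words = sentence.split()
    probability = 1.0
    for prev, cur in zip(words, words[1:]):
        probability *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
    return probability


# Toy, pre-segmented corpus (hypothetical, not taken from the review data)
corpus = ["明天 天气 真好", "今天 天气 真好", "我 喜欢 篮球"]
contexts, bigrams, vocab_size = train_bigram(corpus)

p_natural = sentence_probability("明天 天气 真好", contexts, bigrams, vocab_size)
p_odd = sentence_probability("明天 天气 篮球", contexts, bigrams, vocab_size)
print(p_natural > p_odd)
```

An RNN learns this kind of dependency from data and can look further back than one word, but the idea is the same: the score of a word depends on what came before it.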

Weight sharing

[Figure: traditional neural network]

[Figure: RNN]

Weight sharing in an RNN is similar to weight sharing in a CNN: every time step uses the same weights, which greatly reduces the number of parameters.
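A quick back-of-the-envelope sketch shows how much sharing saves. The function names and the sizes (200-dimensional input, 200 hidden units, 100 time steps) are assumptions chosen to resemble the model used later; the "unshared" network is a hypothetical comparison, not something anyone would actually build.

```python
def rnn_param_count(input_dim: int, hidden_dim: int) -> int:
    """Parameters of one simple RNN cell, reused at every time step."""
    w_xh = input_dim * hidden_dim   # input -> hidden weights
    w_hh = hidden_dim * hidden_dim  # hidden -> hidden weights
    b_h = hidden_dim                # hidden bias
    return w_xh + w_hh + b_h


def unshared_param_count(input_dim: int, hidden_dim: int, steps: int) -> int:
    """Hypothetical network that learns a separate set of weights per time step."""
    return steps * rnn_param_count(input_dim, hidden_dim)


shared = rnn_param_count(200, 200)
unshared = unshared_param_count(200, 200, steps=100)
print(shared, unshared)
```

With sharing, the parameter count does not grow with the sequence length, so the same cell can process reviews of any length.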

Computation

[Figure: RNN computation process]

Computing the state:

[Figure: state computation]

Computing the output:

[Figure: output computation]
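The original figures for these two steps are not reproduced here. As a stand-in, the sketch below uses the textbook simple-RNN formulation, which we assume is what they showed: the new state is `tanh(x_t · W_xh + h_prev · W_hh + b_h)` and the output is `h_t · W_hy + b_y`. The names `rnn_forward`, `w_xh`, `w_hh`, `w_hy` and the toy sizes are illustrative choices.

```python
import numpy as np


def rnn_forward(xs, w_xh, w_hh, w_hy, b_h, b_y):
    """Run a simple RNN over a sequence.

    xs: (steps, input_dim); returns the outputs (steps, output_dim) and the final state.
    """
    h = np.zeros(w_hh.shape[0])  # initial state
    outputs = []
    for x_t in xs:
        # Compute the state: the same weights are used at every time step
        h = np.tanh(x_t @ w_xh + h @ w_hh + b_h)
        # Compute the output from the current state
        outputs.append(h @ w_hy + b_y)
    return np.stack(outputs), h


rng = np.random.default_rng(0)
steps, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
ys, h_last = rnn_forward(
    rng.normal(size=(steps, input_dim)),
    rng.normal(size=(input_dim, hidden_dim)),
    rng.normal(size=(hidden_dim, hidden_dim)),
    rng.normal(size=(hidden_dim, output_dim)),
    np.zeros(hidden_dim),
    np.zeros(output_dim),
)
print(ys.shape, h_last.shape)
```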

LSTM

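LSTM (Long Short-Term Memory) is a special kind of RNN designed to mitigate the vanishing and exploding gradient problems that appear when training on long sequences. Compared with a plain RNN, an LSTM performs better on longer sequences. Where an RNN passes along a single state h_t, an LSTM passes along two: c_t (cell state) and h_t (hidden state).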

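Stages

An LSTM uses gates to control how state is passed along.

An LSTM step has three stages:

Forget stage: selectively forget part of what was passed in from the previous node
Selective-memory stage: selectively memorize the input of this step, recording more of what matters and less of what does not
Output stage: decide what is emitted as the output of the current state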
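The three stages map onto the forget, input, and output gates of the standard LSTM formulation. Here is a minimal NumPy sketch of one time step under that assumption. The function name `lstm_step`, the packed weight layout, and the gate ordering (forget, input, candidate, output) are choices made for this sketch and are not the layout Keras uses internally.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, u, b):
    """One LSTM time step.

    w: (input_dim, 4 * hidden), u: (hidden, 4 * hidden), b: (4 * hidden,)
    """
    hidden = h_prev.shape[0]
    z = x_t @ w + h_prev @ u + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget stage: how much of c_prev to keep
    i = sigmoid(z[1 * hidden:2 * hidden])  # selective-memory stage: how much to write
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate content to write
    o = sigmoid(z[3 * hidden:4 * hidden])  # output stage: how much of the cell to expose
    c_t = f * c_prev + i * g               # new cell state
    h_t = o * np.tanh(c_t)                 # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, hidden = 3, 4
w = rng.normal(size=(input_dim, 4 * hidden))
u = rng.normal(size=(hidden, 4 * hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 steps
    h, c = lstm_step(x_t, h, c, w, u, b)
print(h.shape, c.shape)
```

Note that the cell state is updated by addition (`f * c_prev + i * g`) rather than by repeated matrix multiplication, which is the usual explanation for why gradients survive better over long sequences.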

The data

About 30,000 reviews, each labeled as positive or negative.

Positive reviews:

0 做父母一定要有刘墉这样的心态，不断地学习，不断地进步，不断地给自己补充新鲜血液，让自己保持一...
1 作者真有英国人严谨的风格，提出观点、进行论述论证，尽管本人对物理学了解不深，但是仍然能感受到...
2 作者长篇大论借用详细报告数据处理工作和计算结果支持其新观点。为什么荷兰曾经县有欧洲最高的生产...
3 作者在战几时之前用了＂拥抱＂令人叫绝．日本如果没有战败，就有会有美军的占领，没胡官僚主义的延...
4 作者在少年时即喜阅读，能看出他精读了无数经典，因而他有一个庞大的内心世界。他的作品最难能可贵...
5 作者有一种专业的谨慎，若能有幸学习原版也许会更好，简体版的书中的印刷错误比较多，影响学者理解...
6 作者用诗一样的语言把如水般清澈透明的思想娓娓道来，像一个经验丰富的智慧老人为我们解开一个又一...
7 作者提出了一种工作和生活的方式，作为咨询界的元老，不仅能提出理念，而且能够身体力行地实践，并...
8 作者妙语连珠，将整个60-70年代用层出不穷的摇滚巨星与自身故事紧紧相连什么是乡愁？什么是摇...
9 作者逻辑严密，一气呵成。没有一句废话，深入浅出，循循善诱，环环相扣。让平日里看到指标图释就头...

Negative reviews:

0 做为一本声名在外的流行书，说的还是广州的外企，按道理应该和我的生存环境差不多啊。但是一看之下...
1 作者有明显的自恋倾向，只有有老公养不上班的太太们才能像她那样生活。很多方法都不实用，还有抄袭...
2 作者完全是以一个过来的自认为是成功者的角度去写这个问题，感觉很不客观。虽然不是很喜欢，但是，...
3 作者提倡内调，不信任化妆品，这点赞同。但是所列举的方法太麻烦，配料也不好找。不是太实用。
4 作者的文笔一般，观点也是和市面上的同类书大同小异，不推荐读者购买。
5 作者的文笔还行，但通篇感觉太琐碎，有点文人的无病呻吟。自由主义者。作者的品性不敢苟同，无民族...
6 作者倒是个很小资的人,但有点自恋的感觉,书并没有什么大帮助
7 作为一本描写过去年代感情生活的小说，作者明显生活经验不足，并且文字功底极其一般，看后感觉浪费...
8 作为个人经验在网上谈谈可以，但拿来出书就有点过了，书中还有些明显的谬误。不过文笔还不错，建议...
9 昨天刚兴奋地写了评论,今天便遇一闹心事,因把此套书推荐给很多朋友,朋友就拖我在网上购,结果前...

Code

Preprocessing

import numpy as np
import pandas as pd
import jieba


# Load the stop words (a set keeps the membership test in pre_process fast)
stop_words = pd.read_csv("stopwords.txt", index_col=None, names=["stop_word"])
stop_words = set(stop_words["stop_word"].values.tolist())

def load_data():

    # Read the raw data
    neg = pd.read_excel("neg.xls", header=None)
    pos = pd.read_excel("pos.xls", header=None)

    # Debug output
    print(neg.head(10))
    print(pos.head(10))

    # Combine: positive reviews get label 1, negative reviews get label 0
    x = np.concatenate((pos[0], neg[0]))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))

    # Build the DataFrame
    data = pd.DataFrame({"content": x, "label": y})
    print(data.head())

    # Save (the index is written too, which is the "Unnamed: 0" column in the output)
    data.to_csv("data.csv")

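def pre_process(text):

    # Word segmentation
    text = jieba.lcut(text)

    # Remove tokens that consist only of digits
    text = [w for w in text if not str(w).isdigit()]

    # Remove empty and whitespace-only tokens
    text = list(filter(lambda w: w.strip(), text))

    # # Remove single-character tokens
    # text = list(filter(lambda w: len(w) > 1, text))

    # Remove stop words
    text = list(filter(lambda w: w not in stop_words, text))

    # Join the tokens with spaces so the Keras Tokenizer can split them later
    return " ".join(text)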

if __name__ == '__main__':

    # Build data.csv from the raw Excel files
    load_data()

    # Load the data
    data = pd.read_csv("data.csv")

    # Preprocess
    data["content"] = data["content"].apply(pre_process)

    # Save
    data.to_csv("processed.csv", index=False)
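To see what the filtering steps in `pre_process` do without installing jieba, here is a small sketch that starts from an already segmented token list. The helper name `filter_tokens`, the sample tokens, and the two-word stop list are made up for illustration; the three filters mirror the ones above.

```python
def filter_tokens(tokens, stop_words):
    """Apply the same three filters as pre_process to a pre-segmented token list."""
    # Remove tokens that consist only of digits
    tokens = [w for w in tokens if not str(w).isdigit()]
    # Remove empty and whitespace-only tokens
    tokens = [w for w in tokens if w.strip()]
    # Remove stop words
    tokens = [w for w in tokens if w not in stop_words]
    return " ".join(tokens)


# Hypothetical segmentation result and stop list
sample_tokens = ["这", "本", "书", " ", "2021", "很", "好"]
sample_stop_words = {"这", "很"}
print(filter_tokens(sample_tokens, sample_stop_words))
```

The result is a single space-separated string, which is the form the Keras `Tokenizer` in the next step expects.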

Main function

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split


def tokenizer():

    # Load the preprocessed data
    data = pd.read_csv("processed.csv", index_col=False)
    print(data.head())

    # Convert to a tuple
    X = tuple(data["content"])

    # Instantiate the tokenizer (only the most frequent words get an index)
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=30000)

    # Fit
    tokenizer.fit_on_texts(X)

    # Vocabulary (word -> index)
    word_index = tokenizer.word_index
    # print(word_index)
    print(len(word_index))

    # Convert texts to sequences of word indices
    sequence = tokenizer.texts_to_sequences(X)

    # Pad (or truncate) every sequence to length 100
    characters = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=100)

    # One-hot encode the labels
    labels = tf.keras.utils.to_categorical(data["label"])

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(characters, labels, test_size=0.2,
                                                        random_state=0)

    return X_train, X_test, y_train, y_test


def main():

    # Load the tokenized and padded data
    X_train, X_test, y_train, y_test = tokenizer()
    print(X_train[:5])
    print(y_train[:5])

    # Hyperparameters
    EMBEDDING_DIM = 200  # embedding dimension
    optimizer = tf.keras.optimizers.RMSprop()  # optimizer
    # The last layer applies softmax, so the loss receives probabilities, not logits
    # (from_logits=True is what triggers the UserWarning in the log below)
    loss = tf.losses.CategoricalCrossentropy(from_logits=False)  # loss

    # Model
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(30001, EMBEDDING_DIM),
        tf.keras.layers.LSTM(200, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax")
    ])
    model.build(input_shape=[None, 100])  # 100 matches maxlen in pad_sequences
    print(model.summary())

    # Compile
    model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

    # Save the best model, judged by validation accuracy
    checkpoint = tf.keras.callbacks.ModelCheckpoint("model/jindong.h5py", monitor='val_accuracy', verbose=1,
                                                    save_best_only=True,
                                                    mode='max')

    # Train
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=32, callbacks=[checkpoint])


if __name__ == '__main__':
    main()
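The leading zeros in the printed training arrays come from how the texts are indexed and padded. The sketch below imitates that behavior in plain Python so it can be tried without TensorFlow. It is a simplified stand-in, not the Keras implementation: the helper names and the toy texts are made up, and it skips the lowercasing and punctuation filtering that `Tokenizer` also performs. What it does keep are the Keras defaults that matter here: indices start at 1 in order of descending frequency, only indices below `num_words` are kept, and `pad_sequences` pads and truncates at the front.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, and so on."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, num_words, maxlen):
    """Convert texts to fixed-length index lists with front padding and front truncation."""
    rows = []
    for text in texts:
        seq = [word_index[w] for w in text.split()
               if w in word_index and word_index[w] < num_words]
        seq = seq[-maxlen:]                           # keep the last maxlen indices
        rows.append([0] * (maxlen - len(seq)) + seq)  # 0 is reserved for padding
    return rows


toy_texts = ["书 很 好", "书 不 好 不 好"]
toy_index = fit_word_index(toy_texts)
toy_rows = texts_to_padded(toy_texts, toy_index, num_words=10, maxlen=4)
print(toy_index)
print(toy_rows)
```

Because index 0 is reserved for padding, the embedding layer needs at least as many rows as the largest index plus one, which `Embedding(30001, ...)` comfortably covers for `num_words=30000`.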

Output:

Unnamed: 0 content label
0 0 做父母一定要有刘墉这样的心态 不断地学习 不断地进步不断地给 ... 1
1 1 作者真有英国人严谨 的风格提出观点 进行论述论证尽管本人对 物理学 了... 1
2 2 作者长篇大论借用详细 报告数据处理工作和计算结果支持其新观点 为什么荷... 1
3 3 作者在战 几时 之前用了＂拥抱＂令人叫绝．日本如果没有战败就 ... 1
4 4 作者在少年时即喜阅读能看出他精读了无数 经典因而他有一个 庞大... 1
49366
[[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 205 1808 119 40 56 2139 1246 434 3594 1321 1715
9 165 15 22]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 1157 8 3018 1 62 851 34 4 23 455 365
46 239 1157 3903]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 1579 53 388 958 294 1146 18 1 49 1146 305
2365 1 496 235]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 213 4719 509
730 21403 524 42]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 105 159 1 5 16 11
24 2 299 294 8 39 306 16796 11 1778 29 2674
640 2 543 1820]]
[[0. 1.]
[0. 1.]
[1. 0.]
[1. 0.]
[1. 0.]]
2021-09-20 18:59:07.031583: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2021-09-20 18:59:07.031928: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-20 18:59:07.037546: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-VVCH1JQ
2021-09-20 18:59:07.037757: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-VVCH1JQ
2021-09-20 18:59:07.043925: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 200) 6000200
_________________________________________________________________
lstm (LSTM) (None, 200) 320800
_________________________________________________________________
dropout (Dropout) (None, 200) 0
_________________________________________________________________
dense (Dense) (None, 64) 12864
_________________________________________________________________
dense_1 (Dense) (None, 2) 130
=================================================================
Total params: 6,333,994
Trainable params: 6,333,994
Non-trainable params: 0
_________________________________________________________________
None
2021-09-20 18:59:07.470578: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/2
C:\Users\Windows\Anaconda3\lib\site-packages\tensorflow\python\keras\backend.py:4870: UserWarning: "`categorical_crossentropy` received `from_logits=True`, but the `output` argument was produced by a sigmoid or softmax activation and thus does not represent logits. Was this intended?"
'"`categorical_crossentropy` received `from_logits=True`, but '
528/528 [==============================] - 272s 509ms/step - loss: 0.3762 - accuracy: 0.8476 - val_loss: 0.2835 - val_accuracy: 0.8839

Epoch 00001: val_accuracy improved from -inf to 0.88391, saving model to model\jindong.h5py
2021-09-20 19:03:40.563733: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Epoch 2/2
528/528 [==============================] - 299s 566ms/step - loss: 0.2069 - accuracy: 0.9266 - val_loss: 0.2649 - val_accuracy: 0.9005

Epoch 00002: val_accuracy improved from 0.88391 to 0.90050, saving model to model\jindong.h5py
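The parameter counts in the model summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are made up for this check, and the sizes are the ones from the model definition above.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    return vocab_size * dim  # one vector per index


def lstm_params(input_dim: int, units: int) -> int:
    # 4 blocks (three gates plus the candidate), each with input weights,
    # recurrent weights, and a bias
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim: int, units: int) -> int:
    return input_dim * units + units  # weights plus bias


layer_params = [
    embedding_params(30001, 200),  # embedding
    lstm_params(200, 200),         # lstm
    0,                             # dropout has no parameters
    dense_params(200, 64),         # dense
    dense_params(64, 2),           # dense_1
]
print(layer_params, sum(layer_params))
```

The embedding layer accounts for about 6.0 of the 6.3 million parameters, so the vocabulary size is the main lever if the model needs to be smaller.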
