A PyTorch Implementation of CharRNN for Text Classification and Generation
1 Introduction
This article covers the background needed to implement CharRNN-based text classification and text generation in PyTorch, and ends with complete working code for both tasks.
2 Overview of the Relevant APIs
Every network model in PyTorch has a constructor in which the model's static parameters are defined; these parameters determine the dimensions of the model's weights. At run time, a model instance receives dynamic tensor data and its forward method is called. Once the model output is obtained, the error against the ground-truth labels can be computed and backpropagated through an optimizer to adjust the model's parameters. The models and methods most commonly used in NLP are introduced below.
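As a minimal sketch of this pattern (the nn.Linear layer and the random data below are made up purely for illustration; they are not part of the CharRNN models later in this article):
import torch
from torch import nn

model = nn.Linear(4, 2)                                   # static sizes fixed in the constructor
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_func = nn.CrossEntropyLoss()

x = torch.randn(8, 4)                                     # a batch of dynamic tensor data
y = torch.randint(0, 2, (8,))                             # ground-truth class indices

pred = model(x)                                           # runs model.forward(x)
loss = loss_func(pred, y)                                 # error against the labels
optimizer.zero_grad()
loss.backward()                                           # backpropagation
optimizer.step()                                          # the optimizer adjusts the parameters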
2.1 nn.Embedding
The word-embedding layer is a common building block in NLP applications. Before word2vec, one option was to compute directly on a one-hot vector for each token, but one-hot is a sparse encoding and works poorly. word2vec produces a dense vector representation for each token, and published results show that such embeddings can substantially improve training.
PyTorch provides the Embedding module to implement this embedding step. The weights of the Embedding layer are the randomly initialized word vectors; they are adjusted and optimized continually during training.
The layer is created with: embedding = nn.Embedding(vocab_size, embedding_dim)
vocab_size: the size of the vocabulary
embedding_dim: the dimensionality of the embeddings, usually much smaller than the vocabulary size; values between 60 and 300 are typically sufficient
Usage: embedded = embedding(input)
input: the sentences to embed, of arbitrary shape; a single sentence is a list of token indices, e.g. [283, 4092, 1, ...]
output: the embedded representation of the data, shape=[*, embedding_dim], where * is the shape of input
Example code:
import torch
from torch import nn

embedding = nn.Embedding(5, 4)  # assume a vocabulary of only 5 tokens, embedding dimension 4
sents = [[1, 2, 3], [2, 3, 4]]  # two sentences: 'how are you' -> [1, 2, 3], 'are you ok' -> [2, 3, 4]
embed = embedding(torch.LongTensor(sents))
print(embed)  # shape=(2, 3, 4)
'''
tensor([[[-0.6991, -0.3340, -0.7701, -0.6255],
         [ 0.2969,  0.4720, -0.9403,  0.2982],
         [ 0.8902, -1.0681,  0.4035,  0.1645]],

        [[ 0.2969,  0.4720, -0.9403,  0.2982],
         [ 0.8902, -1.0681,  0.4035,  0.1645],
         [-0.7944, -0.1766, -1.5941,  0.4544]]], grad_fn=<EmbeddingBackward>)
'''
2.2 nn.RNN
RNNs are widely used in NLP. The structure of a vanilla RNN cell is shown below:
[Figure: structure of a vanilla RNN cell]
There are several variants of the RNN cell; they differ mainly in the activation function used inside the cell or in how the data is combined. Each cell takes an input x and the hidden state h from the previous time step, and produces an output y together with the hidden state for the current time step.
LSTM and GRU are refinements of the basic RNN cell. All three models take 3-D tensors as input. With batch_first=True the input shape is [batch, seq_len, input_dim]: the first dimension is the batch size (use 1 if you are not batching), the second is the sequence length, and the third is the input feature dimension, usually the embedding dimension.
rnn = nn.RNN(input_dim, hidden_dim, num_layers=1, batch_first=False, bidirectional=False)
input_dim: the number of features per input token; when an embedding layer is used this is the embedding dimension
hidden_dim: the number of hidden units, which determines the size of the RNN's output features
num_layers: the number of stacked RNN layers
batch_first: if True the first dimension is the batch, otherwise it is seq_len; default is False
bidirectional: whether the RNN is bidirectional; default is False
output, hidden = rnn(input, hidden)
input: a batch of input data, shape [batch, seq_len, input_dim]
hidden: the hidden state from the previous time step, shape [num_layers * num_directions, batch, hidden_dim]
output: the output features for every time step, shape [batch, seq_len, num_directions * hidden_dim]
import torch
from torch import nn

vocab_size = 5
embed_dim = 3
hidden_dim = 8

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)

sents = [[1, 2, 4], [2, 3, 4]]
embeded = embedding(torch.LongTensor(sents))
h0 = torch.zeros(1, embeded.size(0), hidden_dim)  # shape=(num_layers*num_directions, batch, hidden_dim)
out, hidden = rnn(embeded, h0)  # out.shape=(2,3,8), hidden.shape=(1,2,8)
print(out, hidden)
'''
tensor([[[-0.1556, -0.2721,  0.1485, -0.2081, -0.2231, -0.1459, -0.0319,  0.2617],
         [-0.0274,  0.1561, -0.0509, -0.1723, -0.2678, -0.2616,  0.0786,  0.4124],
         [ 0.2346,  0.4487, -0.1409, -0.0807, -0.0232, -0.4975,  0.4244,  0.8337]],

        [[ 0.0879,  0.1122,  0.1502, -0.3033, -0.2715, -0.1191,  0.1367,  0.5275],
         [ 0.2258,  0.4395, -0.1365,  0.0135, -0.0777, -0.5221,  0.4683,  0.8115],
         [ 0.0158,  0.3471,  0.0742, -0.0550, -0.0098, -0.5521,  0.5923,  0.8782]]],
       grad_fn=<TransposeBackward0>)
tensor([[[ 0.2346,  0.4487, -0.1409, -0.0807, -0.0232, -0.4975,  0.4244,  0.8337],
         [ 0.0158,  0.3471,  0.0742, -0.0550, -0.0098, -0.5521,  0.5923,  0.8782]]],
       grad_fn=<ViewBackward>)
'''
2.3 nn.LSTM
LSTM is a variant of the RNN that adds a memory cell to the structure. An LSTM cell looks like this:
[Figure: structure of an LSTM cell]
Each cell takes an input x, the previous hidden state h and the previous memory c, and produces an output y together with the current hidden state and the current memory c. Its usage is similar to nn.RNN; a short sketch follows the parameter list below.
lstm = nn.LSTM(input_dim, hidden_dim, num_layers=1, batch_first=True, bidirectional=False)
input_dim: the number of features per input token; when an embedding layer is used this is the embedding dimension
hidden_dim: the number of hidden units
output, (hidden, cell) = lstm(input, (hidden, cell))
input: a batch of input data, shape [batch, seq_len, input_dim]
hidden: the hidden state, shape [num_layers * num_directions, batch, hidden_dim]; the tuple passed in holds the previous states, the tuple returned holds the current ones
cell: the memory (cell) state, shape [num_layers * num_directions, batch, hidden_dim]
output: the output features for every time step, shape [batch, seq_len, num_directions * hidden_dim]
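A minimal usage sketch (the sizes below are chosen arbitrarily for illustration):
import torch
from torch import nn

batch, seq_len, embed_dim, hidden_dim = 2, 3, 4, 8
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
x = torch.randn(batch, seq_len, embed_dim)    # e.g. the output of an embedding layer
h0 = torch.zeros(1, batch, hidden_dim)        # (num_layers*num_directions, batch, hidden_dim)
c0 = torch.zeros(1, batch, hidden_dim)
output, (hn, cn) = lstm(x, (h0, c0))
print(output.shape, hn.shape, cn.shape)       # (2, 3, 8) (1, 2, 8) (1, 2, 8)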
2.4 nn.GRU
GRU is another RNN cell, but it is considerably simpler than LSTM. A GRU cell looks like this:
[Figure: structure of a GRU cell]
Each cell takes an input x and the previous hidden state h, and produces an output y together with the current hidden state. A short usage sketch follows the parameter list below.
rnn = nn.GRU(input_dim, hidden_dim, num_layers=1, batch_first=True, bidirectional=False)
input_dim: the number of features per input token; when an embedding layer is used this is the embedding dimension
hidden_dim: the number of hidden units
output, hidden = rnn(input, hidden)
input: a batch of input data, shape [batch, seq_len, input_dim]
hidden: the hidden state from the previous time step, shape [num_layers * num_directions, batch, hidden_dim]
output: the output features for every time step, shape [batch, seq_len, num_directions * hidden_dim]
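The corresponding sketch for GRU, which is a drop-in replacement for nn.RNN (again with arbitrary sizes):
import torch
from torch import nn

rnn = nn.GRU(4, 8, num_layers=1, batch_first=True)
x = torch.randn(2, 3, 4)          # (batch, seq_len, input_dim)
h0 = torch.zeros(1, 2, 8)         # (num_layers*num_directions, batch, hidden_dim)
output, hn = rnn(x, h0)
print(output.shape, hn.shape)     # torch.Size([2, 3, 8]) torch.Size([1, 2, 8])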
2.5 Loss Functions
MSELoss: mean squared error
loss(x, y) = (1/n) * Σ_i (x_i - y_i)^2
The inputs x and y may have any shape, but the two shapes must match.
CrossEntropyLoss: cross-entropy loss
loss(x, class) = -log( exp(x[class]) / Σ_j exp(x[j]) ) = -x[class] + log( Σ_j exp(x[j]) )
x: the score for each class, a 2-D tensor with shape=(batch, n)
class: a 1-D tensor of length batch, where each value is a class index in [0, n-1]. A short sketch showing how both losses are called is given below.
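A usage sketch for both losses (all values are made up for illustration):
import torch
from torch import nn

mse = nn.MSELoss()
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0, 2.0, 5.0])
print(mse(x, y))                                   # mean((x - y)^2) = 4/3

ce = nn.CrossEntropyLoss()
scores = torch.tensor([[2.0, 0.5], [0.1, 1.5]])    # shape=(batch=2, n=2)
target = torch.tensor([0, 1])                      # one class index per sample
print(ce(scores, target))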
3 Classification with a Character-Level RNN
This section introduces training and using character-level representations. The corpus is NLTK's names corpus, and the task is to predict a person's gender from their name. The corpus has two categories, female and male, each containing roughly 4,000 names. It is a good fit for character-level RNN classification, because names are short and cannot be split further into words, so word vectors do not apply.
The first time the names corpus is used it has to be downloaded by running nltk.download('names').
The vocabulary of a character-level RNN is simply the set of individual characters: for English, the 26 letters plus the few extra characters (hyphen, space, apostrophe) that can appear inside names; see the chars variable in the code below. For simplicity, all names are converted to lowercase.
The network is simple: one RNN layer that learns features of the name sequence, and one fully connected layer that maps those high-dimensional features to the two gender classes. This is implemented by the CharRNN class. No embedding layer is used here; each character is one-hot encoded instead, although using nn.Embedding would work equally well.
Training and inference are wrapped in the Model class, which provides three methods: train(), evaluate() and predict(), used for training, evaluation and prediction respectively. See the code and comments below.
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
import sklearn.model_selection
import string
import random
import nltk

nltk.download('names')
from nltk.corpus import names
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")
chars = string.ascii_lowercase + '-' + ' ' + "'"
'''
Encode a name as a matrix: each character is one-hot encoded and the character
vectors are stacked, e.g. for "abc":
abc = [ [1, 0, ..., 0]
        [0, 1, 0, ...]
        [0, 0, 1, ...] ]
abc.shape = (len("abc"), len(chars))
'''
def name2vec(name):
    ids = [chars.index(c) for c in name if c not in ["\\"]]
    a = np.zeros(shape=(len(ids), len(chars)))
    for i, idx in enumerate(ids):
        a[i][idx] = 1
    return a
def load_data():
    female_file, male_file = names.fileids()
    f1_names = names.words(female_file)
    f2_names = names.words(male_file)
    data_set = [(name.lower(), 0) for name in f1_names] + [(name.lower(), 1) for name in f2_names]
    data_set = [(name2vec(name), sexy) for name, sexy in data_set]
    random.shuffle(data_set)
    return data_set
class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size, output_size):
        super(CharRNN, self).__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
        self.liner = nn.Linear(hidden_size, output_size)

    def forward(self, input):
        h0 = torch.zeros(1, 1, self.hidden_size, device=device)  # initial hidden state (batch size is fixed to 1 here)
        output, hidden = self.rnn(input, h0)
        output = output[:, -1, :]  # use only the output of the last time step as the feature
        output = self.liner(output)
        output = F.softmax(output, dim=1)  # note: CrossEntropyLoss already applies log-softmax internally, so returning raw logits is also common
        return output
hidden_dim = 128
output_dim = 2
class Model:
    def __init__(self, epoches=100):
        self.model = CharRNN(len(chars), hidden_dim, output_dim)
        self.model.to(device)
        self.epoches = epoches

    def train(self, train_set):
        loss_func = nn.CrossEntropyLoss()
        optimizer = torch.optim.RMSprop(self.model.parameters(), lr=0.0003)
        for epoch in range(self.epoches):
            total_loss = 0
            for x in range(1000):  # train on 1000 randomly sampled names per epoch
                name, sexy = random.choice(train_set)
                # The RNN expects input of shape [batch, seq_len, embed_dim]. Names have variable
                # length and are not padded here, so batch_size is 1 and the encoded name is
                # wrapped in a single-element list.
                name_tensor = torch.tensor([name], dtype=torch.float, device=device)
                # CrossEntropyLoss only needs the class index, not a one-hot target
                sexy_tensor = torch.tensor([sexy], dtype=torch.long, device=device)
                optimizer.zero_grad()
                pred = self.model(name_tensor)  # [batch, out_dim]
                loss = loss_func(pred, sexy_tensor)
                loss.backward()
                total_loss += loss
                optimizer.step()
            print("Training: in epoch {} loss {}".format(epoch, total_loss/1000))
    def evaluate(self, test_set):
        with torch.no_grad():  # no gradient computation during evaluation
            correct = 0
            for x in range(1000):  # evaluate on 1000 names randomly sampled from the test set
                name, sexy = random.choice(test_set)
                name_tensor = torch.tensor([name], dtype=torch.float, device=device)
                pred = self.model(name_tensor)
                if torch.argmax(pred).item() == sexy:
                    correct += 1
            print('Evaluating: test accuracy is {}%'.format(correct/10.0))
    def predict(self, name):
        p = name2vec(name.lower())
        name_tensor = torch.tensor([p], dtype=torch.float, device=device)
        with torch.no_grad():
            out = self.model(name_tensor)
            out = torch.argmax(out).item()
        sexy = 'female' if out == 0 else 'male'
        print('{} is {}'.format(name, sexy))
if __name__ == "__main__":
    model = Model(10)
    data_set = load_data()
    train, test = sklearn.model_selection.train_test_split(data_set)
    model.train(train)
    model.evaluate(test)
    model.predict("Jim")
    model.predict('Kate')
'''
Evaluating: test accuracy is 82.6%
Jim is male
Kate is female
'''
4 Text Generation with a Character-Level RNN
The idea behind text generation is to train the network's weights by having it learn which character should come next. We again use the names corpus and try to train a network that generates names for a given gender. Unlike classification, which only computes the error on the final output, generation computes an error at every step of the sequence, so during training the characters are fed into the network one at a time. Because names are generated conditioned on gender, the gender's one-hot vector is concatenated onto the input data as part of each training sample.
The model is implemented by the CharRNN class; training and usage are handled by the Model class, which provides train() for training the model and sample() for sampling generated names from it.
# coding=utf-8
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
import string
import random
import nltk
nltk.download('names')
from nltk.corpus import names
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")
# the character '!' marks the end of a name
chars = string.ascii_lowercase + '-' + ' ' + "'" + '!'
hidden_dim = 128
output_dim = len(chars)
# encode a name such as "abc" as [[1,0,...], [0,1,0,...], [0,0,1,...]]
def name2input(name):
    ids = [chars.index(c) for c in name if c not in ["\\"]]
    a = np.zeros(shape=(len(ids), len(chars)), dtype=np.int64)
    for i, idx in enumerate(ids):
        a[i][idx] = 1
    return a
# encode a name such as "abc" as [0, 1, 2] (indices into chars)
def name2target(name):
    ids = [chars.index(c) for c in name if c not in ["\\"]]
    return ids
# female -> [[1, 0]], male -> [[0, 1]]
def sexy2input(sexy):
    a = np.zeros(shape=(1, 2), dtype=np.int64)
    a[0][sexy] = 1
    return a
def load_data():
    female_file, male_file = names.fileids()
    f1_names = names.words(female_file)
    f2_names = names.words(male_file)
    data_set = [(name.lower(), 0) for name in f1_names] + [(name.lower(), 1) for name in f2_names]
    random.shuffle(data_set)
    print(data_set[:10])
    return data_set
'''
[('yoshiko', 0), ('timothea', 0), ('giorgi', 1), ('thedrick', 1), ('tessie', 0), ('keith', 1), ('carena', 0), ('anthea', 0), ('cathyleen', 0), ('almeta', 0)]
'''
class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size, output_size):
        super(CharRNN, self).__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        # the input dimension is extended by 2 for the gender one-hot vector
        self.rnn = nn.GRU(vocab_size + 2, hidden_size, batch_first=True)
        self.liner = nn.Linear(hidden_size, output_size)

    def forward(self, sexy, name, hidden=None):
        if hidden is None:
            hidden = torch.zeros(1, 1, self.hidden_size, device=device)  # initial hidden state
        # prepend the gender vector to every input character
        input = torch.cat([sexy, name], dim=2)
        output, hidden = self.rnn(input, hidden)
        output = self.liner(output)
        output = F.dropout(output, 0.3)
        output = F.softmax(output, dim=2)
        return output.view(1, -1), hidden
class Model:
    def __init__(self, epoches):
        self.model = CharRNN(len(chars), hidden_dim, output_dim)
        self.model.to(device)
        self.epoches = epoches

    def train(self, train_set):
        loss_func = nn.CrossEntropyLoss()
        optimizer = torch.optim.RMSprop(self.model.parameters(), lr=0.001)
        for epoch in range(self.epoches):
            total_loss = 0
            for x in range(1000):  # train on 1000 randomly sampled names per epoch
                loss = 0
                name, sexy = random.choice(train_set)
                optimizer.zero_grad()
                hidden = torch.zeros(1, 1, hidden_dim, device=device)
                # For the name "kate", "kate" is the input sequence and "ate!" the target:
                # each character is fed into the network in turn and the per-step losses are summed.
                for x, y in zip(list(name), list(name[1:] + '!')):
                    name_tensor = torch.tensor([name2input(x)], dtype=torch.float, device=device)
                    sexy_tensor = torch.tensor([sexy2input(sexy)], dtype=torch.float, device=device)
                    target_tensor = torch.tensor(name2target(y), dtype=torch.long, device=device)
                    pred, hidden = self.model(sexy_tensor, name_tensor, hidden)
                    loss += loss_func(pred, target_tensor)
                loss.backward()
                optimizer.step()
                total_loss += loss / (len(name) - 1)
            print("Training: in epoch {} loss {}".format(epoch, total_loss/1000))
    def sample(self, sexy, start):
        max_len = 8
        result = []
        with torch.no_grad():
            hidden = None
            # feed the prefix through the network to build up the hidden state
            for c in start:
                sexy_tensor = torch.tensor([sexy2input(sexy)], dtype=torch.float, device=device)
                name_tensor = torch.tensor([name2input(c)], dtype=torch.float, device=device)
                pred, hidden = self.model(sexy_tensor, name_tensor, hidden)
            c = start[-1]
            # then generate one character at a time until the end marker '!' appears
            while c != '!':
                sexy_tensor = torch.tensor([sexy2input(sexy)], dtype=torch.float, device=device)
                name_tensor = torch.tensor([name2input(c)], dtype=torch.float, device=device)
                pred, hidden = self.model(sexy_tensor, name_tensor, hidden)
                topv, topi = pred.topk(1)
                c = chars[topi.item()]
                # c = chars[torch.argmax(pred)]
                result.append(c)
                if len(result) > max_len:
                    break
        return start + "".join(result[:-1])
if __name__ == "__main__":
    model = Model(10)
    data_set = load_data()
    model.train(data_set)
    print(model.sample(0, "ka"))

    c = input('please input name prefix: ')
    while c != 'q':
        print(model.sample(1, c))
        print(model.sample(0, c))
        c = input('please input name prefix: ')
5 Summary
These two experiments show that deep learning's strong ability to fit data can produce fairly good classification and generation results. It is also clear, however, that deep learning does not understand human text and has no real creative ability: so-called poem-generating or painting networks merely try to make the distribution of the generated content resemble that of the training samples, while understanding and inference remain beyond the machine.
That concludes this example of CharRNN-based text classification and generation in PyTorch. I hope it serves as a useful reference; thanks for supporting 腳本之家.