
[Python Artificial Intelligence] 28. A Comprehensive Summary of Chinese Text Classification with Keras Deep Learning (CNN, TextCNN, BiLSTM, Attention)

Reader submission 1011 2025-03-31

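The results of the experiments are compared in the figure below:

Note that the code in this article was developed with a GPU and PyCharm. If your machine only has a CPU, simply comment out the GPU-related statements. This article only runs a simple comparison; it does not tune parameters, analyze the causes of the results, or pursue detailed improvements. Later articles will cover optimization, parameter selection, and experimental evaluation.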

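Table of Contents

I. Overview of Text Classification

II. Data Preprocessing and Word Segmentation

III. CNN for Chinese Text Classification

1. Principles

2. Implementation

IV. TextCNN for Chinese Text Classification

1. Principles

2. Implementation

V. LSTM for Chinese Text Classification

1. Principles

2. Implementation

VI. BiLSTM for Chinese Text Classification

1. Principles

2. Implementation

VII. BiLSTM+Attention for Chinese Text Classification

1. Principles

2. Implementation

VIII. Summary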

Keras: https://github.com/eastmountyxz/AI-for-Keras

TensorFlow: https://github.com/eastmountyxz/AI-for-TensorFlow

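Earlier articles in this series on the Huawei Cloud community:

[Python Artificial Intelligence] 1. TensorFlow 2.0 environment setup and an introduction to neural networks

[Python Artificial Intelligence] 2. TensorFlow basics and a univariate linear prediction example

[Python Artificial Intelligence] 3. TensorFlow basics: Session, variables, placeholders, and activation functions

[Python Artificial Intelligence] 4. Building a regression neural network in TensorFlow and the Optimizer

[Python Artificial Intelligence] 5. Basic TensorBoard visualization and drawing the whole neural network

[Python Artificial Intelligence] 6. Classification learning in TensorFlow and the MNIST handwritten digit recognition example

[Python Artificial Intelligence] 7. What overfitting is and how dropout addresses overfitting in neural networks

[Python Artificial Intelligence] 8. Convolutional neural network (CNN) principles in detail and writing a CNN in TensorFlow

[Python Artificial Intelligence] 9. Installing gensim Word2Vec word vectors and computing Chinese short-text similarity on 《慶余年》

[Python Artificial Intelligence] 10. Custom image classification with a CNN using TensorFlow + OpenCV, compared with KNN image classification

[Python Artificial Intelligence] 11. How to save neural network parameters in TensorFlow

[Python Artificial Intelligence] 12. Recurrent neural network (RNN) and LSTM principles in detail and writing an RNN classification example in TensorFlow

[Python Artificial Intelligence] 13. How to evaluate a neural network, plotting loss curves, and computing the F-score for an image classification example

[Python Artificial Intelligence] 14. LSTM RNN regression example: sine curve prediction丨【百變AI秀】

[Python Artificial Intelligence] 15. Unsupervised learning: Autoencoder principles and a clustering visualization example in detail

[Python Artificial Intelligence] 16. Keras environment setup, basics, and a regression neural network example

[Python Artificial Intelligence] 17. Building a classification neural network in Keras and the MNIST digit image example

[Python Artificial Intelligence] 18. Building a convolutional neural network in Keras and CNN principles in detail

[Python Artificial Intelligence] 19. Building an RNN classification example in Keras and RNN principles in detail

[Python Artificial Intelligence] 20. Text classification with Keras + RNN vs. text classification with traditional machine learning

[Python Artificial Intelligence] 21. Word2Vec + CNN Chinese text classification in detail, compared with machine learning algorithms

[Python Artificial Intelligence] 22. Sentiment analysis and emotion computation based on the Dalian University of Technology sentiment lexicon

[Python Artificial Intelligence] 23. Sentiment classification based on machine learning and TF-IDF (with detailed NLP data cleaning)

[Python Artificial Intelligence] 24. Setting up a Keras environment on the 易學智能 GPU platform for LSTM malicious URL request classification

[Python Artificial Intelligence] 26. Medical named entity recognition with BiLSTM-CRF (Part 1): data preprocessing

[Python Artificial Intelligence] 27. Medical named entity recognition with BiLSTM-CRF (Part 2): model construction

[Python Artificial Intelligence] 28. A Comprehensive Summary of Chinese Text Classification with Keras Deep Learning (CNN, TextCNN, BiLSTM, Attention)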

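I. Overview of Text Classification

Text classification aims to automatically assign labels to a collection of texts according to a given classification scheme or standard; it is a form of automatic classification based on a classification system. It dates back to the 1950s, when it relied mainly on rules defined by experts. The 1980s saw expert systems built with knowledge engineering. From the 1990s, machine learning methods took over, classifying text with hand-crafted feature engineering and shallow classification models. Today, word vectors and deep neural networks are most commonly used.

Niu Yafeng (牛亞峰) summarizes the traditional text classification pipeline as shown in the figure below. In traditional text classification, most machine learning methods have been applied to the task. They mainly include: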

Naive Bayes

KNN

SVM

Ensemble methods

Maximum entropy

Neural networks

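The basic workflow for text classification with the Keras framework is as follows:

Step 1: Preprocess the text: word segmentation -> stop-word removal -> select the top n words by frequency as feature words

Step 2: Generate an ID for each feature word

Step 3: Convert each text into a sequence of IDs and pad it on the left

Step 4: Shuffle the training set

Step 5: Use an Embedding layer to turn words into word vectors

Step 6: Add the model and build the neural network structure

Step 7: Train the model

Step 8: Compute precision, recall, and F1 score

Note that if TF-IDF is used for document representation instead of word vectors, the TF-IDF matrix is generated right after segmentation and stop-word removal and is fed to the model directly.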
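Steps 1 to 3 above can be sketched without any framework. The following is a minimal illustration in plain Python, not the article's code: the helper names `build_vocab` and `to_padded_ids` and the toy documents are assumptions made for this example. ID 0 is reserved for padding, and sequences are padded and truncated on the left, which is also the default behavior of Keras `pad_sequences`.

```python
from collections import Counter

def build_vocab(tokenized_docs, top_n):
    """Assign IDs 1..top_n to the most frequent words; 0 is reserved for padding."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(top_n), start=1)}

def to_padded_ids(doc, vocab, max_len):
    """Map words to IDs (dropping unknown words), then left-truncate and left-pad."""
    ids = [vocab[w] for w in doc if w in vocab]
    ids = ids[-max_len:]                      # keep only the last max_len IDs
    return [0] * (max_len - len(ids)) + ids   # pad with zeros on the left

# Toy, already segmented documents (hypothetical data)
docs = [["stock", "rises", "stock"],
        ["team", "wins", "match", "team", "stock"]]
vocab = build_vocab(docs, top_n=2)
padded = [to_padded_ids(d, vocab, max_len=4) for d in docs]
print(vocab)
print(padded)
```

Words outside the top n are simply dropped here, which is one simple way to mimic a tokenizer with a capped vocabulary.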

Deep learning methods for text classification include:

Convolutional neural network (TextCNN)

Recurrent neural network (TextRNN)

TextRNN+Attention

TextRCNN (TextRNN+CNN)

BiLSTM+Attention

Transfer learning

Recommended article by Niu Yafeng:

Text classification based on word2vec and CNN: survey & practice

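II. Data Preprocessing and Word Segmentation

This article is mainly code-oriented; the algorithm theory is covered in earlier articles and will be supplemented in later ones. The dataset is shown in the figure below:

Training set: news_dataset_train.csv

Games (10000), Sports (10000), Culture (10000), Finance (10000)

Test set: news_dataset_test.csv

Games (5000), Sports (5000), Culture (5000), Finance (5000)

Validation set: news_dataset_val.csv

Games (5000), Sports (5000), Culture (5000), Finance (5000)

First, the Chinese text needs to be segmented into words, which is done with the Jieba library. The code is as follows: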

data_preprocess.py

# -*- coding:utf-8 -*-
# By:Eastmount CSDN 2021-03-19
import csv
import jieba

# Load the custom dictionary and the stop-word list (one word per line)
jieba.load_userdict("user_dict.txt")
with open('stop_words.txt', 'r', encoding='utf-8') as f:
    stop_list = set(line.strip() for line in f if line.strip())

#-----------------------------------------------------------------------
# Jieba segmentation: cut a sentence and drop stop words
def txt_cut(juzi):
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

#-----------------------------------------------------------------------
# Segment every row of a CSV file and write the result to a new CSV file
def fenci(filename, result):
    labels = []
    contents = []
    # The output is written as gb18030; re-save it as UTF-8 (or pass
    # encoding='gb18030' to pandas.read_csv) before loading it with pandas
    with open(result, "w", newline='', encoding='gb18030') as fw, \
         open(filename, "r", encoding="UTF-8") as f:
        writer = csv.writer(fw)
        writer.writerow(['label', 'cutword'])
        # Read the input rows with csv.DictReader
        reader = csv.DictReader(f)
        for row in reader:
            labels.append(row['label'])
            content = row['content']
            # Chinese word segmentation
            seglist = txt_cut(content)
            # Join the tokens with spaces
            output = ' '.join(seglist)
            contents.append(output)
            # Write one row: label, segmented text
            writer.writerow([row['label'], output])
    print(labels[:5])
    print(contents[:5])

#-----------------------------------------------------------------------
# Main
if __name__ == '__main__':
    fenci("news_dataset_train.csv", "news_dataset_train_fc.csv")
    fenci("news_dataset_test.csv", "news_dataset_test_fc.csv")
    fenci("news_dataset_val.csv", "news_dataset_val_fc.csv")

The output is shown in the figure below:

Next, we take a quick look at the length distribution of the data and visualize the labels.

data_show.py

# -*- coding: utf-8 -*-
"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
"""
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#------------------------------ Step 1: read the data ------------------------------
## Read the datasets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")
print(train_df.head())

## Make Chinese text render correctly in plots
plt.rcParams['font.sans-serif'] = ['KaiTi']   # default font (SimHei is another option)
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly in saved figures

## Show which labels the training set contains
plt.figure()
sns.countplot(x=train_df.label)
plt.xlabel('Label', size=10)
plt.xticks(size=10)
plt.show()

## Distribution of the number of tokens per document in the training set
## (data_preprocess.py does not write a cutwordnum column, so derive it here)
train_df['cutwordnum'] = train_df.cutword.apply(lambda x: len(str(x).split()))
print(train_df.cutwordnum.describe())
plt.figure()
plt.hist(train_df.cutwordnum, bins=100)
plt.xlabel("詞組長度", size=12)
plt.ylabel("頻數", size=12)
plt.title("訓練數據集")
plt.show()

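The output is shown in the figure below. Later articles will explain how to draw polished charts for papers.

Note: if you get the error “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 17: invalid continuation byte”, save the CSV file in UTF-8 format, as shown in the figure below.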
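If you would rather not re-save the file by hand, the conversion can also be scripted. The sketch below uses only the standard library and assumes the file was written as gb18030, as in data_preprocess.py; the function name `reencode_to_utf8` and the demo file names are made up for this illustration. Alternatively, `pandas.read_csv` accepts an `encoding` argument, so passing `encoding='gb18030'` avoids the conversion altogether.

```python
import os
import tempfile

def reencode_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Rewrite a text file as UTF-8, decoding the source with src_encoding."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Small demonstration on a temporary file (hypothetical content)
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "demo_gb18030.csv")
dst_file = os.path.join(tmp_dir, "demo_utf8.csv")
with open(src_file, "w", encoding="gb18030", newline="") as f:
    f.write("label,cutword\n体育,球队 赢得 比赛\n")

reencode_to_utf8(src_file, dst_file)
with open(dst_file, "r", encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip)
```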

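III. CNN for Chinese Text Classification

1. Principles

Convolutional neural networks (CNNs) are a class of feedforward neural networks that involve convolution operations and have a deep structure; they are one of the representative algorithms of deep learning. They are commonly applied to image recognition and speech recognition, where they give strong results, and can also be applied to video analysis, machine translation, natural language processing, drug discovery, and other fields. The well-known AlphaGo, which enabled a computer to read the Go board, is based on convolutional neural networks.

Convolution means that instead of processing each pixel individually, the network processes regions of the image. This strengthens the continuity of the image: what the network sees is a shape rather than a single point, which deepens its understanding of the image.

A CNN typically goes through the sequence “image -> convolution -> pooling -> convolution -> pooling -> two fully connected layers -> classifier” to perform classification.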
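To make the “convolution -> pooling” part of this pipeline concrete, here is a minimal one-dimensional sketch in plain Python. It is only an illustration under simplifying assumptions: a single hand-picked kernel, no padding (the Keras code later uses padding='same'), and toy input values. The function names are hypothetical.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence: one output per window position."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

def relu(values):
    """Keep positive responses and zero out the rest."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, pool_size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0, 0.0, 2.0]
kernel = [1.0, 0.0, -1.0]   # responds when the value drops across a 3-element window
feature_map = relu(conv1d(signal, kernel))
pooled = max_pool1d(feature_map, pool_size=2)
print(feature_map)
print(pooled)
```

Each output of `conv1d` summarizes a region of three neighboring values rather than a single value, and pooling then keeps only the strongest response in each pair.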

2. Implementation

The Keras CNN code for text classification is as follows:

Keras_CNN_cnews.py

# -*- coding: utf-8 -*-
"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
CNN Model
"""
import os
import time
import pickle
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential

## GPU setup (TensorFlow 1.x API); comment this block out on a CPU-only machine
## Cap the GPU memory each process may use: 0.8 means up to 80% of GPU memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

start = time.perf_counter()   # time.clock() was removed in Python 3.8

#---------------------------- Step 1: read the data ----------------------------
## Read the datasets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")
print(train_df.head())

## Make Chinese text render correctly in plots
plt.rcParams['font.sans-serif'] = ['KaiTi']   # default font (SimHei is another option)
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly in saved figures

#-------------------------- Step 2: OneHotEncoder() encoding --------------------
## Encode the label column as integers
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
print("Label:")
print(train_y[:10])

le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))

## One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#----------------------- Step 3: encode tokens with Tokenizer --------------------
max_words = 6000
max_len = 600
tok = Tokenizer(num_words=max_words)   # keep at most 6000 words
print(train_df.cutword[:5])
print(type(train_df.cutword))

## Cast to str in case the corpus contains purely numeric entries
train_content = [str(a) for a in train_df.cutword.tolist()]
val_content = [str(a) for a in val_df.cutword.tolist()]
test_content = [str(a) for a in test_df.cutword.tolist()]
tok.fit_on_texts(train_content)
print(tok)
# After the Tokenizer is created, fit_on_texts() learns every word
#tok.fit_on_texts(train_df.cutword)

## Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:   # saving
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:   # loading
    tok = pickle.load(handle)

## word_index maps each word to its integer code
## word_counts maps each word to its frequency
for ii, iterm in enumerate(tok.word_index.items()):
    if ii < 10:
        print(iterm)
    else:
        break
print("===================")
for ii, iterm in enumerate(tok.word_counts.items()):
    if ii < 10:
        print(iterm)
    else:
        break

#--------------------------- Step 4: convert text to sequences -----------------------------
## Once every word has a code, each word in a news item is replaced by its code,
## so each news item becomes a vector of integers
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

## Pad or truncate every sequence to the same length with sequence.pad_sequences()
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
print("數據轉換序列")
print(train_seq_mat.shape)
print(val_seq_mat.shape)
print(test_seq_mat.shape)
print(train_seq_mat[:2])

#------------------------------- Step 5: build the CNN model --------------------------
## Four classes
num_labels = 4
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
## Embedding layer: randomly initialized and frozen (trainable=False);
## no pretrained word vectors are loaded in this script
layer = Embedding(max_words+1, 128, input_length=max_len, trainable=False)(inputs)
## Convolution and pooling layers (window size 3, 128 filters)
cnn = Convolution1D(128, 3, padding='same', strides=1, activation='relu')(layer)
cnn = MaxPool1D(pool_size=4)(cnn)
## Dropout to reduce overfitting
flat = Flatten()(cnn)
drop = Dropout(0.3)(flat)
## Fully connected output layer
main_output = Dense(num_labels, activation='softmax')(drop)
model = Model(inputs=inputs, outputs=main_output)

## Loss, optimizer, and metric
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',      # RMSprop()
              metrics=["accuracy"])

#------------------------------- Step 6: train and predict --------------------------
## Set flag to "train" first, then to "test"
flag = "train"
if flag == "train":
    print("模型訓練")
    ## Stop training when val_loss stops improving (min_delta=0.0001)
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)]
                          )
    ## Save the model
    model.save('my_model.h5')
    del model  # deletes the existing model
    ## Elapsed time
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("模型預測")
    ## Load the trained model
    model = load_model('my_model.h5')
    ## Predict on the test set
    test_pre = model.predict(test_seq_mat)
    ## Evaluate the predictions: confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1), np.argmax(test_pre,axis=1))
    print(confm)

    ## Visualize the confusion matrix
    Labname = ["體育", "文化", "財經", "游戲"]
    print(metrics.classification_report(np.argmax(test_y,axis=1), np.argmax(test_pre,axis=1)))
    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(4)+0.5, Labname, size=12)
    plt.yticks(np.arange(4)+0.5, Labname, size=12)
    plt.savefig('result.png')
    plt.show()

    #---------------------------------- Step 7: validate --------------------------
    ## Re-encode the validation set with tok and predict with the trained model
    val_seq = tok.texts_to_sequences(val_content)
    ## Pad or truncate every sequence to the same length
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    ## Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y,axis=1), np.argmax(val_pre,axis=1)))
    ## Elapsed time
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)

The GPU run is shown in the figure below. Note that if your machine is CPU-only, you just need to comment out the first part of the code above (the GPU setup); the LSTM sections later use GPU-specific library functions.

The model printed during training is shown in the figure below:

The training output is as follows:

模型訓練
Train on 40000 samples, validate on 20000 samples
Epoch 1/10
40000/40000 [==============================] - 15s 371us/step - loss: 1.1798 - acc: 0.4772 - val_loss: 0.9878 - val_acc: 0.5977
Epoch 2/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.8681 - acc: 0.6612 - val_loss: 0.8167 - val_acc: 0.6746
Epoch 3/10
40000/40000 [==============================] - 4s 92us/step - loss: 0.7268 - acc: 0.7245 - val_loss: 0.7084 - val_acc: 0.7330
Epoch 4/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.6369 - acc: 0.7643 - val_loss: 0.6462 - val_acc: 0.7617
Epoch 5/10
40000/40000 [==============================] - 4s 96us/step - loss: 0.5670 - acc: 0.7957 - val_loss: 0.5895 - val_acc: 0.7867
Epoch 6/10
40000/40000 [==============================] - 4s 92us/step - loss: 0.5074 - acc: 0.8226 - val_loss: 0.5530 - val_acc: 0.8018
Epoch 7/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.4638 - acc: 0.8388 - val_loss: 0.5105 - val_acc: 0.8185
Epoch 8/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.4241 - acc: 0.8545 - val_loss: 0.4836 - val_acc: 0.8304
Epoch 9/10
40000/40000 [==============================] - 4s 92us/step - loss: 0.3900 - acc: 0.8692 - val_loss: 0.4599 - val_acc: 0.8403
Epoch 10/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.3657 - acc: 0.8761 - val_loss: 0.4472 - val_acc: 0.8457
Time used: 52.203992899999996

The prediction and validation results are as follows:

[[3928  472  264  336]
 [ 115 4529  121  235]
 [ 151  340 4279  230]
 [ 145  593  195 4067]]
             precision    recall  f1-score   support

          0       0.91      0.79      0.84      5000
          1       0.76      0.91      0.83      5000
          2       0.88      0.86      0.87      5000
          3       0.84      0.81      0.82      5000

avg / total       0.85      0.84      0.84     20000

             precision    recall  f1-score   support

          0       0.90      0.77      0.83      5000
          1       0.78      0.92      0.84      5000
          2       0.88      0.85      0.86      5000
          3       0.84      0.85      0.85      5000

avg / total       0.85      0.85      0.85     20000
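The first report above can be reproduced by hand from the confusion matrix, which is a useful check on how the two relate. The sketch below is plain Python rather than scikit-learn; it assumes the scikit-learn convention that rows are true labels and columns are predicted labels, and the helper name `per_class_metrics` is made up for this example.

```python
# Confusion matrix of the CNN model on the test set, copied from the output above
confm = [
    [3928, 472, 264, 336],
    [115, 4529, 121, 235],
    [151, 340, 4279, 230],
    [145, 593, 195, 4067],
]

def per_class_metrics(matrix):
    """Return (precision, recall, f1) lists; rows are true labels, columns predictions."""
    n = len(matrix)
    precision, recall, f1 = [], [], []
    for c in range(n):
        tp = matrix[c][c]
        predicted_c = sum(matrix[r][c] for r in range(n))   # column sum
        actual_c = sum(matrix[c])                            # row sum
        p = tp / predicted_c
        r = tp / actual_c
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r))
    return precision, recall, f1

precision, recall, f1 = per_class_metrics(confm)
accuracy = sum(confm[i][i] for i in range(4)) / sum(map(sum, confm))
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
print([round(f, 2) for f in f1])
print(accuracy)
```

For class 0, for example, recall is 3928 / 5000 and precision is 3928 / (3928 + 115 + 151 + 145), which is why that class shows high precision but comparatively low recall.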

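IV. TextCNN for Chinese Text Classification

1. Principles

TextCNN is an algorithm that uses convolutional neural networks to classify text. It was proposed by Yoon Kim in the 2014 paper “Convolutional Neural Networks for Sentence Classification”.

The core idea of a CNN is to capture local features. For text, a local feature is a sliding window made up of several words, similar to an N-gram. The advantage of a CNN is that it can automatically combine and filter N-gram features to obtain semantic information at different levels of abstraction. The figure below shows the CNN architecture used for text classification in that paper.

Another classic TextCNN paper is “A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification”; its model structure is shown in the figure below. It describes the TextCNN structure used for text classification and explains in detail the architecture and how convolution is performed over the word-vector matrix.

Suppose we have some sentences to classify. Each word in a sentence is represented by an n-dimensional word vector, so the input matrix has size m*n, where m is the sentence length. The CNN convolves over the input sample. For text data, the filter no longer slides horizontally; it only moves downward, much like an N-gram extracting local correlations between words.

The figure uses three window sizes (the number of words each filter spans), namely 2, 3, and 4, with two filters per window size (in practice many more filters are used). Applying the different filters over the different word windows yields 6 convolved vectors. Max pooling is then applied to each vector and the pooled values are concatenated to give the feature representation of the sentence. This sentence vector is passed to the classifier, which completes the text classification pipeline.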
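The procedure just described (several window sizes, max pooling over each feature map, concatenation) can be sketched in plain Python. This is a toy illustration only: the word vectors and filter weights are hand-picked rather than trained, and the function names are assumptions for this example.

```python
def conv_over_words(sentence, filt):
    """Slide a filter of shape (window, dim) down the sentence matrix, one word at a time."""
    window = len(filt)
    outputs = []
    for start in range(len(sentence) - window + 1):
        total = 0.0
        for offset in range(window):
            for d, weight in enumerate(filt[offset]):
                total += sentence[start + offset][d] * weight
        outputs.append(max(0.0, total))   # ReLU activation
    return outputs

def textcnn_features(sentence, filters_by_window):
    """Max-pool every feature map over time and concatenate the pooled values."""
    features = []
    for window in sorted(filters_by_window):
        for filt in filters_by_window[window]:
            features.append(max(conv_over_words(sentence, filt)))
    return features

# A 5-word sentence with 2-dimensional word vectors (toy values, not trained)
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# Two hypothetical filters for each window size 2, 3, and 4
filters_by_window = {
    w: [[[1.0, 0.0]] * w, [[0.0, 1.0]] * w] for w in (2, 3, 4)
}
features = textcnn_features(sentence, filters_by_window)
print(features)   # six values: one per filter
```

Note that the Keras script in this article uses MaxPool1D(pool_size=4) followed by Flatten rather than pooling each feature map down to a single value, so its sentence representation is much larger than the six values produced here.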

Finally, I sincerely recommend the following introductions to TextCNN, especially those by Asia-Lee on CSDN, whose articles I like very much.

《Convolutional Neural Networks for Sentence Classification》 (2014)

《A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification》

https://blog.csdn.net/asialee_bird/article/details/88813385

https://zhuanlan.zhihu.com/p/77634533

2. Implementation

The Keras TextCNN code for text classification is as follows:

Keras_TextCNN_cnews.py

# -*- coding: utf-8 -*-
"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
TextCNN Model
"""
import os
import time
import pickle
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers import concatenate

## GPU setup (TensorFlow 1.x API); comment this block out on a CPU-only machine
## Cap the GPU memory each process may use: 0.8 means up to 80% of GPU memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

start = time.perf_counter()   # time.clock() was removed in Python 3.8

#---------------------------- Step 1: read the data ----------------------------
## Read the datasets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")

## Make Chinese text render correctly in plots
plt.rcParams['font.sans-serif'] = ['KaiTi']   # default font (SimHei is another option)
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly in saved figures

#-------------------------- Step 2: OneHotEncoder() encoding --------------------
## Encode the label column as integers
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
print("Label:")
print(train_y[:10])

le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))

## One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#----------------------- Step 3: encode tokens with Tokenizer --------------------
max_words = 6000
max_len = 600
tok = Tokenizer(num_words=max_words)   # keep at most 6000 words
print(train_df.cutword[:5])
print(type(train_df.cutword))

## Cast to str in case the corpus contains purely numeric entries
train_content = [str(a) for a in train_df.cutword.tolist()]
val_content = [str(a) for a in val_df.cutword.tolist()]
test_content = [str(a) for a in test_df.cutword.tolist()]
tok.fit_on_texts(train_content)
print(tok)

## Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:   # saving
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:   # loading
    tok = pickle.load(handle)

#--------------------------- Step 4: convert text to sequences -----------------------------
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

## Pad or truncate every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
print("數據轉換序列")
print(train_seq_mat.shape)
print(val_seq_mat.shape)
print(test_seq_mat.shape)
print(train_seq_mat[:2])

#------------------------------- Step 5: build the TextCNN model --------------------------
## Four classes
num_labels = 4
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
## Embedding layer: randomly initialized and frozen (trainable=False);
## no pretrained word vectors are loaded in this script
layer = Embedding(max_words+1, 256, input_length=max_len, trainable=False)(inputs)
## Window sizes 3, 4, and 5
cnn1 = Convolution1D(256, 3, padding='same', strides=1, activation='relu')(layer)
cnn1 = MaxPool1D(pool_size=4)(cnn1)
cnn2 = Convolution1D(256, 4, padding='same', strides=1, activation='relu')(layer)
cnn2 = MaxPool1D(pool_size=4)(cnn2)
cnn3 = Convolution1D(256, 5, padding='same', strides=1, activation='relu')(layer)
cnn3 = MaxPool1D(pool_size=4)(cnn3)

# Concatenate the outputs of the three branches
cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
flat = Flatten()(cnn)
drop = Dropout(0.2)(flat)
main_output = Dense(num_labels, activation='softmax')(drop)
model = Model(inputs=inputs, outputs=main_output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',      # RMSprop()
              metrics=["accuracy"])

#------------------------------- Step 6: train and predict --------------------------
## Set flag to "train" first, then to "test"
flag = "train"
if flag == "train":
    print("模型訓練")
    ## Stop training when val_loss stops improving (min_delta=0.0001)
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)]
                          )
    model.save('my_model.h5')
    del model
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("模型預測")
    ## Load the trained model
    model = load_model('my_model.h5')
    ## Predict on the test set
    test_pre = model.predict(test_seq_mat)
    ## Evaluate the predictions: confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1), np.argmax(test_pre,axis=1))
    print(confm)

    ## Visualize the confusion matrix
    Labname = ["體育", "文化", "財經", "游戲"]
    print(metrics.classification_report(np.argmax(test_y,axis=1), np.argmax(test_pre,axis=1)))
    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(4)+0.5, Labname, size=12)
    plt.yticks(np.arange(4)+0.5, Labname, size=12)
    plt.savefig('result.png')
    plt.show()

    #---------------------------------- Step 7: validate --------------------------
    ## Re-encode the validation set with tok and predict with the trained model
    val_seq = tok.texts_to_sequences(val_content)
    ## Pad or truncate every sequence to the same length
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    ## Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y,axis=1), np.argmax(val_pre,axis=1)))
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)

The model summary printed during training is as follows:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
inputs (InputLayer)             (None, 600)          0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 600, 256)     1536256     inputs[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 600, 256)     196864      embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 600, 256)     262400      embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 600, 256)     327936      embedding_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D)  (None, 150, 256)     0           conv1d_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D)  (None, 150, 256)     0           conv1d_2[0][0]
__________________________________________________________________________________________________
max_pooling1d_3 (MaxPooling1D)  (None, 150, 256)     0           conv1d_3[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 150, 768)     0           max_pooling1d_1[0][0]
                                                                 max_pooling1d_2[0][0]
                                                                 max_pooling1d_3[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 115200)       0           concatenate_1[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 115200)       0           flatten_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 4)            460804      dropout_1[0][0]
==================================================================================================
Total params: 2,784,260
Trainable params: 1,248,004
Non-trainable params: 1,536,256
__________________________________________________________________________________________________
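The parameter counts in this summary can be checked with a little arithmetic, which also shows where the non-trainable parameters come from. The sketch below is plain Python; the helper function names are made up for this illustration, and the constants are taken from the TextCNN script above.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return in_features * units + units

max_words, max_len, dim, filters, num_labels = 6000, 600, 256, 256, 4

embedding = embedding_params(max_words + 1, dim)
convs = [conv1d_params(k, dim, filters) for k in (3, 4, 5)]
flat_size = (max_len // 4) * (3 * filters)   # after MaxPool1D(4), concatenation, Flatten
dense = dense_params(flat_size, num_labels)

trainable = sum(convs) + dense               # the embedding is frozen (trainable=False)
total = embedding + trainable
print(embedding, convs, flat_size, dense)
print(total, trainable)
```

The 1,536,256 non-trainable parameters are exactly the frozen Embedding layer, so only the convolution and output layers are updated during training.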

The prediction results are as follows:

[[4448  238  182  132]
 [ 151 4572  124  153]
 [ 185  176 4545   94]
 [ 181  394  207 4218]]
             precision    recall  f1-score   support

          0       0.90      0.89      0.89      5000
          1       0.85      0.91      0.88      5000
          2       0.90      0.91      0.90      5000
          3       0.92      0.84      0.88      5000

avg / total       0.89      0.89      0.89     20000

             precision    recall  f1-score   support

          0       0.90      0.88      0.89      5000
          1       0.86      0.93      0.89      5000
          2       0.91      0.89      0.90      5000
          3       0.92      0.88      0.90      5000

avg / total       0.90      0.90      0.90     20000

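V. LSTM for Chinese Text Classification

1. Principles

The Long Short-Term Memory network (LSTM) is a special type of RNN (Recurrent Neural Network) that can learn long-term dependencies. LSTM was proposed by Hochreiter & Schmidhuber (1997) and was later refined and popularized by Alex Graves. It has been very successful on many problems and is widely used.

Because RNNs suffer from vanishing gradients, the hidden structure at sequence position t was redesigned, using a few techniques to make it more complex and thereby mitigate the vanishing gradient problem. This special kind of RNN is the LSTM. Thanks to its design, LSTM is well suited to modeling sequential data such as text. The structure of an LSTM is shown in the figure below:

LSTM is deliberately designed to avoid the long-term dependency problem. Remembering information for long periods is practically its default behavior, not an ability acquired at great cost. LSTM improves on the ordinary RNN by adding three gates:

Input gate

Output gate

Forget gate

There is an additional main line on the left, like the main plot of a film, while the original RNN structure becomes the side plot; all three gates sit on the side line.

Input gate (write gate): a gate placed at the input that decides whether to write the current input into the Memory. It acts as a parameter that can be trained, and it controls whether the current point is remembered.

Output gate (read gate): a gate at the output position that decides whether to read the current Memory.

Forget gate: a gate at the processing position that decides whether to forget the previous Memory.

LSTM works as follows: if the side-line content is important to the final result, the input gate writes it into the main line in proportion to its importance for further analysis. If the side-line content changes our earlier view, the forget gate discards some of the main-line content and replaces it proportionally with the new content, so the update of the main line depends on the input and forget gates. The final output is based on both the main-line and side-line content. These three gates give fine control over the RNN, and with these mechanisms the LSTM is a good remedy for the RNN's fading memory, leading to better results.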
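The interplay of the three gates can be seen in a single step of a scalar LSTM cell. The following is a minimal sketch in plain Python, assuming the standard LSTM update equations; the function `lstm_step`, the parameter layout, and the hand-picked weights are illustrative assumptions, not part of the Keras code in this article.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old memory to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the memory to expose
    g = gate("candidate", math.tanh)   # new content proposed by the side line
    c = f * c_prev + i * g             # update of the main line (cell state)
    h = o * math.tanh(c)               # output read from the main line
    return h, c

# Hand-picked weights: large biases push a gate almost fully open or closed
keep = {"forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
        "output": (0.0, 0.0, 20.0), "candidate": (1.0, 0.0, 0.0)}
overwrite = dict(keep, forget=(0.0, 0.0, -20.0), input=(0.0, 0.0, 20.0))

_, c_keep = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, params=keep)
_, c_new = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, params=overwrite)
print(c_keep)   # stays close to the old memory 0.7
print(c_new)    # close to tanh(5.0), the new candidate
```

With the forget gate open and the input gate closed, the main line carries the old memory through unchanged; with the settings reversed, the old memory is discarded and replaced by the new content.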

2. Implementation

The Keras LSTM code for text classification is as follows:

Keras_LSTM_cnews.py

"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
LSTM Model
"""
import os
import time
import pickle
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential

# GPU acceleration: CuDNNLSTM is faster than LSTM
from keras.layers import CuDNNLSTM, CuDNNGRU

## GPU setup (TensorFlow 1.x API); comment this block out on a CPU-only machine
## Cap the GPU memory each process may use: 0.8 means up to 80% of GPU memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

start = time.perf_counter()   # time.clock() was removed in Python 3.8

#---------------------------- Step 1: read the data ----------------------------
## Read the datasets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")
print(train_df.head())

## Make Chinese text render correctly in plots
plt.rcParams['font.sans-serif'] = ['KaiTi']   # default font (SimHei is another option)
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly in saved figures

#-------------------------- Step 2: OneHotEncoder() encoding --------------------
## Encode the label column as integers
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
print("Label:")
print(train_y[:10])

le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))

## One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#----------------------- Step 3: encode tokens with Tokenizer --------------------
max_words = 6000
max_len = 600
tok = Tokenizer(num_words=max_words)   # keep at most 6000 words
print(train_df.cutword[:5])
print(type(train_df.cutword))

## Cast to str in case the corpus contains purely numeric entries
train_content = [str(a) for a in train_df.cutword.tolist()]
val_content = [str(a) for a in val_df.cutword.tolist()]
test_content = [str(a) for a in test_df.cutword.tolist()]
tok.fit_on_texts(train_content)
print(tok)

## Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:   # saving
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:   # loading
    tok = pickle.load(handle)

#--------------------------- Step 4: convert text to sequences -----------------------------
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

## Pad or truncate every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
print("數據轉換序列")
print(train_seq_mat.shape)
print(val_seq_mat.shape)
print(test_seq_mat.shape)
print(train_seq_mat[:2])

#------------------------------- Step 5: build the LSTM model --------------------------
## Define the LSTM model
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
## Embedding(vocabulary size, embedding dimension, number of tokens per news item)
layer = Embedding(max_words+1, 128, input_length=max_len)(inputs)
#layer = LSTM(128)(layer)        # CPU alternative
layer = CuDNNLSTM(128)(layer)
layer = Dense(128, activation="relu", name="FC1")(layer)
layer = Dropout(0.1)(layer)
layer = Dense(4, activation="softmax", name="FC2")(layer)
model = Model(inputs=inputs, outputs=layer)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',      # RMSprop()
              metrics=["accuracy"])

#------------------------------- Step 6: train and predict --------------------------
## Set flag to "train" first, then to "test"
flag = "train"
if flag == "train":
    print("模型訓練")
    ## Stop training when val_loss stops improving (min_delta=0.0001)
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)]
                          )
    model.save('my_model.h5')
    del model
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("模型預測")
    ## Load the trained model
    model = load_model('my_model.h5')
    ## Predict on the test set
    test_pre = model.predict(test_seq_mat)
    ## Evaluate the predictions: confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1), np.argmax(test_pre,axis=1))
    print(confm)

    ## Visualize the confusion matrix
    Labname = ["體育", "文化", "財經", "游戲"]
    print(metrics.classification_report(np.argmax(test_y,axis=1), np.argmax(test_pre,axis=1)))
    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    ## Ticks at +0.5 sit at the cell centers (the source used +0.8 and +0.4 here)
    plt.xticks(np.arange(4)+0.5, Labname, size=12)
    plt.yticks(np.arange(4)+0.5, Labname, size=12)
    plt.savefig('result.png')
    plt.show()

    #---------------------------------- Step 7: validate --------------------------
    ## Re-encode the validation set with tok and predict with the trained model
    val_seq = tok.texts_to_sequences(val_content)
    ## Pad or truncate every sequence to the same length
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    ## Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y,axis=1), np.argmax(val_pre,axis=1)))
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)

The model summary printed during training is as follows:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
inputs (InputLayer)          (None, 600)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 600, 128)          768128
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 128)               132096
_________________________________________________________________
FC1 (Dense)                  (None, 128)               16512
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
FC2 (Dense)                  (None, 4)                 516
=================================================================
Total params: 917,252
Trainable params: 917,252
Non-trainable params: 0
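The 132,096 parameters of the recurrent layer are worth a quick check, because the textbook LSTM formula gives a slightly smaller number. The sketch below is plain arithmetic in Python; `lstm_params` is a hypothetical helper, and reading the difference as a second set of bias vectors in CuDNNLSTM is an assumption that happens to match the summary above.

```python
def lstm_params(input_dim, units, bias_vectors=1):
    """Four gates, each with input weights, recurrent weights, and bias vector(s)."""
    return 4 * (input_dim * units + units * units + bias_vectors * units)

embedding = (6000 + 1) * 128
plain_lstm = lstm_params(128, 128, bias_vectors=1)   # textbook LSTM
cudnn_lstm = lstm_params(128, 128, bias_vectors=2)   # assumption: two bias sets per gate
fc1 = 128 * 128 + 128
fc2 = 128 * 4 + 4
total = embedding + cudnn_lstm + fc1 + fc2
print(plain_lstm, cudnn_lstm)
print(total)
```

If you switch to the commented-out LSTM(128) line on a CPU, expect the layer's parameter count to differ from the 132,096 shown here.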

The prediction results are as follows:

[[4539  153  188  120]
 [  47 4628  181  144]
 [ 113  133 4697   57]
 [ 101  292  157 4450]]
             precision    recall  f1-score   support

          0       0.95      0.91      0.93      5000
          1       0.89      0.93      0.91      5000
          2       0.90      0.94      0.92      5000
          3       0.93      0.89      0.91      5000

avg / total       0.92      0.92      0.92     20000

             precision    recall  f1-score   support

          0       0.96      0.89      0.92      5000
          1       0.89      0.94      0.92      5000
          2       0.90      0.93      0.92      5000
          3       0.94      0.92      0.93      5000

avg / total       0.92      0.92      0.92     20000

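VI. BiLSTM for Chinese Text Classification

1. Principles

BiLSTM is short for Bi-directional Long Short-Term Memory; it combines a forward LSTM with a backward LSTM. Both BiLSTM and LSTM are commonly used to model context in natural language processing tasks. For example, the figure shows the model encoding the sentence “我愛中國” (“I love China”).

Modeling a sentence with a single LSTM still has one problem: it cannot encode information from back to front. In finer-grained classification, such as a five-class task covering strongly positive, weakly positive, neutral, weakly negative, and strongly negative sentiment, the interactions among sentiment words, degree words, and negation words matter. For example, in “這個餐廳臟得不行，沒有隔壁好” (“This restaurant is unbearably dirty; it is not as good as the one next door”), “不行” modifies the degree of “臟” (dirty). A BiLSTM can better capture such bidirectional semantic dependencies.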
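The bidirectional idea itself is independent of the cell type, so it can be sketched with a much simpler recurrence. The example below is plain Python and deliberately uses a scalar tanh RNN as a stand-in for the LSTM cell; the function names and weights are assumptions for illustration only.

```python
import math

def run_rnn(inputs, weight_x, weight_h):
    """Run a scalar tanh recurrence over the inputs and return every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(weight_x * x + weight_h * h)
        states.append(h)
    return states

def bidirectional(inputs, fwd_weights, bwd_weights):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs, *fwd_weights)
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]   # re-align to input order
    return list(zip(forward, backward))

sequence = [1.0, 0.0, 0.0, -1.0]
states = bidirectional(sequence, fwd_weights=(1.0, 0.5), bwd_weights=(1.0, 0.5))
# Sentence vector: final forward state joined with final backward state
sentence_vector = (states[-1][0], states[0][1])
print(states)
print(sentence_vector)
```

At the first position the forward state has only seen the first input, while the backward state has already processed everything that follows it. That second view of the context is what a single forward LSTM lacks.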

Reference: https://zhuanlan.zhihu.com/p/47802053

2. Implementation

The Keras BiLSTM code for text classification is as follows:

Keras_BiLSTM_cnews.py

"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
BiLSTM Model
"""
import os
import time
import pickle
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential

# GPU acceleration: CuDNNLSTM is faster than LSTM
from keras.layers import CuDNNLSTM, CuDNNGRU
from keras.layers import Bidirectional

## GPU setup: comment this part out if you run on a CPU
## Cap the GPU memory each process may use: 0.8 means up to 80% of GPU memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

start = time.perf_counter()   # time.clock() was removed in Python 3.8

#---------------------------- Step 1: read the data ----------------------------
## Read the datasets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")
print(train_df.head())

## Make Chinese text render correctly in plots
plt.rcParams['font.sans-serif'] = ['KaiTi']   # default font (SimHei is an alternative)
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly in saved figures

#-------------------------- Step 2: OneHotEncoder() encoding --------------------
## Encode the labels of each dataset
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
print("Label:")
print(train_y[:10])

le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))

## One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#----------------------- Step 3: encode the words with Tokenizer --------------------
max_words = 6000
max_len = 600
tok = Tokenizer(num_words=max_words)   # vocabulary limit of 6000 words
print(train_df.cutword[:5])
print(type(train_df.cutword))

## Cast to str in case the corpus contains purely numeric entries
train_content = [str(a) for a in train_df.cutword.tolist()]
val_content = [str(a) for a in val_df.cutword.tolist()]
test_content = [str(a) for a in test_df.cutword.tolist()]
tok.fit_on_texts(train_content)
print(tok)

## Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:   # saving
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:   # loading
    tok = pickle.load(handle)

#--------------------------- Step 4: convert the data to sequences -----------------------------
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

## Pad or truncate every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len)
print("Sequences:")
print(train_seq_mat.shape)
print(val_seq_mat.shape)
print(test_seq_mat.shape)
print(train_seq_mat[:2])

#------------------------------- Step 5: build the BiLSTM model --------------------------
num_labels = 4
model = Sequential()
model.add(Embedding(max_words+1, 128, input_length=max_len))
# CuDNNLSTM only runs on a GPU; on a CPU use Bidirectional(LSTM(128)) instead
model.add(Bidirectional(CuDNNLSTM(128)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(num_labels, activation='softmax'))
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',   # RMSprop()
              metrics=["accuracy"])

#------------------------------- Step 6: train and predict --------------------------
## Set flag to "train" first, then to "test"
flag = "train"
if flag == "train":
    print("Training")
    ## Stop training once val_loss no longer improves by at least 0.0001
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                          validation_data=(val_seq_mat,val_y),
                          callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)]
                          )
    model.save('my_model.h5')
    # The original called `del model` here, which breaks step 7 below in train mode
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("Predicting")
    ## Load the trained model
    model = load_model('my_model.h5')
    ## Predict on the test set
    test_pre = model.predict(test_seq_mat)
    ## Evaluate the predictions with a confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1))
    print(confm)

    ## Visualise the confusion matrix
    Labname = ["體育", "文化", "財經", "游戲"]
    print(metrics.classification_report(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1)))
    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label',size = 14)
    plt.ylabel('Predicted label', size = 14)
    plt.xticks(np.arange(4)+0.5, Labname, size = 12)
    plt.yticks(np.arange(4)+0.5, Labname, size = 12)
    plt.savefig('result.png')
    plt.show()

#---------------------------------- Step 7: validate the model --------------------------
## Re-encode the validation set with tok and predict with the trained model
## (val_content is the str-cast list built in step 3)
val_seq = tok.texts_to_sequences(val_content)
## Pad or truncate every sequence to the same length
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
## Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y,axis=1),np.argmax(val_pre,axis=1)))
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)
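To make steps 3 and 4 of the script less of a black box, here is a minimal pure-Python sketch of what the Tokenizer and pad_sequences calls do to the text. The helper names (`fit_vocab`, `texts_to_sequences`, `pad_sequences`) are my own stand-ins, not the Keras implementation; the sketch assumes space-separated tokens and skips the lowercasing and punctuation filtering that Keras also applies.

```python
from collections import Counter


def fit_vocab(texts, num_words):
    """Rank space-separated tokens by frequency; index 0 is reserved for padding."""
    counts = Counter(tok for text in texts for tok in text.split())
    ranked = [tok for tok, _ in counts.most_common()]
    # Like Keras' Tokenizer, keep only indices strictly below num_words
    return {tok: i for i, tok in enumerate(ranked, start=1) if i < num_words}


def texts_to_sequences(texts, vocab):
    """Map each text to a list of indices; out-of-vocabulary tokens are dropped."""
    return [[vocab[t] for t in text.split() if t in vocab] for text in texts]


def pad_sequences(seqs, maxlen):
    """Left-pad with zeros and keep the last maxlen items (the Keras defaults)."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


# Toy corpus (hypothetical, already segmented into words)
texts = ["stocks rise stocks", "team wins"]
vocab = fit_vocab(texts, num_words=3)
seqs = texts_to_sequences(texts, vocab)
padded = pad_sequences(seqs, maxlen=4)
print(vocab)
print(seqs)
print(padded)
```

With `num_words=3` only the two most frequent words survive, which is why the script's Embedding layer uses `max_words+1` rows: index 0 is kept for padding.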

The model summary and training log are shown below. Training on the GPU is quick, at roughly 21 to 23 seconds per epoch.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 600, 128)          768128
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               264192
_________________________________________________________________
dense_1 (Dense)              (None, 128)               32896
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 516
=================================================================
Total params: 1,065,732
Trainable params: 1,065,732
Non-trainable params: 0

Train on 40000 samples, validate on 20000 samples
Epoch 1/10
40000/40000 [==============================] - 23s 587us/step - loss: 0.5825 - acc: 0.8038 - val_loss: 0.2321 - val_acc: 0.9246
Epoch 2/10
40000/40000 [==============================] - 21s 521us/step - loss: 0.1433 - acc: 0.9542 - val_loss: 0.2422 - val_acc: 0.9228
Time used: 52.763230400000005
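The log stops after epoch 2 even though `epochs=10`, because the EarlyStopping callback saw val_loss rise from 0.2321 to 0.2422. The rule can be sketched in a few lines of plain Python. This is a simplified stand-in that reflects my understanding of the callback's logic with its default `patience=0`, not the Keras source; `stopping_epoch` is a hypothetical helper name.

```python
def stopping_epoch(val_losses, min_delta=0.0, patience=0):
    """Return the 1-based epoch after which training halts, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improved by more than min_delta: remember it and reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss values taken from the training log above
print(stopping_epoch([0.2321, 0.2422], min_delta=0.0001))
```

A larger `patience` would let the model train through a single bad epoch, which is one of the knobs worth trying when tuning this experiment.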

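The prediction results are shown below: the confusion matrix and classification report for the test set, followed by the classification report for the validation set.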

[[4593  143  113  151]
 [  81 4679   60  180]
 [ 110  199 4590  101]
 [  73  254   82 4591]]

             precision    recall  f1-score   support

          0       0.95      0.92      0.93      5000
          1       0.89      0.94      0.91      5000
          2       0.95      0.92      0.93      5000
          3       0.91      0.92      0.92      5000

avg / total       0.92      0.92      0.92     20000

             precision    recall  f1-score   support

          0       0.94      0.90      0.92      5000
          1       0.89      0.95      0.92      5000
          2       0.95      0.90      0.93      5000
          3       0.91      0.94      0.93      5000

avg / total       0.92      0.92      0.92     20000
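The first report can be reproduced by hand from the confusion matrix, which also gives more than two decimal places. The sketch below uses only NumPy and the matrix printed above; it assumes the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Test-set confusion matrix copied from the output above
confm = np.array([[4593, 143, 113, 151],
                  [81, 4679, 60, 180],
                  [110, 199, 4590, 101],
                  [73, 254, 82, 4591]])

true_pos = np.diag(confm)
recall = true_pos / confm.sum(axis=1)      # per true class (row sums)
precision = true_pos / confm.sum(axis=0)   # per predicted class (column sums)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / confm.sum()

for k in range(len(confm)):
    print(f"class {k}: precision={precision[k]:.4f} "
          f"recall={recall[k]:.4f} f1={f1[k]:.4f}")
print(f"accuracy={accuracy:.4f}")
```

Rounded to two decimals, the precision and recall values agree with the first classification report.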

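VII. BiLSTM+Attention for Chinese Text Classification

1. How it works

The attention mechanism is a problem-solving approach modelled on human attention: put simply, it quickly picks the high-value information out of a large amount of input. It is mainly used to address the difficulty LSTM/RNN models have in producing a good final vector representation when the input sequence is long. The idea is to keep the LSTM's intermediate outputs, learn over them with an additional model, and tie the result to the output, which is how the information gets filtered.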
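One common way to write this down is additive attention pooling, which is also what the AttentionLayer in the code section of this chapter computes. The notation is my own rather than taken from a reference: h_t is the vector at time step t that the layer receives (there, the per-step output of the Dense layer on top of the BiLSTM), T is the sequence length, and W, b and v are trainable parameters.

```latex
\begin{aligned}
u_t      &= \tanh\left(W^{\top} h_t + b\right), \\
\alpha_t &= \frac{\exp\left(v^{\top} u_t\right)}{\sum_{k=1}^{T} \exp\left(v^{\top} u_k\right)}, \\
s        &= \sum_{t=1}^{T} \alpha_t \, h_t .
\end{aligned}
```

The weights α_t are non-negative and sum to 1, so s is a weighted average of the intermediate results, and it is s that goes into the final softmax classifier.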

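What is attention?

First, a brief description of what the attention mechanism is. Anyone working in NLP will be familiar with it: it rose to prominence in the paper "Attention Is All You Need", where it gave deep models a large performance gain on machine translation and produced the state-of-the-art model of the time. That model also used many other useful tricks to improve performance, but there is no denying that attention is its core.

The attention mechanism is, as the name suggests, a technique that lets a model focus on the important information and learn it thoroughly. It is not a complete model in its own right but a technique that can be applied to any sequence model.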


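Why attention?

Why introduce the attention mechanism? In a seq2seq model, for example, a text sequence is usually encoded by some mechanism, such as dimensionality reduction, into a fixed-length vector that is fed to the fully connected layers that follow. Typically a CNN or an RNN (including GRU or LSTM) encodes the sequence, and then some form of pooling, or simply the RNN's hidden state at the last time step t, is taken as the sentence vector.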
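The conventional options mentioned here are easy to compare side by side. The sketch below is a toy illustration with random numbers standing in for an encoder's per-step outputs; the array shape and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy encoder output: 5 time steps, hidden size 4 (random stand-in values)
hidden_states = rng.normal(size=(5, 4))

last_state = hidden_states[-1]             # hidden state at the last time step
mean_pooled = hidden_states.mean(axis=0)   # average pooling over time
max_pooled = hidden_states.max(axis=0)     # max pooling over time

for name, vec in [("last", last_state), ("mean", mean_pooled), ("max", max_pooled)]:
    print(name, vec.shape, np.round(vec, 3))
```

All three give a vector of the same length whatever the sequence length, but each combines the time steps with a fixed rule that does not depend on what the steps contain.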

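But there is a problem: conventional encoding cannot express how much attention each element of a sentence deserves. In natural language, different parts of a sentence carry different meaning and importance. Take the sentence "I hate this movie": for sentiment analysis, the word "hate" clearly deserves more attention. CNNs and RNNs can of course encode some of this information, but that capacity has a ceiling, and for longer texts the model stops improving much.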
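Attention replaces the fixed pooling rule with learned, per-token weights. The following NumPy sketch shows the computation on the four tokens of the example sentence. It is a toy under stated assumptions: `attention_pool` is a hypothetical helper, and the hidden states and parameters are random rather than trained, so the printed weights illustrate the mechanics only and say nothing about which word a trained model would favour.

```python
import numpy as np


def attention_pool(H, W, b, v):
    """Additive attention pooling.

    H: (T, d) hidden states, W: (d, a), b: (a,), v: (a,).
    Returns the (T,) attention weights and the (d,) weighted summary.
    """
    u = np.tanh(H @ W + b)             # (T, a) hidden representation per step
    scores = u @ v                     # (T,) one scalar score per step
    scores = scores - scores.max()     # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over time
    summary = weights @ H              # weighted sum of the hidden states
    return weights, summary


tokens = ["I", "hate", "this", "movie"]
rng = np.random.default_rng(0)
H = rng.normal(size=(len(tokens), 6))  # random stand-in for BiLSTM outputs
W = rng.normal(size=(6, 3))
b = np.zeros(3)
v = rng.normal(size=3)

weights, summary = attention_pool(H, W, b, v)
for tok, w in zip(tokens, weights):
    print(f"{tok:>6}: {w:.3f}")
print("summary shape:", summary.shape)
```

During training, W, b and v are updated by backpropagation together with the rest of the network, which is how informative tokens can end up with larger weights.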

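Reference and recommended reading: https://zhuanlan.zhihu.com/p/46313756

Attention is used very widely, for text, images and more.

Text: applied in seq2seq models, most commonly for translation

Images: applied to image feature extraction in convolutional neural networks

Speech

The figure below shows a fairly classic BiLSTM+Attention model, which is also the model we build next.

2. Code implementation

The Keras code for BiLSTM+Attention text classification is as follows: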

Keras_Attention_BiLSTM_cnews.py

"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
BiLSTM+Attention Model
"""
import os
import time
import pickle
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential

# GPU acceleration: CuDNNLSTM is faster than LSTM
from keras.layers import CuDNNLSTM, CuDNNGRU
from keras.layers import Bidirectional

## GPU setup: comment this part out if you run on a CPU
## Cap the GPU memory each process may use: 0.8 means up to 80% of GPU memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

start = time.perf_counter()   # time.clock() was removed in Python 3.8

#---------------------------- Step 1: read the data ----------------------------
## Read the datasets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")
print(train_df.head())

## Make Chinese text render correctly in plots
plt.rcParams['font.sans-serif'] = ['KaiTi']   # default font (SimHei is an alternative)
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly in saved figures

#-------------------------- Step 2: OneHotEncoder() encoding --------------------
## Encode the labels of each dataset
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
print("Label:")
print(train_y[:10])

le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1,1)
val_y = le.transform(val_y).reshape(-1,1)
test_y = le.transform(test_y).reshape(-1,1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))

## One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#----------------------- Step 3: encode the words with Tokenizer --------------------
max_words = 6000
max_len = 600
tok = Tokenizer(num_words=max_words)   # vocabulary limit of 6000 words
print(train_df.cutword[:5])
print(type(train_df.cutword))

## Cast to str in case the corpus contains purely numeric entries
train_content = [str(a) for a in train_df.cutword.tolist()]
val_content = [str(a) for a in val_df.cutword.tolist()]
test_content = [str(a) for a in test_df.cutword.tolist()]
tok.fit_on_texts(train_content)
print(tok)

## Save the fitted Tokenizer and load it back
with open('tok.pickle', 'wb') as handle:   # saving
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:   # loading
    tok = pickle.load(handle)

#--------------------------- Step 4: convert the data to sequences -----------------------------
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

## Pad or truncate every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len)
print("Sequences:")
print(train_seq_mat.shape)
print(val_seq_mat.shape)
print(test_seq_mat.shape)
print(train_seq_mat[:2])

#--------------------------- Step 5: build the attention layer ----------------------
"""
Keras currently has no ready-made Attention layer that can be used directly,
so we need to build a new layer ourselves.
A custom Keras layer has four main parts:
init: initialise the parameters the layer needs
build: define what the weights look like
call: the core part, define how the vectors are computed
compute_output_shape: define the shape of the layer's output
Recommended reading:
https://blog.csdn.net/huanghaocs/article/details/95752379
https://zhuanlan.zhihu.com/p/29201491
"""
# Hierarchical Model with Attention
from keras import initializers
from keras import constraints
from keras import activations
from keras import regularizers
from keras import backend as K
from keras.engine.topology import Layer

K.clear_session()

class AttentionLayer(Layer):
    def __init__(self, attention_size=None, **kwargs):
        self.attention_size = attention_size
        super(AttentionLayer, self).__init__(**kwargs)

    def get_config(self):
        config = super().get_config()
        config['attention_size'] = self.attention_size
        return config

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.time_steps = input_shape[1]
        hidden_size = input_shape[2]
        if self.attention_size is None:
            self.attention_size = hidden_size
        self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size),
                                 initializer='uniform', trainable=True)
        self.b = self.add_weight(name='att_bias', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        self.V = self.add_weight(name='att_var', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    def call(self, inputs):
        # Reshape into a local variable so the weight attribute self.V is not overwritten
        V = K.reshape(self.V, (-1, 1))
        H = K.tanh(K.dot(inputs, self.W) + self.b)
        score = K.softmax(K.dot(H, V), axis=1)    # attention weights over the time axis
        outputs = K.sum(score * inputs, axis=1)   # weighted sum of the inputs
        return outputs

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[2]

#------------------------------- Step 6: build the BiLSTM model --------------------------
## Define the BiLSTM model
## BiLSTM+Attention
num_labels = 4
inputs = Input(name='inputs',shape=[max_len],dtype='float64')
layer = Embedding(max_words+1, 256, input_length=max_len)(inputs)
#lstm = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.1, return_sequences=True))(layer)
bilstm = Bidirectional(CuDNNLSTM(128, return_sequences=True))(layer)   # return_sequences=True keeps the 3-D output
layer = Dense(128, activation='relu')(bilstm)
layer = Dropout(0.2)(layer)
## Attention mechanism
attention = AttentionLayer(attention_size=50)(layer)
output = Dense(num_labels, activation='softmax')(attention)
model = Model(inputs=inputs, outputs=output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',   # RMSprop()
              metrics=["accuracy"])

#------------------------------- Step 7: train and predict --------------------------
## Set flag to "train" first, then to "test"
flag = "test"
if flag == "train":
    print("Training")
    ## Stop training once val_loss no longer improves by at least 0.0001
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                          validation_data=(val_seq_mat,val_y),
                          callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)]
                          )
    ## Save the model
    model.save('my_model.h5')
    # The original called `del model` here, which breaks step 8 below in train mode
    ## Elapsed time
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("Predicting")
    ## Load the trained model; pass the layer class (not an instance) as the custom object
    model = load_model('my_model.h5', custom_objects={'AttentionLayer': AttentionLayer}, compile=False)
    ## Predict on the test set
    test_pre = model.predict(test_seq_mat)
    ## Evaluate the predictions with a confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1))
    print(confm)

    ## Visualise the confusion matrix
    Labname = ["體育", "文化", "財經", "游戲"]
    print(metrics.classification_report(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1)))
    plt.figure(figsize=(8,8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label',size = 14)
    plt.ylabel('Predicted label', size = 14)
    plt.xticks(np.arange(4)+0.5, Labname, size = 12)
    plt.yticks(np.arange(4)+0.5, Labname, size = 12)
    plt.savefig('result.png')
    plt.show()

#---------------------------------- Step 8: validate the model --------------------------
## Re-encode the validation set with tok and predict with the trained model
## (val_content is the str-cast list built in step 3)
val_seq = tok.texts_to_sequences(val_content)
## Pad or truncate every sequence to the same length
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
## Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y,axis=1),np.argmax(val_pre,axis=1)))
## Elapsed time
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)

The model summary is shown below:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
inputs (InputLayer)          (None, 600)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 600, 256)          1536256
_________________________________________________________________
bidirectional_1 (Bidirection (None, 600, 256)          395264
_________________________________________________________________
dense_1 (Dense)              (None, 600, 128)          32896
_________________________________________________________________
dropout_1 (Dropout)          (None, 600, 128)          0
_________________________________________________________________
attention_layer_1 (Attention (None, 128)               6500
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 516
=================================================================
Total params: 1,971,432
Trainable params: 1,971,432
Non-trainable params: 0
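The Param # column of both model summaries in this article can be checked by hand, which is a useful sanity test when changing layer sizes. The helper functions below are my own, written for this check; the formula for CuDNNLSTM assumes two bias vectors per gate, which is what makes the numbers match the summaries.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index (max_words + 1 rows in the scripts)
    return vocab_size * dim


def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and two bias vectors
    return 4 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units


def attention_params(hidden_size, attention_size):
    # W: (hidden_size, attention_size); b and V: (attention_size,) each
    return hidden_size * attention_size + 2 * attention_size


# BiLSTM model: Embedding(6001, 128) -> Bidirectional(CuDNNLSTM(128)) -> Dense(128) -> Dense(4)
bilstm_total = (embedding_params(6001, 128)
                + 2 * cudnn_lstm_params(128, 128)   # forward and backward LSTM
                + dense_params(256, 128)
                + dense_params(128, 4))

# BiLSTM+Attention model: Embedding(6001, 256) -> BiLSTM(128) -> Dense(128) -> Attention(50) -> Dense(4)
attention_total = (embedding_params(6001, 256)
                   + 2 * cudnn_lstm_params(256, 128)
                   + dense_params(256, 128)
                   + attention_params(128, 50)
                   + dense_params(128, 4))

print(bilstm_total, attention_total)
```

The attention layer itself adds only 6,500 parameters; most of the growth over the plain BiLSTM model comes from doubling the embedding size to 256.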

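The prediction results are shown below: the confusion matrix and classification report for the test set, followed by the classification report for the validation set.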

[[4625  138  100  137]
 [  63 4692   77  168]
 [ 129  190 4589   92]
 [  82  299   78 4541]]

             precision    recall  f1-score   support

          0       0.94      0.93      0.93      5000
          1       0.88      0.94      0.91      5000
          2       0.95      0.92      0.93      5000
          3       0.92      0.91      0.91      5000

avg / total       0.92      0.92      0.92     20000

             precision    recall  f1-score   support

          0       0.95      0.91      0.93      5000
          1       0.88      0.95      0.91      5000
          2       0.95      0.90      0.92      5000
          3       0.92      0.93      0.93      5000

avg / total       0.92      0.92      0.92     20000

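VIII. Summary

I. Overview of text classification

II. Data preprocessing and word segmentation

III. CNN for Chinese text classification

IV. TextCNN for Chinese text classification

V. LSTM for Chinese text classification

VI. BiLSTM for Chinese text classification

VII. BiLSTM+Attention for Chinese text classification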

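The comparison of the experimental results is shown in the figure below. The results are far from ideal. Here are a few questions to think about, which later articles will cover.

How can the results be evaluated with a custom function? The built-in report only keeps two decimal places.

How can the results be analysed visually?

How should parameters be selected and tuned?

How can the loss curve, accuracy curve and AUC curve of the experiment be plotted?

I hope you enjoyed this article. From watching videos to writing the code, it really took me a week to write. Thanks again to the authors of the references. I sincerely hope this article helps you. Keep going!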

https://github.com/eastmountyxz/AI-for-Keras

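Here's to 2022. I am grateful to have met you all on Huawei Cloud!

I hope we can grow together in the Huawei Cloud community. Original article: https://blog.csdn.net/Eastmount/article/details/114809729

(By: 娜璋之家 Eastmount, 2022-01-07, at night in Wuhan)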
