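Convolutional Neural Networks for Sentence Classification
The three top NLP conferences: ACL, EMNLP, NAACL
I. Paper Overview:
Abstract: applies convolutional neural networks to sentence-level text classification and achieves strong results on multiple datasets.
Introduction: using pre-trained word vectors and a convolutional neural network, the paper proposes a simple and effective text classification model.
Model: the TextCNN model architecture and regularization.
Datasets and Experimental Setup: dataset descriptions, hyperparameter settings, and experimental results.
Results and Discussion: experimental analysis, discussion of the number of channels, and discussion of how the word vectors are used.
Conclusion: summary of the paper.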
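II. Objectives
(1) TextCNN
Convolutional layer
Pooling layer
(2) Reducing overfitting
Regularization
Dropout
(3) Hyperparameter selection
Word vector configuration
Filter size
Number of filters
Activation function
Regularization
(4) Code implementation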
III. Paper Overview
Development of deep learning
Development of word vectors
Development of CNNs
(1) Introduction
Development of word vectors: Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013) in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) and performing composition over the learned word vectors for classification (Collobert et al., 2011). Word vectors, wherein words are projected from a sparse, 1-of-V encoding (here V is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close (in Euclidean or cosine distance) in the lower dimensional vector space.
Development of CNNs: Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features (LeCun et al., 1998). Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling (Kalchbrenner et al., 2014), and other traditional NLP tasks (Collobert et al., 2011).
In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google News (the source of the word vectors), and are publicly available. We initially keep the word vectors static and learn only the other parameters of the model. Despite little tuning of hyperparameters, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are ‘universal’ feature extractors that can be utilized for various classification tasks (pre-trained word vectors transfer across tasks). Learning task-specific vectors through fine-tuning results in further improvements. We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels (mixing both kinds of word vectors).
Our work is philosophically similar to Razavian et al. (2014) which showed that for image classification, feature extractors obtained from a pretrained deep learning model perform well on a variety of tasks—including tasks that are very different from the original task for which the feature extractors were trained.
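Training a simple CNN on top of pre-trained word vectors is enough to achieve strong results on text classification tasks.
Task-specific word vectors obtained by fine-tuning the pre-trained vectors give further improvements.
The paper also proposes a model that uses both static pre-trained word vectors and task-specific word vectors.
The model achieves the best classification accuracy on four of the seven text classification tasks.
(2) Model
A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
2.1 Regularization (TextCNN regularization)
1. Dropout: during propagation through the network, each neuron is switched off with a given probability, which improves the model's ability to generalize.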
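The dropout idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: it uses the common "inverted dropout" variant (kept units are scaled up during training so nothing needs rescaling at test time), and the function name `dropout` and its parameters are assumptions made for this example.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training.

    Assumption: this is the inverted variant, chosen for illustration only.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # At test time every unit is kept and nothing is rescaled.
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - p
    # Kept units are divided by the keep probability so that the expected
    # value of each activation is the same as without dropout.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With `p=0.5`, roughly half of the units are zeroed on each call and the surviving ones are doubled.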
(3) Datasets and Experimental Setup
MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005).
SST-1: Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).
SST-2: Same as SST-1 but with neutral reviews removed and binary labels.
Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).
TREC: TREC question dataset; the task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002).
CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004).
MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).
3.1 Hyperparameters and Training
Parameter settings:
filter windows (h): 3, 4, 5 with 100 feature maps each
dropout rate (p): 0.5
l2 constraint (s): 3
mini-batch size: 50
Source: grid search on the SST-2 dev set
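The settings above can be collected in a small configuration, and the l2 constraint can be read as a max-norm rule: after a gradient step, a weight vector whose L2 norm exceeds s is rescaled back to norm s. The sketch below is only an illustration under that reading; the names `HYPERPARAMS` and `clip_l2_norm` are made up for this example.

```python
import math

# Values taken from the list above; the dictionary keys are illustrative names.
HYPERPARAMS = {
    "filter_windows": (3, 4, 5),
    "feature_maps_per_window": 100,
    "dropout_rate": 0.5,
    "l2_constraint": 3.0,
    "mini_batch_size": 50,
}

def clip_l2_norm(w, s=HYPERPARAMS["l2_constraint"]):
    """Rescale weight vector w so that its L2 norm is at most s."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= s:
        # Already inside the constraint: leave the weights unchanged.
        return list(w)
    # Shrink every component by the same factor so the norm becomes s.
    return [x * s / norm for x in w]
```

For instance, `clip_l2_norm([3.0, 4.0])` starts from a vector of norm 5 and returns one of norm 3 pointing in the same direction.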
3.2 Pre-trained Word Vectors
We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have dimensionality of 300 and were trained using the continuous bag-of-words architecture.
Word vectors: word2vec
training corpus size: 100 billion words
data: Google News
dimension: 300
architecture: CBOW
(4) Results and Discussion
Results of our models against other methods are listed in table 2. Our baseline model with all randomly initialized words (CNN-rand) does not perform well on its own (CNN with random initialization performs poorly). While we had expected performance gains through the use of pre-trained vectors, we were surprised at the magnitude of the gains. Even a simple model with static vectors (CNN-static) performs remarkably well (using pre-trained word vectors gives a very large improvement), giving competitive results against the more sophisticated deep learning models that utilize complex pooling schemes (Kalchbrenner et al., 2014) or require parse trees to be computed beforehand (Socher et al., 2013). These results suggest that the pretrained vectors are good, ‘universal’ feature extractors and can be utilized across datasets. Finetuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).
4.1 Multichannel vs. Single Channel Models
We had initially hoped that the multichannel architecture would prevent overfitting (by ensuring that the learned vectors do not deviate too far from the original values) and thus work better than the single channel model, especially on smaller datasets (the hope was that multiple channels would curb overfitting). The results, however, are mixed (in practice the two perform about the same), and further work on regularizing the fine-tuning process is warranted. For instance, instead of using an additional channel for the non-static portion, one could maintain a single channel but employ extra dimensions that are allowed to be modified during training (an alternative to adding a separate non-static channel).
4.2 Static vs. Non-static Representations
As is the case with the single channel non-static model, the multichannel model is able to fine-tune the non-static channel to make it more specific to the task-at-hand. For example, good is most similar to bad in word2vec, presumably because they are (almost) syntactically equivalent. But for vectors in the non-static channel that were finetuned on the SST-2 dataset, this is no longer the case (table 3). Similarly, good is arguably closer to nice than it is to great for expressing sentiment, and this is indeed reflected in the learned vectors.
For (randomly initialized) tokens not in the set of pre-trained vectors, fine-tuning allows them to learn more meaningful representations: the network learns that exclamation marks are associated with effusive expressions and that commas are conjunctive (table 3).
4.3 Further Observations
We report on some further experiments and observations:
Results improve considerably because more feature maps are used. Kalchbrenner et al. (2014) report much worse results with a CNN that has essentially the same architecture as our single channel model. For example, their Max-TDNN (Time Delay Neural Network) with randomly initialized words obtains 37.4% on the SST-1 dataset, compared to 45.0% for our model. We attribute such discrepancy to our CNN having much more capacity (multiple filter widths and feature maps).
Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%–4% relative performance (dropout improves performance by 2%-4%).
When randomly initializing words not in word2vec, we obtained slight improvements by sampling each dimension from U[-a, a] where a was chosen such that the randomly initialized vectors have the same variance as the pre-trained ones. It would be interesting to see if employing more sophisticated methods to mirror the distribution of pre-trained vectors in the initialization process gives further improvements.
We briefly experimented with another set of publicly available word vectors trained by Collobert et al. (2011) on Wikipedia, and found that word2vec gave far superior performance. It is not clear whether this is due to Mikolov et al. (2013)’s architecture or the 100 billion word Google News dataset. (word2vec works much better, but it is unclear whether the credit belongs to the model or to the training data.)
Adadelta (Zeiler, 2012) gave similar results to Adagrad (Duchi et al., 2011) but required fewer epochs.
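The variance-matched initialization mentioned above follows from a standard fact: a uniform distribution on [-a, a] has variance a²/3, so matching a target variance v gives a = sqrt(3v). The sketch below works this out in plain Python. It assumes the variance is pooled over all entries of the pre-trained vectors (the notes do not say how it was computed), and the function names are made up for this example.

```python
import math
import random

def matched_uniform_bound(pretrained_vectors):
    """Return a such that U[-a, a] has the variance of the pooled entries.

    Assumption: the variance is taken over all entries of all vectors.
    """
    values = [x for vec in pretrained_vectors for x in vec]
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    # Var(U[-a, a]) = a^2 / 3, hence a = sqrt(3 * var).
    return math.sqrt(3.0 * var)

def init_unknown_word(dim, a, rng=None):
    """Sample each dimension of a new word vector from U[-a, a]."""
    rng = rng or random.Random()
    return [rng.uniform(-a, a) for _ in range(dim)]
```

A word missing from the pre-trained table would then get `init_unknown_word(300, matched_uniform_bound(vectors))`.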
5 Conclusion
In the present work we have described a series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.
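Key points:
Pre-trained word vectors: Word2Vec, GloVe
Convolutional network structure: one-dimensional convolution, pooling layer
Hyperparameter selection: choice of filters, choice of word vector configuration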
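The one-dimensional convolution and pooling named in the key points can be sketched without any library. A filter spanning h consecutive words slides over the sentence, producing one feature per window; max-over-time pooling then keeps the largest feature. This is a simplified illustration for a single filter: ReLU is used as the non-linearity here as an assumption, and the function names are made up for this example.

```python
def conv_feature_map(sentence, filt, bias=0.0):
    """Apply one filter to a sentence.

    sentence: list of n word vectors, each of length k.
    filt: list of h weight rows, each of length k (a window of h words).
    Returns the n - h + 1 features, one per window position.
    """
    h, n = len(filt), len(sentence)
    if n < h:
        raise ValueError("sentence is shorter than the filter window")
    features = []
    for i in range(n - h + 1):
        total = bias
        # Dot product between the filter and the window of words i .. i+h-1.
        for row, word in zip(filt, sentence[i:i + h]):
            total += sum(w * x for w, x in zip(row, word))
        features.append(max(0.0, total))  # ReLU non-linearity (assumed)
    return features

def max_over_time(features):
    """Max-over-time pooling: keep the single largest feature."""
    return max(features)

sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]  # 4 toy words, 2 dimensions each
filt = [[1, 0], [0, 1]]                      # window of h = 2 words
features = conv_feature_map(sentence, filt)
pooled = max_over_time(features)
```

In a full model, each filter contributes one pooled value, and the pooled values of all filters are concatenated before the final classification layer.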
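Contributions:
Proposes TextCNN, a CNN-based text classification model
Proposes several ways of configuring the word vectors
Achieves the best results on four text classification tasks
Carries out extensive experiments and analysis of the hyperparameters
Takeaways:
Fine-tuning on top of a pre-trained model gives very good results, which suggests that the pre-trained word vectors have learned some general-purpose features.
A simple model on top of pre-trained word vectors can outperform more complex models.
For words that are not in the pre-trained vocabulary, fine-tuning lets them learn more meaningful representations.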
IV. Hyperparameter Selection (from a second paper: A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification)
Embedding type
Filter size
Number of filters
Activation function
Dropout
L2 regularization
4.1 Baseline Configuration
We first consider the performance of a baseline CNN configuration. Specifically, we start with the architectural decisions and hyperparameters used in previous work (Kim, 2014) and described in Table 2. To contextualize the variance in performance attributable to various architecture decisions and hyperparameter settings, it is critical to assess the variance due strictly to the parameter estimation procedure. Most prior work, unfortunately, has not reported such variance, despite a highly stochastic learning procedure (earlier work ignored this source of variance). This variance is attributable to estimation via SGD, random dropout, and random weight parameter initialization. Holding all variables (including the folds) constant, we show that the mean performance calculated via 10-fold cross validation (CV) exhibits relatively high variance over repeated runs (even with every setting fixed, the 10-fold CV mean still fluctuates noticeably). We replicated CV experiments 100 times for each dataset (the whole CV experiment is repeated 100 times; the data itself is not copied), so that each replication was a 10-fold CV, wherein the folds were fixed. We recorded the average performance for each replication and report the mean, minimum and maximum average accuracy (or AUC) values observed over 100 replications of CV (that is, we report means and ranges of averages calculated over 10-fold CV) (reporting the mean and range makes the fluctuation visible). This provides a sense of the variance we might observe without any changes to the model. We did this for both static and non-static methods. For all experiments, we used the same preprocessing steps for the data as in (Kim, 2014). For SGD, we used the ADADELTA update rule (Zeiler, 2012), and set the minibatch size to 50. We randomly selected 10% of the training data as the validation set for early stopping.
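The reporting scheme described above (average each 10-fold CV replication, then report the mean, minimum and maximum of those averages) can be sketched as follows. The function name `summarize_replications` and the input layout are assumptions made for this example.

```python
from statistics import mean

def summarize_replications(fold_scores_per_replication):
    """Summarize repeated cross-validation runs.

    fold_scores_per_replication: one inner list per replication, holding
    that replication's per-fold scores (accuracy or AUC).
    """
    # One average per replication, i.e. the usual CV estimate for that run.
    averages = [mean(folds) for folds in fold_scores_per_replication]
    return {
        "mean": mean(averages),
        "min": min(averages),
        "max": max(averages),
    }
```

The gap between `max` and `min` is the range that the passage asks readers to keep in mind when comparing two configurations.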
4.2 Effect of input word vectors (embedding settings)
A nice property of sentence classification models that start with distributed representations of words as inputs is the flexibility such architectures afford to swap in different pre-trained word vectors during model initialization. Therefore, we first explore the sensitivity of CNNs for sentence classification with respect to the input representations used. Specifically, we replaced word2vec with GloVe representations (two kinds of word vectors: word2vec and GloVe). Google word2vec uses a local context window model trained on 100 billion words from Google News (Mikolov et al., 2013), while GloVe is a model based on global word-word co-occurrence statistics (Pennington et al., 2014). We used a GloVe model trained on a corpus of 840 billion tokens of web data. For both word2vec and GloVe we induce 300-dimensional word vectors. We report results achieved using GloVe representations in Table 3. Here we only report non-static GloVe results (which again uniformly outperformed the static variant).
We also experimented with concatenating word2vec and GloVe representations, thus creating 600-dimensional word vectors to be used as input to the CNN. Pre-trained vectors may not always be available for specific words (either in word2vec or GloVe, or both); in such cases, we randomly initialized the corresponding subvectors. Results are reported in the final column of Table 3.
word2vec: 300 dimensions, 100 billion words
GloVe: 300 dimensions, 840 billion tokens
word2vec + GloVe: 600 dimensions
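The concatenation described above can be sketched as a lookup that falls back to a random subvector whenever a word is missing from one of the two tables. This is only an illustration: the function name, the dictionary-based tables and the bound `a=0.25` for the random fallback are all assumptions, not values taken from the paper.

```python
import random

def concat_embedding(word, w2v, glove, dim=300, a=0.25, rng=None):
    """Build a 2 * dim vector from word2vec and GloVe lookups.

    w2v, glove: dicts mapping a word to a list of dim floats.
    a: hypothetical bound for the uniform random fallback.
    """
    rng = rng or random.Random()

    def lookup(table):
        vec = table.get(word)
        if vec is None:
            # Word missing from this table: random subvector instead.
            vec = [rng.uniform(-a, a) for _ in range(dim)]
        return list(vec)

    # word2vec half first, GloVe half second.
    return lookup(w2v) + lookup(glove)
```

With `dim=300` this yields the 600-dimensional input listed above.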
4.3 Effect of filter region size
5 Conclusions
5.1 Summary of Main Empirical Findings
Results still fluctuate even when all settings are fixed. Prior work has tended to report only the mean performance on datasets achieved by models. But this overlooks variance due solely to the stochastic inference procedure used. This can be substantial: holding everything constant (including the folds), so that variance is due exclusively to the stochastic inference procedure, we find that mean accuracy (calculated via 10 fold cross-validation) has a range of up to 1.5 points. And the range over the AUC achieved on the irony dataset is even greater, up to 3.4 points (see Table 3). More replication should be performed in future work, and ranges/variances should be reported, to prevent potentially spurious conclusions regarding relative model performance.
We find that, even when tuning them to the task at hand, the choice of input word vector representation (e.g., between word2vec and GloVe) has an impact on performance, however different representations perform better for different tasks (the choice of word vectors matters, but which one works best depends on the task). At least for sentence classification, both seem to perform better than using one-hot vectors directly. We note, however, that: (1) this may not be the case if one has a sufficiently large amount of training data (with enough data, one-hot vectors may do better), and, (2) the recent semi-supervised CNN model proposed by Johnson and Zhang (Johnson and Zhang, 2015) may improve performance, as compared to the simpler version of the model considered here (i.e., proposed in (Johnson and Zhang, 2014)) (a more sophisticated model may do better).
The filter region size can have a large effect on performance, and should be tuned.
The number of feature maps (filters) can also play an important role in the performance, and increasing the number of feature maps will increase the training time of the model.
1-max pooling uniformly outperforms other pooling strategies.
Regularization has relatively little effect on the performance of the model.
5.2 Specific advice to practitioners
Drawing upon our empirical results, we provide the following guidance regarding CNN architecture and hyperparameters for practitioners looking to deploy CNNs for sentence classification tasks.
Consider starting with the basic configuration described in Table 2 and using non-static word2vec or GloVe rather than one-hot vectors. However, if the training dataset size is very large, it may be worthwhile to explore using one-hot vectors. Alternatively, if one has access to a large set of unlabeled in-domain data, (Johnson and Zhang, 2015) might also be an option.
Choosing the filter region size: Line-search over the single filter region size to find the ‘best’ single region size (search over single region sizes and keep the best one). A reasonable range might be 1~10. However, for datasets with very long sentences like CR, it may be worth exploring larger filter region sizes (for long sentences, enlarge the filters accordingly). Once this ‘best’ region size is identified, it may be worth exploring combining multiple filters using region sizes near this single best size (after finding the best size, also try combinations of nearby sizes), given that empirically multiple ‘good’ region sizes always outperformed using only the single best region size.
Alter the number of feature maps for each filter region size from 100 to 600, and when this is being explored, use a small dropout rate (0.0-0.5) and a large max norm constraint. Note that increasing the number of feature maps will increase the running time, so there is a trade-off to consider. Also pay attention whether the best value found is near the border of the range (Bengio, 2012). If the best value is near 600, it may be worth trying larger values.
Activation functions: Consider different activation functions if possible: ReLU and tanh are the best overall candidates. And it might also be worth trying no activation function at all for our one-layer CNN.
No need to try other pooling options: Use 1-max pooling; it does not seem necessary to expend resources evaluating alternative strategies.
Choice of regularization: When increasing the number of feature maps begins to reduce performance, try imposing stronger regularization, e.g., a dropout rate larger than 0.5.
When assessing the performance of a model (or a particular configuration thereof), it is imperative to consider variance. Therefore, replications of the cross-fold validation procedure should be performed and variances and ranges should be considered.
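The region-size advice in the list above (line-search a single size, then try combinations near the best one) can be sketched as a two-step helper. Both function names are hypothetical, the scores in the usage lines are made-up placeholders rather than results from either paper, and the choice of "three copies of the best size" and "best size with its two neighbours" is just one plausible way to generate nearby combinations.

```python
def best_single_size(scores):
    """Pick the region size with the highest validation score.

    scores: dict mapping a single region size to its validation score.
    """
    return max(scores, key=scores.get)

def nearby_combinations(best, copies=3, low=1):
    """Candidate multi-size settings around the best single size."""
    # Neighbouring sizes, dropping anything below the smallest valid size.
    around = tuple(s for s in (best - 1, best, best + 1) if s >= low)
    return [(best,) * copies, around]

# Placeholder scores for illustration only (not real experimental results).
line_search = {1: 0.78, 3: 0.80, 5: 0.81, 7: 0.815, 10: 0.80}
best = best_single_size(line_search)
candidates = nearby_combinations(best)
```

Each candidate tuple would then be evaluated with the same replicated cross-validation procedure, so that differences smaller than the observed range are not over-interpreted.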
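V. Research Results and Significance
(1) Research results
Achieved the best classification results on four of the seven text classification tasks
CNN-rand: randomly initialized word vectors
CNN-static: static pre-trained word vectors
CNN-non-static: fine-tuned pre-trained word vectors
CNN-multichannel: both static pre-trained word vectors and fine-tuned pre-trained word vectors
(2) Historical significance
Opened the way for deep-learning-based text classification
Advanced the use of convolutional neural networks in natural language processing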