Bag of Tricks for Efficient Text Classification
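I. Paper overview:
Abstract: The paper proposes a simple and efficient text classification model. Its accuracy is on par with deep learning models, while training and evaluation are several orders of magnitude faster.
1. Introduction: Text classification is a very important task in natural language processing. Deep learning classifiers are accurate but slow, whereas linear classifiers generally also perform well and are fast, so the paper proposes a fast linear classifier, fastText.
2. Model architecture: Describes the fastText model structure in detail, along with two tricks: hierarchical softmax and n-gram features.
3. Experiments: Reports very good results on both text classification and tag prediction.
4. Discussion and conclusion: Summarizes the paper.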
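II. Objectives
(1) Background:
Deep learning methods: pros (high accuracy, no feature engineering, compact models); cons (too slow to use on large-scale text classification tasks).
Linear classifiers: pros (usually fast, reasonably accurate); cons (require feature engineering, and accuracy depends on the choice of features).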
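(2) Motivation:
Combine the strengths of deep learning and classical machine learning text classifiers to get:
Fast training and inference
High accuracy
Automatic feature engineering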
(3) The fastText classification model
Model description
Differences and connections between fastText and CBOW
Bag of tricks: hierarchical softmax and n-gram features
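To make the n-gram trick concrete, here is a minimal sketch of adding word bigrams to a bag of words. The function name `add_word_ngrams` and the toy sentences are illustrative assumptions, not taken from the paper or from the fastText code base.

```python
def add_word_ngrams(tokens, n=2):
    """Return the unigrams plus every contiguous word n-gram up to length n.

    Hypothetical helper for illustration; not the fastText implementation.
    """
    features = list(tokens)
    for size in range(2, n + 1):
        for i in range(len(tokens) - size + 1):
            features.append(" ".join(tokens[i:i + size]))
    return features


first = "dog bites man".split()
second = "man bites dog".split()

# A plain bag of words cannot tell the two sentences apart ...
same_bag_of_words = sorted(first) == sorted(second)

# ... but the bigram features keep some local word order.
first_features = add_word_ngrams(first)
second_features = add_word_ngrams(second)
print(same_bag_of_words, first_features, second_features)
```

The two sentences share the same unigrams, yet their bigram features differ, which is the partial word-order information that a bag of n-grams recovers.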
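(4) Differences and connections between the fastText model and the CBOW model
Connections:
1. Both are log-linear models and very simple.
2. Both average the input word vectors and then make a prediction.
3. The model architecture is the same.
Differences:
1. fastText extracts sentence-level features, whereas CBOW extracts context features.
2. fastText needs labeled data and is supervised, whereas CBOW needs no labels and is unsupervised.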
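(5) Problems with the basic fastText model:
1. When the number of classes is very large, the final softmax is still very slow.
2. It uses a bag-of-words model, so word order is lost.
Solutions:
1. Use hierarchical softmax, as in word2vec.
2. Use n-gram features.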
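The speed-up from hierarchical softmax can be illustrated with a small sketch. It assumes a Huffman tree built from class frequencies (as in word2vec), so frequent classes sit near the root and need fewer binary decisions. The class names and counts below are made up, and `huffman_depths` is a hypothetical helper.

```python
import heapq
import itertools


def huffman_depths(class_counts):
    """Return {class: leaf depth} for a Huffman tree built from class counts.

    The depth of a leaf is the number of binary decisions needed to reach it.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(count, next(tie), {label: 0}) for label, count in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level down.
        merged = {label: depth + 1 for label, depth in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, next(tie), merged))
    return heap[0][2]


# Made-up class frequencies, for illustration only.
counts = {"sports": 50, "politics": 25, "tech": 15, "food": 7, "chess": 3}
depths = huffman_depths(counts)

# Average number of binary decisions per example, weighted by class frequency.
total = sum(counts.values())
expected_depth = sum(counts[c] * depths[c] for c in counts) / total
print(depths, expected_depth)
```

A flat softmax scores every class for each example, while the tree needs only as many binary decisions as the depth of the target leaf, which stays small for frequent classes.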
III. The paper in detail
Abstract
This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
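1. The paper proposes fastText, a simple and efficient baseline for text classification.
2. fastText is often on par with deep learning classifiers in accuracy, but many orders of magnitude faster for training and evaluation.
3. On a standard multicore CPU, fastText trains on more than one billion words in less than ten minutes, and classifies half a million sentences among 312K classes in less than a minute.
Summary:
1. Text classification is an important NLP task, with applications in information retrieval, web search, document classification, and more.
2. Deep learning methods reach very good accuracy but are slow, which limits where text classification can be applied.
3. Linear classifiers from classical machine learning also perform well and have the potential to scale to large classification tasks.
4. Inspired by recent work on word representation learning, the paper proposes fastText, a text classification method that trains and evaluates quickly while matching state-of-the-art accuracy.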
(1) Introduction
Text classification is an important task in Natural Language Processing with many applications, such as web search, information retrieval, ranking and document classification (Deerwester et al., 1990; Pang and Lee, 2008). Recently, models based on neural networks have become increasingly popular (Kim, 2014; Zhang and LeCun, 2015; Conneau et al., 2016). While these models achieve very good performance in practice, they tend to be relatively slow both at train and test time, limiting their use on very large datasets.
Linear models: Meanwhile, linear classifiers are often considered as strong baselines for text classification problems (Joachims, 1998; McCallum and Nigam, 1998; Fan et al., 2008). Despite their simplicity, they often obtain state-of-the-art performances if the right features are used (Wang and Manning, 2012). They also have the potential to scale to very large corpus (Agarwal et al., 2014).
Introducing fastText: In this work, we explore ways to scale these baselines to very large corpus with a large output space, in the context of text classification. Inspired by the recent work in efficient word representation learning (Mikolov et al., 2013; Levy et al., 2015), we show that linear models with a rank constraint and a fast loss approximation can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art. We evaluate the quality of our approach fastText on two different tasks, namely tag prediction and sentiment analysis.
(2) Model architecture
A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM (Joachims, 1998; Fan et al., 2008). However, linear classifiers do not share parameters among features and classes. This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices (Schutze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015).
Figure 1 shows a simple linear model with rank constraint. The first weight matrix $A$ is a look-up table over the words. The word representations are then averaged into a text representation, which is in turn fed to a linear classifier. The text representation is a hidden variable which can potentially be reused. This architecture is similar to the cbow model of Mikolov et al. (2013), where the middle word is replaced by a label. We use the softmax function $f$ to compute the probability distribution over the predefined classes. For a set of $N$ documents, this leads to minimizing the negative log-likelihood over the classes:
$$-\frac{1}{N} \sum_{n=1}^{N} y_n \log\left(f(B A x_n)\right),$$
where $x_n$ is the normalized bag of features of the $n$-th document, $y_n$ the label, $A$ and $B$ the weight matrices. This model is trained asynchronously on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.
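The architecture described above (look-up, average, linear classifier, softmax, negative log-likelihood) can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the matrices are stored as lists of rows (one embedding row per feature in `A`, one weight row per class in `B`), the toy values are made up, and training by stochastic gradient descent is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(A, B, feature_ids):
    """Class probabilities for one document given the ids of its features.

    Assumed layout: A[i] is the embedding of feature i, B[k] the weights of class k.
    """
    dim = len(A[0])
    # Averaging the looked-up embeddings plays the role of A x_n
    # with x_n the normalized bag of features.
    hidden = [sum(A[i][d] for i in feature_ids) / len(feature_ids) for d in range(dim)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in B]
    return softmax(scores)


def negative_log_likelihood(A, B, documents, labels):
    """Average of -log f(B A x_n)[y_n] over the documents."""
    total = 0.0
    for feature_ids, label in zip(documents, labels):
        total -= math.log(forward(A, B, feature_ids)[label])
    return total / len(documents)


# Toy values, for illustration only: 3 features, 2 dimensions, 2 classes.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[2.0, 0.0], [0.0, 2.0]]
documents = [[0, 2], [1, 2]]
labels = [0, 1]

probs = forward(A, B, documents[0])
loss = negative_log_likelihood(A, B, documents, labels)
print(probs, loss)
```

Because the hidden representation has a fixed small dimension, the product of `B` and `A` is a low-rank factorization of a full linear classifier, which is the rank constraint mentioned in the text.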
IV. Subword model
Uses character-level n-grams.
3 Model
In this section, we propose our model to learn word representations while taking into account morphology. We model morphology by considering subword units, and representing words by a sum of its character n-grams. We will begin by presenting the general framework that we use to train word vectors, then present our subword model and eventually describe how we handle the dictionary of character n-grams.
3.1 General model
We start by briefly reviewing the continuous skipgram model introduced by Mikolov et al. (2013b), from which our model is derived. Given a word vocabulary of size $W$, where a word is identified by its index $w \in \{1, \dots, W\}$, the goal is to learn a vectorial representation for each word $w$. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well words that appear in its context. More formally, given a large training corpus represented as a sequence of words $w_1, \dots, w_T$, the objective of the skipgram model is to maximize the following log-likelihood:
$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t),$$
where the context $\mathcal{C}_t$ is the set of indices of words surrounding word $w_t$. The probability of observing a context word $w_c$ given $w_t$ will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function $s$ which maps pairs of (word, context) to scores in $\mathbb{R}$.
One possible choice to define the probability of a context word is the softmax:
$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.$$
However, such a model is not adapted to our case as it implies that, given a word $w_t$, we only predict one context word $w_c$.
The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:
$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),$$
where $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary.
A natural parameterization for the scoring function $s$ between a word $w_t$ and a context word $w_c$ is to use word vectors. Let us define for each word $w$ in the vocabulary two vectors $u_w$ and $v_w$ in $\mathbb{R}^d$. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors $u_{w_t}$ and $v_{w_c}$, corresponding, respectively, to words $w_t$ and $w_c$. Then the score can be computed as the scalar product between word and context vectors as $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
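As a rough illustration of the negative-sampling objective described in this section, the sketch below evaluates the binary logistic loss for one (word, context) pair with the scalar-product score. The vectors are toy values and the function names are assumptions made for this example; sampling of negatives and gradient updates are omitted.

```python
import math


def dot(u, v):
    """Scalar product, used as the score s(w_t, w_c) between two vectors."""
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(x):
    """log(1 + exp(-x)), written in a numerically stable form."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def negative_sampling_loss(u_target, v_context, v_negatives):
    """Loss for one positive context word and a list of sampled negatives."""
    # Positive pair: small loss when the score is large.
    loss = logistic_loss(dot(u_target, v_context))
    # Negative pairs: small loss when the score is very negative.
    for v_neg in v_negatives:
        loss += logistic_loss(-dot(u_target, v_neg))
    return loss


# Toy vectors, for illustration only.
u_wt = [1.0, 0.0]
well_fitted = negative_sampling_loss(u_wt, [2.0, 0.0], [[-2.0, 0.0]])
badly_fitted = negative_sampling_loss(u_wt, [-2.0, 0.0], [[2.0, 0.0]])
print(well_fitted, badly_fitted)
```

The loss is low when the true context word scores high and the sampled negatives score low, and high in the opposite case.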
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function $s$, in order to take into account this information.
Each word $w$ is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences. We also include the word $w$ itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and $n = 3$ as an example, it will be represented by the character n-grams:
<wh, whe, her, ere, re>
and the special sequence
<where>.
Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for $n$ greater or equal to 3 and smaller or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
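The extraction rule described above is easy to sketch. The helper below follows the textual description only (boundary symbols, n from 3 to 6, plus the whole word as a special sequence); its name and defaults are assumptions for illustration, and it is not the code of the fastText library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in the boundary symbols < and >."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    # The whole word is kept as a special sequence, in addition to its n-grams.
    if wrapped not in grams:
        grams.append(wrapped)
    return grams


where_trigrams = char_ngrams("where", min_n=3, max_n=3)
her_ngrams = char_ngrams("her")
print(where_trigrams)
print(her_ngrams)
```

With `n = 3` only, the word where yields the five trigrams and the special sequence from the text. The word her contributes `<her>`, which never occurs among the n-grams of where, even though the trigram `her` does.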
Suppose that you are given a dictionary of n-grams of size $G$. Given a word $w$, let us denote by $\mathcal{G}_w \subset \{1, \dots, G\}$ the set of n-grams appearing in $w$. We associate a vector representation $z_g$ to each n-gram $g$. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:
$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c.$$
This simple model allows sharing the representations across words, thus allowing to learn reliable representation for rare words.
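The subword scoring function can be sketched as follows: a word vector is the sum of its n-gram vectors, and the score is the scalar product of that sum with the context vector. The n-gram vectors and the context vector below are toy values, and the function names are assumptions for this example.

```python
def word_vector(ngrams, z):
    """Sum of the vectors z[g] over the n-grams g of a word."""
    dim = len(next(iter(z.values())))
    vec = [0.0] * dim
    for g in ngrams:
        for d, value in enumerate(z[g]):
            vec[d] += value
    return vec


def score(ngrams, z, v_context):
    """s(w, c): scalar product of the summed n-gram vectors with v_c."""
    return sum(a * b for a, b in zip(word_vector(ngrams, z), v_context))


# Toy n-gram vectors and context vector, for illustration only.
z = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0], "her": [0.5, 0.5]}
v_c = [2.0, 1.0]

s_value = score(["<wh", "whe", "her"], z, v_c)
# Same value as summing the per-n-gram scalar products one by one.
per_ngram = sum(sum(a * b for a, b in zip(z[g], v_c)) for g in ["<wh", "whe", "her"])
print(s_value, per_ngram)
```

Since the n-gram vectors are shared across words, a rare word still gets a usable representation as long as its n-grams also appear in more frequent words.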
In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to $K$ (so several n-grams may map to the same vector). We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant). We set $K = 2 \cdot 10^6$ below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
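A minimal sketch of the hashing step, assuming the 32-bit FNV-1a variant applied to the UTF-8 bytes of each n-gram (the text does not state the bit width, so this is an assumption). The helper `ngram_bucket` is hypothetical and only illustrates mapping an n-gram to an integer between 1 and K.

```python
FNV_OFFSET_BASIS_32 = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 16777619           # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def ngram_bucket(ngram: str, K: int = 2_000_000) -> int:
    """Map an n-gram to an integer in 1..K (hypothetical helper)."""
    return fnv1a_32(ngram.encode("utf-8")) % K + 1


buckets = {g: ngram_bucket(g) for g in ["<wh", "whe", "her", "ere", "re>"]}
print(buckets)
```

The table of n-gram vectors then needs only K rows regardless of how many distinct n-grams the corpus contains, at the cost of occasional collisions.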
V. Experimental results and analysis
(1) Strong results on multiple tasks
(2) Good accuracy combined with very high speed
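VI. Contributions and significance
(1) Results
fastText performs well on multiple tasks.
fastText is very fast while remaining accurate.
(2) Significance
Proposes fastText, a new text classification method that is fast and accurate.
Proposes a new subword-based method for training word vectors, also called fastText, which mitigates the OOV (out-of-vocabulary) problem to some extent.
fastText is open source, so both industry and academia can adopt it quickly.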
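(3) Strengths and weaknesses of fastText
Strengths: 1. Very fast, with reasonably good accuracy.
2. An open-source implementation is available, so it is easy to get started.
Weaknesses: 1. The model is simple, so it is no longer the most accurate model available.
2. It relies on the bag-of-words idea, so the semantic information it captures is limited.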
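(4) Summary
Key points
Deep learning text classifiers are accurate but relatively slow.
Linear classifiers from classical machine learning are reasonably accurate and fast, but require tedious feature engineering.
The fastText model
Innovations
Proposes a new text classification model, fastText.
Proposes tricks that make text classification faster and more accurate: hierarchical softmax and n-gram features.
Obtains fast and accurate results on two tasks, text classification and tag prediction.
Takeaways
Although deep learning models achieve very good results, they are very slow at training and test time, which limits their use on large datasets.
Linear classifiers, however, do not share parameters among features and classes, which may limit generalization for classes with only a few examples.
Most word vector methods assign an independent vector to each word, with no parameter sharing. In particular, they ignore the internal structure of words, which matters even more for morphologically rich languages.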