fastText

Bag of Tricks for Efficient Text Classification

I. Paper Overview

Abstract: The paper proposes a simple and efficient text classification model whose accuracy is on par with deep-learning models while being several orders of magnitude faster.

1. Introduction: Text classification is an important task in natural language processing. Deep-learning classifiers achieve strong accuracy but are slow, whereas linear classifiers are usually also accurate and fast, so the paper proposes a fast linear classifier, fastText.

2. Model architecture: Describes the fastText architecture in detail, together with two tricks: hierarchical softmax and n-gram features.

3. Experiments: fastText achieves very good results on both text classification and tag prediction.

4. Discussion and conclusion: Summarizes the paper.

II. Objectives

(1) Background

Deep-learning methods: advantages (good accuracy, no manual feature engineering, simple models); disadvantages (too slow to use for large-scale text classification).

Linear classifiers: advantages (usually very fast, reasonably good accuracy); disadvantages (require feature engineering, and classification quality depends on the chosen features).

(2) Motivation

Combine the strengths of deep-learning and classical machine-learning text classifiers to achieve:

fast training and inference

good accuracy

automatic feature engineering

(3) The fastText classification model

Model overview

Similarities and differences between fastText and CBOW

Bag of tricks: hierarchical softmax and n-gram features

(4) Similarities and differences between fastText and CBOW

Similarities:

1. Both are log-linear models and are very simple.

2. Both average the input word vectors and then make a prediction.

3. The overall model architecture is essentially the same.

Differences:

1. fastText uses sentence (document) features, whereas CBOW uses context-window features.

2. fastText requires labeled data (supervised learning), whereas CBOW does not (unsupervised learning).

(5) Limitations of the basic fastText model:

1. When there are very many classes, the final softmax is still very slow.

2. The bag-of-words representation discards word-order information.

Solutions:

1. As in word2vec, use hierarchical softmax.

2. Use n-gram features (a small sketch follows this list).
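As an illustration of the second trick, here is a minimal sketch (not fastText's actual implementation) of appending word bigram features to the unigram tokens so the bag-of-words model recovers some local word order; the helper name add_word_ngrams, the bucket count, and the feature-string format are assumptions made for this example.

```python
# Sketch of the word n-gram trick: augment unigram tokens with hashed
# bigram features so the bag-of-words model sees some local word order.
# Names and the bucket count are illustrative, not fastText's API.

def add_word_ngrams(tokens, n=2, num_buckets=2_000_000):
    """Return unigram tokens plus hashed ids for word n-grams up to length n."""
    features = list(tokens)                      # keep unigrams as-is
    for size in range(2, n + 1):
        for i in range(len(tokens) - size + 1):
            ngram = " ".join(tokens[i:i + size])
            # Hash the n-gram into a fixed number of buckets to bound memory.
            # (Python's built-in hash is randomized per run; fine for a demo.)
            features.append(f"__ngram_{hash(ngram) % num_buckets}")
    return features

print(add_word_ngrams("the movie was not good".split()))
```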

III. Detailed Walkthrough of the Paper

Abstract

This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

1. The paper proposes a simple and efficient baseline model for text classification: fastText.

2. fastText matches deep-learning classifiers in accuracy while being several orders of magnitude faster to train and evaluate.

3. Using a standard multicore CPU, fastText can be trained on one billion words in under ten minutes and can classify half a million sentences among 312K classes in under a minute.

Summary:

1. Text classification is an important NLP task with applications in information retrieval, web search, and document classification.

2. Deep-learning methods achieve very good accuracy, but their slow speed limits their use in text classification applications.

3. Machine-learning linear classifiers also perform well and have the potential to scale to large classification tasks.

4. Inspired by recent work on word-representation learning, we propose a new text classification method, fastText, which trains and evaluates quickly while achieving accuracy close to the state of the art.

(1) Introduction

Text classification is an important task in Natural Language Processing with many applications, such as web search, information retrieval, ranking and document classification (Deerwester et al., 1990; Pang and Lee, 2008). Recently, models based on neural networks have become increasingly popular (Kim, 2014; Zhang and LeCun, 2015; Conneau et al., 2016). While these models achieve very good performance in practice, they tend to be relatively slow both at train and test time, limiting their use on very large datasets.

Linear models: Meanwhile, linear classifiers are often considered as strong baselines for text classification problems (Joachims, 1998; McCallum and Nigam, 1998; Fan et al., 2008). Despite their simplicity, they often obtain state-of-the-art performances if the right features are used (Wang and Manning, 2012). They also have the potential to scale to very large corpus (Agarwal et al., 2014).

Introducing fastText: In this work, we explore ways to scale these baselines to very large corpus with a large output space, in the context of text classification. Inspired by the recent work in efficient word representation learning (Mikolov et al., 2013; Levy et al., 2015), we show that linear models with a rank constraint and a fast loss approximation can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art. We evaluate the quality of our approach fastText on two different tasks, namely tag prediction and sentiment analysis.

(2) Model architecture

A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM (Joachims, 1998; Fan et al., 2008). However, linear classifiers do not share parameters among features and classes. This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices (Schutze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015).

Figure 1 shows a simple linear model with rank constraint. The first weight matrix A is a look-up table over the words. The word representations are then averaged into a text representation, which is in turn fed to a linear classifier. The text representation is a hidden variable which can potentially be reused. This architecture is similar to the cbow model of Mikolov et al. (2013), where the middle word is replaced by a label. We use the softmax function f to compute the probability distribution over the predefined classes. For a set of N documents, this leads to minimizing the negative log-likelihood over the classes:

$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(BAx_n)\big)$$

where x_n is the normalized bag of features of the n-th document, y_n the label, A and B the weight matrices. This model is trained asynchronously on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.
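To make the architecture concrete, the following numpy sketch implements the forward pass described above: look up rows of A, average them into a text representation, apply the linear classifier B, and take a softmax; the vocabulary size, dimensionality, random weights, and toy word indices are placeholders, not values from the paper.

```python
import numpy as np

# Sketch of the fastText classifier: average word embeddings (matrix A),
# project with a linear classifier (matrix B), then softmax over classes.
rng = np.random.default_rng(0)
vocab_size, dim, num_classes = 10_000, 10, 4
A = rng.normal(scale=0.1, size=(vocab_size, dim))   # word look-up table
B = rng.normal(scale=0.1, size=(num_classes, dim))  # linear classifier

def predict_proba(word_ids):
    text_repr = A[word_ids].mean(axis=0)            # hidden text representation
    logits = B @ text_repr
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    return exp / exp.sum()

def nll(word_ids, label):
    # Negative log-likelihood of the true class, the quantity minimized by SGD.
    return -np.log(predict_proba(word_ids)[label])

doc = [12, 345, 678, 9]                             # toy word indices
print(predict_proba(doc), nll(doc, label=2))
```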

IV. Subword Model

Uses character-level n-grams.

3 Model

In this section, we propose our model to learn word representations while taking into account morphology. We model morphology by considering subword units, and representing words by a sum of its character n-grams. We will begin by presenting the general framework that we use to train word vectors, then present our subword model and eventually describe how we handle the dictionary of character n-grams.

3.1 General model

We start by briefly reviewing the continuous skipgram model introduced by Mikolov et al. (2013b), from which our model is derived. Given a word vocabulary of size W, where a word is identified by its index w ∈ {1,...,W}, the goal is to learn a vectorial representation for each word w. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well words that appear in its context. More formally, given a large training corpus represented as a sequence of words w_1,...,w_T, the objective of the skipgram model is to maximize the following log-likelihood:

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$

where the context C_t is the set of indices of words surrounding word w_t. The probability of observing a context word w_c given w_t will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function s which maps pairs of (word, context) to scores in R.

One possible choice to define the probability of a context word is the softmax:

$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}$$

However, such a model is not adapted to our case as it implies that, given a word w_t, we only predict one context word w_c.

The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:

$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right)$$

where N_{t,c} is a set of negative examples sampled from the vocabulary.

A natural parameterization for the scoring function s between a word w_t and a context word w_c is to use word vectors. Let us define for each word w in the vocabulary two vectors u_w and v_w in R^d. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors u_{w_t} and v_{w_c}, corresponding, respectively, to words w_t and w_c. Then the score can be computed as the scalar product between word and context vectors as s(w_t, w_c) = u_{w_t}^T v_{w_c}. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
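A small numpy sketch of this scoring function and of the binary logistic loss with negative sampling described above; the vocabulary size, dimension, random vectors, and randomly sampled negatives are placeholders for illustration.

```python
import numpy as np

# Skipgram with negative sampling: score a (word, context) pair as a dot
# product of input/output vectors, and apply the binary logistic loss with
# the true context word as positive and sampled words as negatives.
rng = np.random.default_rng(0)
vocab_size, dim = 1_000, 10
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # input vectors u_w
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # output vectors v_w

def score(wt, wc):
    return U[wt] @ V[wc]

def neg_sampling_loss(wt, wc, negatives):
    # log(1 + exp(-s(wt, wc))) + sum over negatives of log(1 + exp(s(wt, n)))
    loss = np.log1p(np.exp(-score(wt, wc)))
    for n in negatives:
        loss += np.log1p(np.exp(score(wt, n)))
    return loss

negatives = rng.integers(0, vocab_size, size=5)     # random negative samples
print(neg_sampling_loss(wt=3, wc=42, negatives=negatives))
```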

3.2 Subword model

By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function s, in order to take this information into account.

Each word w is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams <wh, whe, her, ere, re> and the special sequence <where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater or equal to 3 and smaller or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
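The extraction step described above is easy to sketch; the function below adds the < and > boundary symbols, collects all n-grams for 3 <= n <= 6, and includes the bracketed word itself. The function name char_ngrams is an assumption for this example, while the boundary symbols and the length range follow the text.

```python
# Extract the bag of character n-grams for a word: wrap it in boundary
# symbols, take all n-grams for 3 <= n <= 6, and also keep the whole
# bracketed word as a special sequence.

def char_ngrams(word, min_n=3, max_n=6):
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)          # the special sequence for the word itself
    return grams

print(sorted(char_ngrams("where", max_n=3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```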

Suppose that you are given a dictionary of n-grams of size G. Given a word w, let us denote by G_w ⊂ {1,...,G} the set of n-grams appearing in w. We associate a vector representation z_g to each n-gram g. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c$$

This simple model allows sharing the representations across words, thus allowing to learn reliable representation for rare words.

In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant). We set K = 2·10^6 below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
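A minimal sketch of this hashing step, assuming the 32-bit FNV-1a variant and K = 2·10^6 buckets as stated in the text; the example n-grams fed to it are the n = 3 grams of where from the earlier example.

```python
# Bound memory by hashing character n-grams into K buckets with the
# 32-bit FNV-1a function; each bucket then owns one shared n-gram vector.
K = 2_000_000                                   # K = 2 * 10**6 as in the text
FNV_PRIME, FNV_OFFSET = 0x01000193, 0x811C9DC5  # 32-bit FNV-1a constants

def fnv1a_32(data: bytes) -> int:
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF        # stay within 32 bits
    return h

def ngram_bucket(ngram: str) -> int:
    return fnv1a_32(ngram.encode("utf-8")) % K

for g in ("<wh", "whe", "her", "ere", "re>", "<where>"):
    print(g, ngram_bucket(g))
```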

V. Experimental Results and Analysis

(1) Strong performance on multiple tasks

(2) Very fast while remaining accurate

VI. Contributions and Significance

(1) Contributions

fastText performs well on multiple tasks.

fastText is extremely fast while achieving good accuracy.

(2) Significance

Proposes a new text classification method, fastText, that classifies text quickly with good accuracy.

Proposes a new subword-based word-vector training method, also released as fastText, which partially alleviates the out-of-vocabulary (OOV) problem.

fastText has been open-sourced, so industry and academia can adopt it quickly (a usage sketch follows below).
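Since the library is open source, a typical supervised-classification workflow with the official fasttext Python package looks roughly like the sketch below; the file names and hyperparameter values are placeholders, and the training file is assumed to use the package's __label__ prefix format.

```python
import fasttext

# Assumed training file format (one example per line):
#   __label__positive the movie was great
#   __label__negative boring and far too long
model = fasttext.train_supervised(
    input="train.txt",       # placeholder path to a labeled training file
    epoch=5,
    lr=0.5,
    wordNgrams=2,            # add word bigram features
    loss="hs",               # hierarchical softmax for many classes
)

labels, probs = model.predict("the movie was great")
print(labels, probs)
model.save_model("fasttext_model.bin")
```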

(3) Pros and cons of fastText

Advantages:

1. Very fast, with reasonably good accuracy.

2. An open-source implementation is available, so it is easy to get started.

Disadvantages:

1. The model architecture is simple, so it is currently not the most accurate model.

2. Because it relies on the bag-of-words idea, the semantic and word-order information it captures is limited.

(4) Summary

Key points

Deep-learning text classification methods are accurate but relatively slow.

Machine-learning methods based on linear classifiers are reasonably accurate and fast, but require tedious feature engineering.

The fastText model

Innovations

Proposes a new text classification model: fastText.

Introduces tricks that make text classification faster and more accurate: hierarchical softmax and n-gram features.

Achieves results that are both fast and accurate on text classification and tag prediction.

Takeaways

Although these deep-learning models achieve very good accuracy, they are very slow at training and test time, which limits their use on large datasets.

However, linear classifiers do not share parameters across features and classes, which may limit their generalization for classes with only a few examples.

Most word-vector methods assign each word an independent vector without parameter sharing. In particular, they ignore the internal structure of words, which is especially important for morphologically rich languages.
