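Text similarity comparison comes up in many scenarios: the customer address matching done earlier for CMB, search, chatbot Q&A…
The Python solutions are all mature and the results are very satisfying. But once you have built it in Python, the notorious GIL (Global Interpreter Lock), slow looping, and so on make deployment expensive: you need very good machines to handle high concurrency.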
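Shan Ge noticed that the Java platform has a library called DL4J, which is also quite mature for NLP. It supports all kinds of neural networks, on CPU and on GPU via CUDA, and according to the benchmarks he read it even performs a bit better than TensorFlow. So Shan Ge was tempted. After all, Java's performance is rock solid! Especially after using JVisualVM to monitor and tune performance, Shan Ge no longer believes that any platform is currently better suited to backend services than the JVM.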
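Searching online for examples, I found that DL4J has no good example of text similarity comparison. What exists uses it to compute Word2Vec and then find word similarity… Er, that is nice, but it is still a hundred and eight thousand miles away from a real application! Fine! I have to do it myself!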
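Borrowing from the Python versions and after reading various NLP papers and materials, I settled on this solution:
- Train a TF-IDF model on the corpus
- Using this TF-IDF model, compute the vector of the target text and the vector of the source text
- Compute the cosine similarity of the two vectors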
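To make the three steps concrete before bringing in DL4J, here is a minimal, library-free sketch. Everything in it is an assumption for illustration only: `tokenize`, `TfIdfModel` and `cosineSimilarity` are hypothetical names, the tokenizer just splits on whitespace, and the weighting (term frequency divided by text length, times `log10(N / df)`) is one common TF-IDF variant, not necessarily what any particular library does.

```kotlin
import kotlin.math.log10
import kotlin.math.sqrt

// Hypothetical helper: naive whitespace tokenizer (a real tokenizer may behave differently).
fun tokenize(text: String): List<String> =
    text.split(Regex("\\s+")).filter { it.isNotEmpty() }

// Step 1: "train" on the corpus, i.e. record the vocabulary and each term's document frequency.
class TfIdfModel(corpus: List<String>) {
    private val docCount = corpus.size
    private val docFreq: Map<String, Int> =
        corpus.flatMap { tokenize(it).distinct() }.groupingBy { it }.eachCount()
    private val vocabulary: List<String> = docFreq.keys.toList()

    // Step 2: map a text to a vector with one TF-IDF weight per vocabulary term.
    // Terms that are not in the vocabulary get no entry, but still count toward the text length.
    fun transform(text: String): DoubleArray {
        val tokens = tokenize(text)
        if (tokens.isEmpty()) return DoubleArray(vocabulary.size)
        val counts = tokens.groupingBy { it }.eachCount()
        return DoubleArray(vocabulary.size) { i ->
            val term = vocabulary[i]
            val tf = (counts[term] ?: 0).toDouble() / tokens.size
            val idf = log10(docCount.toDouble() / docFreq.getValue(term))
            tf * idf
        }
    }
}

// Step 3: cosine similarity of two vectors; returns 0.0 if either vector is all zeros.
fun cosineSimilarity(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "vectors must have the same length" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denominator = sqrt(normA) * sqrt(normB)
    return if (denominator == 0.0) 0.0 else dot / denominator
}

fun main() {
    val corpus = listOf(
        "HSAC Software Development (Guangdong) Ltd",
        "HSAC holdings plc",
        "Citi Bank",
        "HSAC Software Development (IN) Ltd"
    )
    val model = TfIdfModel(corpus)
    val target = model.transform("HSAC Software Development (Guangdong) Limited")

    // Rank every corpus line by its similarity to the target text, best match first.
    corpus.map { it to cosineSimilarity(target, model.transform(it)) }
        .sortedByDescending { it.second }
        .forEach { (line, sim) -> println("$sim  $line") }
}
```

The point of the sketch is only to show that the whole pipeline is counting plus a little arithmetic, which is why a CPU is plenty. For real use, a proper tokenizer and a tested library are the safer choice.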
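TF-IDF is rather magical: it needs no neural network, yet it works especially well in some scenarios, which is why Google's search algorithm is said to use it too. And since there is no heavy neural network computation, a CPU is enough. Truly "cheap and good, we have always treasured it! Big brother buys it for his wife, and she could not be happier…" Ahem!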
OK, a pile of chatter never beats a few lines of code. Here we go…
Maven
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>${nd4j.version}</version>
</dependency>
<!-- Core DL4J functionality -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-nlp</artifactId>
    <version>${dl4j.version}</version>
</dependency>
Java (actually Kotlin)
The example uses the default English tokenizer; for a Chinese tokenizer, see here
package com.example.demo

import org.deeplearning4j.bagofwords.vectorizer.TfidfVectorizer
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory
import org.junit.Test
import org.nd4j.linalg.ops.transforms.Transforms
import org.slf4j.LoggerFactory
import java.util.*
import java.util.concurrent.atomic.AtomicInteger

class TextSimilarityControllerTests {
    val logger = LoggerFactory.getLogger(TextSimilarityControllerTests::class.java)

    @Test
    @Throws(Exception::class)
    fun testTfIdfVectorizer() {
        // The corpus that the TF-IDF model is fitted on
        val rawLines = Arrays.asList("HSAC Software Development (Guangdong) Ltd",
                "HSAC holdings plc",
                "Citi Bank",
                "HSAC Software Development (IN) Ltd")
        val iter = CollectionSentenceIterator(rawLines)
        // DefaultTokenizerFactory is for English. For Chinese, implement your own
        // factory on top of a Chinese tokenizer, for example Ansj.
        val tokenizerFactory = DefaultTokenizerFactory()
        val vectorizer = TfidfVectorizer.Builder()
                .setMinWordFrequency(1)
                .setStopWords(ArrayList())
                .setTokenizerFactory(tokenizerFactory)
                .setIterator(iter)
                .build()
        vectorizer.fit()

        // Vector of the target text, computed with the fitted model
        val vector = vectorizer.transform("HSAC Software Development (Guangdong) Limited")
        logger.info("TF-IDF vector: " + Arrays.toString(vector.data().asDouble()))

        // Compare the similarity, sort descending, then pick the top 2 and print them
        val counter = AtomicInteger(1)
        rawLines.parallelStream().map { line ->
            Pair(Transforms.cosineSim(vector, vectorizer.transform(line)), line)
        }.sorted { o1, o2 ->
            // Descending order
            o2.first.compareTo(o1.first)
        }.limit(2).forEachOrdered {
            logger.info("\n" +
                    "Here comes ${counter.getAndIncrement().ordinal()} result of Top 2:" +
                    "\n" +
                    "line '${it.second}' with sim: ${it.first}")
        }
    }
}

/**
 * To ordinalize the number, e.g. 1 -> "1st", 2 -> "2nd", 11 -> "11th".
 */
fun Number.ordinal(): String {
    val suffix = arrayOf("th", "st", "nd", "rd", "th", "th", "th", "th", "th", "th")
    val m = this.toInt() % 100
    // 4..20 all take "th"; otherwise the last digit picks the suffix
    return this.toString() + suffix[if (m in 4..20) 0 else m % 10]
}
Output. (A similarity of 1 means 100% similar; the closer to 1, the more similar.)
21:37:46.306 [main] INFO com.example.demo.TextSimilarityControllerTests - TF-IDF vector: [0.02498774789273739, 0.06020599976181984, 0.0, 0.06020599976181984, 0.0, 0.0, 0.0, 0.12041199952363968, 0.0, 0.0]
21:37:46.340 [ForkJoinPool.commonPool-worker-19] INFO com.example.demo.TextSimilarityControllerTests -Here comes 1st result of Top 2:
line 'HSAC Software Development (Guangdong) Ltd' with sim: 0.9276711940765381
21:37:46.340 [ForkJoinPool.commonPool-worker-19] INFO com.example.demo.TextSimilarityControllerTests -Here comes 2nd result of Top 2:
line 'HSAC Software Development (IN) Ltd' with sim: 0.32648345828056335
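
As a sanity check, the non-zero entries of the logged TF-IDF vector can be reproduced by hand. The sketch below assumes the weight of a term is `(count / tokens in the text) * log10(corpus size / document frequency)` and that the text splits into five whitespace-separated tokens. Both assumptions are inferred from the numbers in the log above, not taken from the DL4J documentation, and `tfIdfWeight` is a hypothetical helper name.

```kotlin
import kotlin.math.log10

// Assumed weighting, inferred from the logged vector (not from DL4J's documentation):
// weight = (termCount / tokensInText) * log10(corpusSize / docFreq)
fun tfIdfWeight(termCount: Int, tokensInText: Int, corpusSize: Int, docFreq: Int): Double =
    (termCount.toDouble() / tokensInText) * log10(corpusSize.toDouble() / docFreq)

fun main() {
    val corpusSize = 4     // four lines in rawLines
    val tokensInQuery = 5  // "HSAC", "Software", "Development", "(Guangdong)", "Limited"

    // How many corpus lines contain each query term. "Limited" appears in none,
    // so it has no vocabulary entry and contributes nothing to the vector.
    val docFreq = linkedMapOf(
        "HSAC" to 3,
        "Software" to 2,
        "Development" to 2,
        "(Guangdong)" to 1
    )

    for ((term, df) in docFreq) {
        println("$term -> ${tfIdfWeight(1, tokensInQuery, corpusSize, df)}")
    }
    // Roughly 0.02499 for HSAC, 0.06021 for Software and Development,
    // and 0.12041 for (Guangdong), in line with the logged vector.
}
```

The rarer a term is across the corpus, the larger its weight: "(Guangdong)" appears in only one line, so it dominates the vector, which is why the Guangdong entry ranks far ahead of the IN one.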
Oh yeah~ Mission accomplished…