开源一个文本分析项目

栏目: 软件资讯 · 发布时间: 7年前

内容简介:开源一个文本分析项目

Github

https://github.com/sea-boat/TextAnalyzer

TextAnalyzer

a text analizer that can analyze text. so far, it can extract hot words in a text segment by using tf-idf algorithm,at the same time using a score factor to optimize the final score.

also it provides machine learning to make a classification.

Features

extracting hot words from a text.

  1. to gather statistics via frequence.
  2. to gather statistics via by tf-idf algorithm
  3. to gather statistics via a score factor additionally.

synonym can be recognized

SVM Classificator

this analyzer supports to classify text by svm. it involves vectoring the text. we can train the samples and then make a classification by the model.

for convenience,the model,tfidf and vector will be stored.

kmeans clustering && xmeans clustering

this analyzer supports to clustering text by kmeans and xmeans.

vsm clustering

this analyzer supports to clustering text by vsm.

Dependence

https://github.com/sea-boat/IKAnalyzer-Mirror.git

TODO

  • other ml algorithms.
  • emotion analization.

How to use

just simple like this

extracting hot words

  1. indexing a document and get a docId.
long docId = TextIndexer.index(text);
  1. extracting by docId.
HotWordExtractor extractor = new HotWordExtractor();
 List<Result> list = extractor.extract(0, 20, false);
 if (list != null) for (Result s : list)
    System.out.println(s.getTerm() + " : " + s.getFrequency() + " : " + s.getScore());

a result contains term,frequency and score.

失业证 : 1 : 0.31436604
户口 : 1 : 0.30099702
单位 : 1 : 0.29152703
提取 : 1 : 0.27927202
领取 : 1 : 0.27581802
职工 : 1 : 0.27381304
劳动 : 1 : 0.27370203
关系 : 1 : 0.27080503
本市 : 1 : 0.27080503
终止 : 1 : 0.27080503

SVM classificator

  1. training the samples.
SVMTrainer trainer = new SVMTrainer();
trainer.train();
  1. predicting text classification.
double[] data = trainer.getWordVector(text);
trainer.predict(data);

kmeans clustering && xmeans clustering

List<String> list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);

vsm clustering

List<String> list = DataReader.readContent(VSMCluster.DATA_FILE);
List<String> labels = new VSMCluster().learn(list);

==========广告时间==========

鄙人的新书《Tomcat内核设计剖析》已经在京东预售了,有需要的朋友可以到 https://item.jd.com/12185360.html 进行预定。感谢各位朋友。

=========================

欢迎关注:

开源一个文本分析项目

以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

High Performance Python

High Performance Python

Micha Gorelick、Ian Ozsvald / O'Reilly Media / 2014-9-10 / USD 39.99

If you're an experienced Python programmer, High Performance Python will guide you through the various routes of code optimization. You'll learn how to use smarter algorithms and leverage peripheral t......一起来看看 《High Performance Python》 这本书的介绍吧!

HTML 压缩/解压工具
HTML 压缩/解压工具

在线压缩/解压 HTML 代码

JSON 在线解析
JSON 在线解析

在线 JSON 格式化工具

HEX CMYK 转换工具
HEX CMYK 转换工具

HEX CMYK 互转工具