Summary: open-sourcing a text analysis project
GitHub
https://github.com/sea-boat/TextAnalyzer
TextAnalyzer
A text analyzer. So far it can extract hot words from a text segment using the TF-IDF algorithm, with an additional score factor applied to optimize the final score.
It also provides machine learning for text classification.
Features
Extracting hot words from a text
- gathering statistics by term frequency.
- gathering statistics by the TF-IDF algorithm.
- additionally applying a score factor to the statistics.
Synonyms can be recognized.
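The three scoring modes above can be sketched as follows. This is a minimal illustration, not the project's code: the class `TfIdfSketch`, the smoothed IDF formula, and the per-term multiplier used as the "score factor" are all assumptions, since the README does not define how the factor is computed. Documents are assumed to be already tokenized.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of TextAnalyzer.
class TfIdfSketch {

    // Term frequency: occurrences of each term divided by the document length.
    static Map<String, Double> termFrequency(List<String> doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / doc.size());
        }
        return tf;
    }

    // Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    // The smoothing is an assumption; other IDF variants exist.
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    // Final score = tf * idf * factor. Terms without an entry in
    // `factors` (a hypothetical per-term weight map) keep factor 1.0.
    static Map<String, Double> score(List<String> doc,
                                     List<List<String>> corpus,
                                     Map<String, Double> factors) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : termFrequency(doc).entrySet()) {
            double factor = factors.getOrDefault(e.getKey(), 1.0);
            scores.put(e.getKey(), e.getValue() * idf(e.getKey(), corpus) * factor);
        }
        return scores;
    }
}
```

With an empty factor map the result is plain TF-IDF, so a term that appears in every document scores lower than an equally frequent term that appears in only one.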
SVM classifier
The analyzer supports classifying text with an SVM, which involves vectorizing the text. We train on the samples and then classify new text with the resulting model.
For convenience, the model, the TF-IDF data, and the vectors are stored.
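To illustrate what vectorizing the text means here, the sketch below maps tokenized documents onto fixed-length numeric vectors, which is the input shape an SVM needs. It is an assumption-laden example: `TextVectorizerSketch`, `fit`, and `transform` are hypothetical names, and it uses length-normalized term counts rather than the project's stored TF-IDF values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of TextAnalyzer.
class TextVectorizerSketch {

    // Maps each term seen during fitting to a fixed vector dimension.
    private final Map<String, Integer> index = new LinkedHashMap<>();

    // Builds the vocabulary from the training documents.
    void fit(List<List<String>> docs) {
        for (List<String> doc : docs) {
            for (String term : doc) {
                index.putIfAbsent(term, index.size());
            }
        }
    }

    // Turns one document into a vector of length-normalized term counts.
    // Terms that were not seen during fitting are ignored.
    double[] transform(List<String> doc) {
        double[] vector = new double[index.size()];
        for (String term : doc) {
            Integer i = index.get(term);
            if (i != null) {
                vector[i] += 1.0;
            }
        }
        if (!doc.isEmpty()) {
            for (int i = 0; i < vector.length; i++) {
                vector[i] /= doc.size();
            }
        }
        return vector;
    }
}
```

Because the vocabulary is fixed after fitting, every document, whether a training sample or new text to classify, yields a vector of the same length.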
kmeans clustering && xmeans clustering
The analyzer supports clustering text with kmeans and xmeans.
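As a rough picture of the kmeans step, here is a minimal sketch of Lloyd's algorithm over already vectorized texts. It is not the project's implementation: `KMeansSketch` is a hypothetical name, the first `k` points are used as initial centroids to keep the example deterministic, and xmeans (which also chooses the number of clusters) is not covered.

```java
import java.util.Arrays;

// Hypothetical sketch; not part of TextAnalyzer.
class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns a cluster label for each point.
    // Assumes points is non-empty and 1 <= k <= points.length.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // deterministic initialization
        }
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no label moved
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return labels;
    }
}
```

The returned `int[]` has the same shape as the labels returned by `KMeansCluster.learn` in the usage section, one label per input text, though the real class takes raw strings and vectorizes them itself.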
vsm clustering
The analyzer supports clustering text with a vector space model (VSM).
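In a vector space model, texts are compared by the angle between their term vectors. The sketch below shows cosine similarity and one simple way to group vectors with it. The single-pass threshold grouping is an assumption chosen for brevity, and `VsmSketch` and its methods are hypothetical; the README does not say how the project's VSM clustering assigns labels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of TextAnalyzer.
class VsmSketch {

    // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Single-pass grouping: a vector joins the first cluster whose seed
    // (its first member) is at least `threshold` similar; otherwise it
    // starts a new cluster and becomes that cluster's seed.
    static int[] cluster(double[][] vectors, double threshold) {
        List<double[]> seeds = new ArrayList<>();
        int[] labels = new int[vectors.length];
        for (int i = 0; i < vectors.length; i++) {
            int label = -1;
            for (int c = 0; c < seeds.size(); c++) {
                if (cosine(vectors[i], seeds.get(c)) >= threshold) {
                    label = c;
                    break;
                }
            }
            if (label < 0) {
                seeds.add(vectors[i]);
                label = seeds.size() - 1;
            }
            labels[i] = label;
        }
        return labels;
    }
}
```

Unlike kmeans, this approach does not need the number of clusters up front; the similarity threshold decides how many groups emerge.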
Dependency
https://github.com/sea-boat/IKAnalyzer-Mirror.git
TODO
- other ML algorithms.
- sentiment analysis.
How to use
Usage is as simple as this:
extracting hot words
- indexing a document to get a docId.
long docId = TextIndexer.index(text);
- extracting by docId.
HotWordExtractor extractor = new HotWordExtractor();
List<Result> list = extractor.extract(0, 20, false);
if (list != null)
    for (Result s : list)
        System.out.println(s.getTerm() + " : " + s.getFrequency() + " : " + s.getScore());
Each result contains a term, its frequency, and its score.
失业证 : 1 : 0.31436604
户口 : 1 : 0.30099702
单位 : 1 : 0.29152703
提取 : 1 : 0.27927202
领取 : 1 : 0.27581802
职工 : 1 : 0.27381304
劳动 : 1 : 0.27370203
关系 : 1 : 0.27080503
本市 : 1 : 0.27080503
终止 : 1 : 0.27080503
SVM classifier
- training on the samples.
SVMTrainer trainer = new SVMTrainer();
trainer.train();
- predicting the class of a text.
double[] data = trainer.getWordVector(text);
trainer.predict(data);
kmeans clustering && xmeans clustering
List<String> list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);
vsm clustering
List<String> list = DataReader.readContent(VSMCluster.DATA_FILE);
List<String> labels = new VSMCluster().learn(list);