Paper title
TopicModel4J: A Java Package for Topic Models
Paper authors
Abstract
Topic models provide a flexible and principled framework for exploring hidden structure in high-dimensional co-occurrence data and are commonly used in natural language processing (NLP) of text. In this paper, we design and implement a Java package, TopicModel4J, which contains 13 representative algorithms for fitting topic models. TopicModel4J provides an easy-to-use interface in the Java programming environment that lets data analysts run the algorithms and easily input and output data. In addition, the package provides a few preprocessing techniques for unstructured text, such as splitting textual data into words, lowercasing the words, performing lemmatization, and removing useless characters, URLs, and stop words.
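The preprocessing steps listed in the abstract can be illustrated with a minimal plain-Java sketch. This is not TopicModel4J's API: the class name `PreprocessSketch`, the method `preprocess`, and the tiny stop-word list are hypothetical and chosen only for illustration. Lemmatization is left out because it typically requires an external NLP library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; names here are hypothetical, not TopicModel4J classes.
public class PreprocessSketch {

    // Hypothetical, deliberately tiny stop-word list for demonstration.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "and", "or", "of", "for", "in", "is", "are", "to");

    public static List<String> preprocess(String text) {
        // Remove URLs first so that their fragments do not survive as tokens.
        String noUrls = text.replaceAll("https?://\\S+", " ");
        // Lowercase, then replace every character that is neither a letter nor whitespace.
        String cleaned = noUrls.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\s]", " ");

        // Split on whitespace and drop empty tokens and stop words.
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "Topic models are FUN, see https://example.com for the details!";
        System.out.println(preprocess(doc));
    }
}
```

The URL pattern is applied before punctuation is stripped; in the opposite order the URL would be broken into word-like pieces that pass through as ordinary tokens.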