Generating Word and Document Embeddings for Sentiment Analysis

Paper title

Generating Word and Document Embeddings for Sentiment Analysis

Authors

Aydın, Cem Rıfkı; Güngör, Tunga; Erkan, Ali

Abstract

Sentiments of words differ from one corpus to another. Inducing general sentiment lexicons for languages and using them cannot, in general, produce meaningful results for different domains. In this paper, we combine contextual and supervised information with the general semantic representations of words occurring in the dictionary. Contexts of words help us capture the domain-specific information, and supervised scores of words are indicative of the polarities of those words. When we combine supervised features of words with the features extracted from their dictionary definitions, we observe an increase in the success rates. We try out the combinations of contextual, supervised, and dictionary-based approaches, and generate original vectors. We also combine the word2vec approach with hand-crafted features. We induce domain-specific sentiment vectors for two Turkish corpora: a movie-domain dataset and a Twitter dataset. When we thereafter generate document vectors and employ the support vector machines method utilising those vectors, our approaches perform better than the baseline studies for Turkish by a significant margin. We evaluated our models on two English corpora as well, and these also outperformed the word2vec approach. This shows that our approaches are cross-domain and portable to other languages.
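
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of combining word embeddings with supervised polarity scores to build document vectors. It is not the authors' implementation. The smoothed log-ratio score, the choice of minimum/mean/maximum polarity as hand-crafted features, the function names, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math
from collections import Counter


def polarity_scores(docs, labels, smoothing=1.0):
    """Assumed supervised score: smoothed log-ratio of a word's relative
    frequency in positive (label 1) versus negative (label 0) documents."""
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (pos if label == 1 else neg).update(tokens)
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocab)
    neg_total = sum(neg.values()) + smoothing * len(vocab)
    return {
        w: math.log(((pos[w] + smoothing) / pos_total)
                    / ((neg[w] + smoothing) / neg_total))
        for w in vocab
    }


def document_vector(tokens, embeddings, scores, dim):
    """Average the embeddings of known words, then append the minimum,
    mean, and maximum polarity score of the document's words."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        mean_vec = [sum(col) / len(known) for col in zip(*known)]
    else:
        mean_vec = [0.0] * dim
    pol = [scores.get(t, 0.0) for t in tokens] or [0.0]
    return mean_vec + [min(pol), sum(pol) / len(pol), max(pol)]


# Toy labelled corpus and hypothetical 2-dimensional embeddings.
docs = [["great", "film"], ["bad", "film"]]
labels = [1, 0]
embeddings = {"great": [1.0, 0.0], "bad": [-1.0, 0.0], "film": [0.0, 1.0]}

scores = polarity_scores(docs, labels)
vec = document_vector(docs[0], embeddings, scores, dim=2)
```

In a fuller pipeline, vectors like `vec` would be computed for every document and passed to a support vector machine classifier; that step is omitted here to keep the sketch dependency-free.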
