Paper Title

Probabilistic Random Indexing for Continuous Event Detection

Paper Authors

Yashank Singh, Niladri Chatterjee

Paper Abstract

This paper explores a novel variant of Random Indexing (RI) representations for encoding language data, with a view to using them in dynamic scenarios where events happen continuously. Because the size of the representations under the general method of one-hot encoding grows linearly with the size of the vocabulary, they do not scale to online use with high volumes of dynamic data. Existing pre-trained embedding models, on the other hand, are not suitable for detecting new events because of the dynamic nature of the text data. The present work addresses this issue by imposing a probability distribution on the number of randomized entries, which leads to a class of RI representations. It also provides a rigorous analysis of how well these representations encode semantic information, in terms of the probability of orthogonality. Building on these ideas, we propose an algorithm that is log-linear in the size of the vocabulary to track the semantic relationship of a query word to other words, in order to suggest the events relevant to the word in question. We ran simulations using the proposed algorithm on tweet data specific to three different events and present our findings. The proposed probabilistic RI representations are found to be much faster and more scalable than Bag of Words (BoW) embeddings while maintaining accuracy in depicting semantic relationships.
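To make the idea of a randomized number of entries concrete, here is a minimal sketch of Random Indexing in Python. It is not the authors' algorithm: the abstract does not state which probability distribution is imposed, so the uniform choice on 1..max_nonzero, the window-based context accumulation, and every function and parameter name below (index_vector, build_context_vectors, orthogonality_rate, max_nonzero) are assumptions made for illustration only.

```python
import math
import random


def index_vector(dim, rng, max_nonzero=8):
    """Sparse ternary index vector stored as {position: +1 or -1}."""
    # Assumption: the number of nonzero entries is drawn uniformly from
    # 1..max_nonzero. The paper's actual distribution is not given here.
    k = rng.randint(1, max_nonzero)
    positions = rng.sample(range(dim), k)
    return {p: rng.choice((-1, 1)) for p in positions}


def dot(u, v):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(p, 0) for p, x in u.items())


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def build_context_vectors(sentences, dim=512, window=2, seed=0):
    """Sum the index vectors of neighbouring words into each word's context vector."""
    rng = random.Random(seed)
    index, context = {}, {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            ctx = context.setdefault(word, {})
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbor = tokens[j]
                if neighbor not in index:
                    # New words get an index vector on first sight, so the
                    # dimension stays fixed as the vocabulary grows.
                    index[neighbor] = index_vector(dim, rng)
                for p, sign in index[neighbor].items():
                    ctx[p] = ctx.get(p, 0) + sign
    return index, context


def orthogonality_rate(dim, trials=2000, seed=1):
    """Empirical fraction of random index-vector pairs with zero dot product."""
    rng = random.Random(seed)
    hits = sum(
        dot(index_vector(dim, rng), index_vector(dim, rng)) == 0
        for _ in range(trials)
    )
    return hits / trials
```

Under these assumptions, ranking the other words by cosine(context[query], context[word]) gives a rough picture of which words are semantically close to a query word, and orthogonality_rate(dim) gives an empirical counterpart to the probability of orthogonality that the abstract says is analysed rigorously.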
