Paper Title

Topic Modeling on Podcast Short-Text Metadata

Authors

Francisco B. Valero, Marion Baranes, Elena V. Epure

Abstract

Podcasts have emerged as massively consumed online content, notably due to wider accessibility of production means and scaled distribution through large streaming platforms. Categorization systems and information access technologies typically use topics as the primary way to organize or navigate podcast collections. However, annotating podcasts with topics is still quite problematic because the assigned editorial genres are broad, heterogeneous or misleading, or because of data challenges (e.g. short metadata text, noisy transcripts). Here, we assess the feasibility of discovering relevant topics from podcast metadata (titles and descriptions) using topic modeling techniques for short text. We also propose a new strategy to leverage named entities (NEs), often present in podcast metadata, in a Non-negative Matrix Factorization (NMF) topic modeling framework. Our experiments on two existing datasets, from Spotify and iTunes, and on Deezer, a new dataset from an online service providing a catalog of podcasts, show that our proposed document representation, NEiCE, leads to improved topic coherence over the baselines. We release the code for experimental reproducibility of the results.
