Paper Title

Cem Mil Podcasts: A Spoken Portuguese Document Corpus For Multi-modal, Multi-lingual and Multi-Dialect Information Access Research

Authors

Ekaterina Garmash, Edgar Tanaka, Ann Clifton, Joana Correia, Sharmistha Jat, Winstead Zhu, Rosie Jones, Jussi Karlgren

Abstract

In this paper we describe the Portuguese-language podcast dataset we have released for academic research purposes. We give an overview of how the data was sampled, descriptive statistics over the collection, and information about the distribution over Brazilian and Portuguese dialects. We give results from experiments on multi-lingual summarization, showing that summarizing podcast transcripts can be performed well by a system supporting both English and Portuguese. We also show experiments on Portuguese podcast genre classification using text metadata. Combining this collection with the previously released English-language collection opens up the potential for multi-modal, multi-lingual and multi-dialect podcast information access research.
