Paper Title

TuGeBiC: A Turkish German Bilingual Code-Switching Corpus

Authors

Jeanine Treffers-Daller and Ozlem Çetinoğlu

Abstract

In this paper we describe the process of collection, transcription, and annotation of recordings of spontaneous speech samples from Turkish-German bilinguals, and the compilation of a corpus called TuGeBiC. Participants in the study were adult Turkish-German bilinguals living in Germany or Turkey at the time of recording in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportions of words from each language. The corpus is roughly balanced between both languages. We also present quantitative information about the number of code-switches, and give examples of different types of code-switching found in the data. The resulting corpus has been made freely available to the research community.
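As a rough illustration of how token-level language tags can yield the word proportions and code-switch counts mentioned in the abstract, the following minimal sketch works on a list of per-token tags. The tag names (`TR`, `DE`, `OTHER`), the function names, and the definition of a switch (a change of language between consecutive language-tagged tokens, skipping `OTHER`) are assumptions made for this example, not the annotation scheme or counting procedure used for TuGeBiC.

```python
from collections import Counter

# Hypothetical tag set for illustration only: "TR" (Turkish), "DE" (German),
# and "OTHER" for anything not assigned to either language.
LANGUAGES = ("TR", "DE")


def language_proportions(tags):
    """Return the share of TR and DE tokens among language-tagged tokens."""
    counts = Counter(tag for tag in tags if tag in LANGUAGES)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: counts[lang] / total for lang in LANGUAGES}


def count_switches(tags):
    """Count language changes between consecutive language-tagged tokens.

    Tokens tagged "OTHER" are skipped, so TR OTHER DE counts as one switch.
    """
    switches = 0
    previous = None
    for tag in tags:
        if tag not in LANGUAGES:
            continue
        if previous is not None and tag != previous:
            switches += 1
        previous = tag
    return switches


# Toy tag sequence, not taken from the corpus.
example_tags = ["TR", "TR", "DE", "OTHER", "DE", "TR"]
proportions = language_proportions(example_tags)
switches = count_switches(example_tags)
```

For the toy sequence above, three of the five language-tagged tokens are `TR` and two are `DE`, and there are two switch points. A real analysis would also need to decide how to treat mixed-language words and utterance boundaries.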