Paper Title
An Assessment of the Impact of OCR Noise on Language Models
Paper Authors
Paper Abstract
Neural language models are the backbone of modern-day natural language processing applications. Their use on textual heritage collections which have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We perform an assessment of the impact OCR noise has on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality decreases. In the presence of small corpora, simpler models including PPMI and Word2Vec consistently outperform transformer-based models in this respect.