A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes

Paper Title

A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes

Paper Authors

Samiul Alam, Tahsin Reasat, Asif Shahriyar Sushmit, Sadi Mohammad Siddiquee, Fuad Rahman, Mahady Hasan, Ahmed Imtiaz Humayun

Paper Abstract

Latin has historically led the state of the art in handwritten optical character recognition (OCR) research. Adapting existing systems from Latin to alpha-syllabary languages is particularly challenging because of the sharp contrast between their orthographies. Segmenting the graphical constituents that correspond to characters becomes significantly harder because of the cursive writing system and the frequent use of diacritics in the alpha-syllabary family of languages. We propose a labeling scheme based on graphemes (linguistic segments of word formation) that makes segmentation inside alpha-syllabary words linear, and we present the first dataset of Bengali handwritten graphemes that are commonly used in an everyday context. The dataset contains 411k curated samples of 1295 unique, commonly used Bengali graphemes. Additionally, the test set contains 900 uncommon Bengali graphemes for out-of-dictionary performance evaluation. The dataset is open-sourced as part of a public Handwritten Grapheme Classification Challenge on Kaggle to benchmark vision algorithms for multi-target grapheme classification. The unique graphemes in this dataset were selected based on commonality in the Google Bengali ASR corpus. From the competition proceedings, we see that deep-learning methods can generalize to a large span of out-of-dictionary graphemes that are absent during training. The dataset and starter code are available at www.kaggle.com/c/bengaliai-cv19.
