Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern

Paper Title

Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern

Authors

Sifat, Md. Habibur Rahman; Rahman, Chowdhury Rafeed; Rafsan, Mohammad; Rahman, Md. Hasibur

Abstract

When writing Bengali on an English keyboard, users often make spelling mistakes. The accuracy of any Bengali spell checker or paragraph correction module depends largely on the error dataset it is built on, and generating such a dataset manually is cumbersome. In this research, we present an algorithm that automatically generates misspelled Bengali words from correct words by analyzing Bengali writing patterns on a QWERTY-layout English keyboard. As part of our analysis, we have compiled a list of the most commonly used Bengali words, phonetically similar replaceable clusters, frequently mispressed replaceable clusters, frequently mispressed insertion-prone clusters, and some rules for handling Juktakkhar (consonant letter clusters) when generating errors.
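
The abstract names three error sources (phonetic cluster replacement, mispressed keys, and mispressed insertions) without listing the clusters themselves. The Python sketch below illustrates how such a generator might be organized over romanized keystrokes. It is not the authors' algorithm: the cluster table, the partial key-adjacency map, the function names, and the sample word "bhasha" are all assumptions made for illustration, and Juktakkhar handling is left out entirely.

```python
# Illustrative sketch only. The tables below are hypothetical stand-ins,
# not the cluster lists compiled in the paper.

# Assumed phonetically similar keystroke clusters (members are interchangeable).
PHONETIC_CLUSTERS = [("s", "sh"), ("i", "ee"), ("u", "oo")]

# Partial QWERTY adjacency: keys physically next to each listed key.
QWERTY_NEIGHBOURS = {
    "a": "qwsz",
    "s": "awedxz",
    "i": "ujko",
    "o": "iklp",
    "u": "yhji",
    "k": "jiolm",
    "h": "gyujnb",
    "b": "vghn",
}


def phonetic_variants(word):
    """Swap one cluster member for another member of the same cluster."""
    out = set()
    for cluster in PHONETIC_CLUSTERS:
        for src in cluster:
            start = word.find(src)
            while start != -1:
                # Skip a short member that is only the prefix of a longer
                # member at this position (e.g. the "s" inside "sh").
                shadowed = any(
                    len(m) > len(src) and word.startswith(m, start)
                    for m in cluster
                )
                if not shadowed:
                    for dst in cluster:
                        if dst != src:
                            out.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    out.discard(word)
    return out


def mispress_variants(word):
    """Replace one key with a physically adjacent key."""
    out = set()
    for i, ch in enumerate(word):
        for near in QWERTY_NEIGHBOURS.get(ch, ""):
            out.add(word[:i] + near + word[i + 1:])
    return out


def insertion_variants(word):
    """Insert an adjacent key right after the intended key."""
    out = set()
    for i, ch in enumerate(word):
        for near in QWERTY_NEIGHBOURS.get(ch, ""):
            out.add(word[:i + 1] + near + word[i + 1:])
    return out


def generate_errors(word):
    """Collect every single-edit misspelling from the three error sources."""
    return phonetic_variants(word) | mispress_variants(word) | insertion_variants(word)
```

Calling generate_errors("bhasha") returns a set of single-edit keystroke variants of the input. A real dataset generator would presumably sample from such candidates according to how often each error type occurs and then map the keystrokes to Bengali script, but those steps depend on details the abstract does not give.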
