Paper Title
Reliable Editions from Unreliable Components: Estimating Ebooks from Print Editions Using Profile Hidden Markov Models
Paper Authors
Paper Abstract
A profile hidden Markov model, a popular model in biological sequence analysis, can be used to model related sequences of characters transcribed from books, magazines, and other printed materials. This paper documents one application of a profile HMM: automatically producing an ebook edition from distinct print editions. The resulting ebook has virtually all the desired properties found in a publisher-prepared ebook, including accurate transcription and an absence of print artifacts such as end-of-line hyphenation and running headers. The technique, which has particular benefits for readers and libraries that require books in an accessible format, is demonstrated using seven copies of a nineteenth-century novel.
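The abstract does not spell out the estimation procedure, so the following is only a minimal sketch of the general idea behind profile HMMs, not the paper's method. It assumes the transcriptions have already been aligned into equal-length strings, with `~` as a hypothetical gap symbol. Columns where most copies have a character are treated as match states (a common heuristic in biological sequence analysis), emission probabilities are estimated from smoothed counts, and the most probable character per match state is emitted. The names `match_columns`, `emission_probs`, and `consensus` are illustrative, and the toy data is invented.

```python
from collections import Counter

GAP = "~"  # hypothetical gap symbol used in the pre-aligned input


def match_columns(alignment, max_gap_fraction=0.5):
    """Return indices of columns treated as match states.

    A column is a match state when at most `max_gap_fraction` of the
    sequences have a gap there; other columns are treated as insertions.
    """
    n = len(alignment)
    width = len(alignment[0])
    cols = []
    for j in range(width):
        gaps = sum(1 for seq in alignment if seq[j] == GAP)
        if gaps / n <= max_gap_fraction:
            cols.append(j)
    return cols


def emission_probs(alignment, col, alphabet, pseudocount=1.0):
    """Estimate emission probabilities for one column with add-k smoothing."""
    counts = Counter(seq[col] for seq in alignment if seq[col] != GAP)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {ch: (counts[ch] + pseudocount) / total for ch in alphabet}


def consensus(alignment, pseudocount=1.0):
    """Emit the most probable character for each match column."""
    alphabet = sorted({ch for seq in alignment for ch in seq if ch != GAP})
    out = []
    for col in match_columns(alignment):
        probs = emission_probs(alignment, col, alphabet, pseudocount)
        out.append(max(probs, key=probs.get))
    return "".join(out)


# Toy aligned transcriptions of one word from five copies (invented data):
# one copy keeps an end-of-line hyphen, two contain OCR-style errors.
copies = [
    "reli~able",
    "reli-able",
    "reli~ab1e",
    "rcli~able",
    "reli~able",
]
print(consensus(copies))  # prints: reliable
```

In this toy case the hyphen column is dropped because four of five copies have a gap there, and the isolated misreadings are outvoted. A full profile HMM would also model transitions into insert and delete states and would compute the alignment itself rather than take it as input.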