论文标题
关于自动机在模式匹配中的实际功能
On the Practical Power of Automata in Pattern Matching
论文作者
论文摘要
经典模式匹配的范例是寻求一个字符串的出现 - 在另一个字符串中 - 在另一个字符中,其中两个字符串都是从字母集$σ$中绘制的。假设文本长度为$ n $,并且图案长度为$ M $,则可以在时间$ O(NM)$中天真地解决此问题。在诺斯(Morris)和普拉特(Pratt)的1977年开创性论文中,开发了自动机,允许任何字母$ o(n)$解决此问题。 该自动机(我们将称为{\ em kmp-automaton},已证明对解决许多其他问题有用。一个值得注意的例子是{\ em参数化模式匹配}模型。在此模型中,比赛中允许从$σ$重命名的符号一致。事实证明,参数化匹配范式在软件工程,计算机视觉和其他应用程序中有用。 长期以来一直怀疑,对于符号均匀随机的文本,幼稚算法的性能和KMP算法的性能。在本文中,我们研究了KMP算法与随机生成的文本上的天真算法的实践效率。我们分析了各种参数下的时间,例如字母大小,图案长度以及文本中图案出现的分布。我们为原始的确切匹配问题和参数化匹配而这样做。尽管这些发现为确切的匹配案例证明了民间传说的智慧,但令人惊讶的是,在参数化匹配案例中,KMP算法的起作用速度要比NAIVE快得多。 我们检查了此假设的DNA文本,并观察到与随机文本相似的行为。我们还显示了一个非常结构化的情况,即自动机更有效。
The classical pattern matching paradigm is that of seeking occurrences of one string - the pattern, in another - the text, where both strings are drawn from an alphabet set $Σ$. Assuming the text length is $n$ and the pattern length is $m$, this problem can naively be solved in time $O(nm)$. In Knuth, Morris and Pratt's seminal paper of 1977, an automaton, was developed that allows solving this problem in time $O(n)$ for any alphabet. This automaton, which we will refer to as the {\em KMP-automaton}, has proven useful in solving many other problems. A notable example is the {\em parameterized pattern matching} model. In this model, a consistent renaming of symbols from $Σ$ is allowed in a match. The parameterized matching paradigm has proven useful in problems in software engineering, computer vision, and other applications. It has long been suspected that for texts where the symbols are uniformly random, the naive algorithm will perform as well as the KMP algorithm. In this paper we examine the practical efficiency of the KMP algorithm vs. the naive algorithm on a randomly generated text. We analyse the time under various parameters, such as alphabet size, pattern length, and the distribution of pattern occurrences in the text. We do this for both the original exact matching problem and parameterized matching. While the folklore wisdom is vindicated by these findings for the exact matching case, surprisingly, the KMP algorithm works significantly faster than the naive in the parameterized matching case. We check this hypothesis for DNA texts, and observe a similar behaviour as in the random text. We also show a very structured case where the automaton is much more efficient.