论文标题

在压缩图上查询:调查

Subpath Queries on Compressed Graphs: a Survey

论文作者

Prezza, Nicola

论文摘要

文本索引是一个经典的算法问题,已经研究了四十年:给定文本$ t $,将其预处理前处理,以便以后,我们可以快速计算并找到与查询长度成比例的$ t $中的任何字符串(查询模式)的发生。后缀树最早的最佳时间解决方案可以追溯到1973年,并且比纯文本需要多达两个数量级的空间才能存储。在2000年,两项突破性的作品表明,如果没有这个空间的空间,就可以实现高效的查询:将快速索引存储在与文本熵成正比的空间中。这些贡献对生物信息学产生了巨大影响:如今,几乎所有DNA对准器都采用压缩索引。最近的趋势考虑了更强大的压缩方案(字典压缩机)和标记图标记的问题的概括:毕竟,可以将文本视为标记为有向路径的文本。反过来,由于有限状态自动机可以被视为标记图的特定情况,因此这些发现在压缩索引和常规语言理论的领域之间创造了一个桥梁,最终允许索引普通语言,并有望为诸如正则表达匹配等问题提供新的启示。这项调查是对迷人旅程的主要地标的温和介绍,它使我们从后缀树到当今标有图形和普通语言的压缩索引。

Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text $T$, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in $T$ in time proportional to the query's length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index be stored in a space proportional to the text's entropy. These contributions had an enormous impact in bioinformatics: nowadays, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately allowing to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today's compressed indexes for labeled graphs and regular languages.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源