Efficient tree-structured categorical retrieval

Paper title

Efficient tree-structured categorical retrieval

Authors

Djamal Belazzougui, Gregory Kucherov

Abstract

We study a document retrieval problem in a new framework where $D$ text documents are organized in a \emph{category tree} with a pre-defined number $h$ of categories. This situation occurs, for example, with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (level in the category tree), we wish to efficiently retrieve the $t$ \emph{categorical units} containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses $n(\log\sigma(1+o(1))+\log D+O(h))+O(\Delta)$ bits of space and $O(|p|+t)$ query time, where $n$ is the total length of the documents, $\sigma$ is the size of the alphabet used in the documents, and $\Delta$ is the total number of nodes in the category tree. Another solution uses $n(\log\sigma(1+o(1))+O(\log D))+O(\Delta)+O(D\log n)$ bits of space and $O(|p|+t\log D)$ query time. Finally, we propose other solutions that are more space-efficient at the expense of a slight increase in query time.
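To make the query in the abstract concrete, the sketch below answers it by brute force on toy data. It is not the authors' method: it scans every document, so its running time grows with the total text length $n$ rather than matching the $O(|p|+t)$ bound of the indexed solutions. The data, the function name `categorical_units`, and the reading of a "categorical unit" as the ancestor of a document at the queried level are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy collection: each document stores the path of category
# nodes from the top level (level 0) downwards, plus the document text.
DOCS: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "doc1": (("Animalia", "Chordata", "Mammalia"), "ACGTACGT"),
    "doc2": (("Animalia", "Chordata", "Aves"), "TTACGG"),
    "doc3": (("Animalia", "Arthropoda", "Insecta"), "GGGTTT"),
    "doc4": (("Plantae", "Tracheophyta", "Magnoliopsida"), "CCACGA"),
}


def categorical_units(
    docs: Dict[str, Tuple[Tuple[str, ...], str]],
    pattern: str,
    level: int,
) -> List[Tuple[str, ...]]:
    """Return the distinct units at `level` that have a document containing `pattern`.

    Assumption: a unit is identified by the path prefix of length level + 1,
    so nodes that share a name in different branches stay distinct.
    """
    units = set()
    for path, text in docs.values():
        # Naive substring scan; an index-based solution would avoid this.
        if level < len(path) and pattern in text:
            units.add(path[: level + 1])
    return sorted(units)


if __name__ == "__main__":
    # Units at level 1 with at least one document containing "ACG".
    print(categorical_units(DOCS, "ACG", 1))
```

In this toy collection, `doc1` and `doc2` both contain `ACG` and fall under the same level-1 node, so that unit is reported once. The number of reported units plays the role of $t$ in the abstract, and it can be much smaller than the number of matching documents.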
