Title
Efficient tree-structured categorical retrieval
Authors
Abstract
We study a document retrieval problem in a new framework where $D$ text documents are organized in a {\em category tree} with a pre-defined number $h$ of categories. This situation occurs, e.g., with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (a level in the category tree), we wish to efficiently retrieve the $t$ \emph{categorical units} containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses $n(\log\sigma(1+o(1))+\log D+O(h)) + O(\Delta)$ bits of space and $O(|p|+t)$ query time, where $n$ is the total length of the documents, $\sigma$ is the size of the alphabet used in the documents, and $\Delta$ is the total number of nodes in the category tree. Another solution uses $n(\log\sigma(1+o(1))+O(\log D))+O(\Delta)+O(D\log n)$ bits of space and $O(|p|+t\log D)$ query time. Finally, we propose other solutions that are more space-efficient at the expense of a slight increase in query time.
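To make the query concrete, the following is a minimal brute-force sketch of the retrieval problem as described in the abstract. It is not the paper's method: it scans every document and so runs in time proportional to $n$, whereas the proposed indexes answer in $O(|p|+t)$ or $O(|p|+t\log D)$ time. It assumes that documents are attached to nodes of the category tree, that the root is at level 0, and that the categorical unit of a document at a given level is its ancestor at that level. All names (`path_from_root`, `categorical_retrieval`, the toy tree) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set


def path_from_root(parent: Dict[str, Optional[str]], node: str) -> List[str]:
    """Return the list of nodes from the root down to `node` (inclusive)."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    path.reverse()
    return path


def categorical_retrieval(
    docs: Dict[str, str],
    doc_node: Dict[str, str],
    parent: Dict[str, Optional[str]],
    pattern: str,
    level: int,
) -> Set[str]:
    """Naive baseline: report each level-`level` node that has at least one
    document below it containing `pattern`. Each unit is reported once."""
    units: Set[str] = set()
    for doc_id, text in docs.items():
        if pattern in text:
            path = path_from_root(parent, doc_node[doc_id])
            # Assumption: documents attached above the requested level
            # do not belong to any unit at that level and are skipped.
            if level < len(path):
                units.add(path[level])
    return units


# Toy category tree (hypothetical data): root -> {animals, plants} -> leaves.
parent = {
    "root": None,
    "animals": "root",
    "plants": "root",
    "mammals": "animals",
    "birds": "animals",
    "ferns": "plants",
}
docs = {"d1": "the cat sat", "d2": "a catbird sings", "d3": "green fronds"}
doc_node = {"d1": "mammals", "d2": "birds", "d3": "ferns"}

print(sorted(categorical_retrieval(docs, doc_node, parent, "cat", 1)))
print(sorted(categorical_retrieval(docs, doc_node, parent, "cat", 2)))
```

In this toy instance, two documents contain the pattern `cat`, but at level 1 they fall under the same categorical unit, so only one unit is reported. This is why the query time in the abstract is stated in terms of the number $t$ of reported units rather than the number of matching documents or pattern occurrences.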