Interactive Extractive Search over Biomedical Corpora

Paper Title

Interactive Extractive Search over Biomedical Corpora

Authors

Hillel Taub-Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, Yoav Goldberg

Abstract

We present a system that allows life-science researchers to search a linguistically annotated corpus of scientific texts using patterns over dependency graphs, as well as patterns over token sequences and a powerful variant of boolean keyword queries. In contrast to previous attempts at dependency-based search, we introduce a lightweight query language that does not require the user to know the details of the underlying linguistic representations; instead, the user queries the corpus by providing an example sentence coupled with simple markup. Search is performed at interactive speed thanks to an efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development, and refinement of user queries. We demonstrate the system using example workflows over two corpora: the PubMed corpus, comprising 14,446,243 PubMed abstracts, and the CORD-19 dataset, a collection of over 45,000 research papers focused on COVID-19 research. The system is publicly available at https://allenai.github.io/spike.

Paper Title