Paper Title

Evaluation of Question Answering Systems: Complexity of judging a natural language

Authors

Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, Frank Emmert-Streib

Abstract

Question answering (QA) systems are among the most important and most rapidly developing research topics in natural language processing (NLP). One reason for this is that a QA system allows humans to interact more naturally with a machine, e.g., via a virtual assistant or a search engine. In recent decades, many QA systems have been proposed to address the requirements of different question-answering tasks. Furthermore, many error scores have been introduced, e.g., based on n-gram matching, word embeddings, or contextual embeddings, to measure the performance of a QA system. This survey attempts to provide a systematic overview of the general framework of QA, QA paradigms, benchmark datasets, and assessment techniques for the quantitative evaluation of QA systems. The latter is particularly important because not only is the construction of a QA system complex, but so is its evaluation. We hypothesize that one reason for this is that the quantitative formalization of human judgment remains an open problem.
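As a minimal sketch of the n-gram-matching family of error scores the abstract mentions, the snippet below computes a token-overlap (unigram) F1 between a predicted answer and a reference answer, in the spirit of common extractive-QA evaluation scripts. The function names and the simplified normalization are illustrative assumptions for this sketch, not code from the paper.

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and split on whitespace; real QA evaluators typically
    # also strip punctuation and articles, omitted here for brevity.
    return text.lower().split()

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1: a simple unigram-matching score for QA answers."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a partially correct answer receives partial credit.
print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.57
```

Such surface-overlap scores are cheap and reproducible, but, as the survey argues, they only approximate human judgment: a semantically correct answer phrased with different words can score zero, which motivates the embedding-based alternatives also discussed in the paper.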
