Paper Title

Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

Paper Authors

Bang Liu, Haojie Wei, Di Niu, Haolan Chen, Yancheng He

Paper Abstract

The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot to keep the conversation flowing with a human. Existing question generation models are ineffective at generating a large amount of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims at automatically generating high-quality and diverse question-answer pairs from unlabeled text corpus at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples from the text multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions, leveraging the extracted assistive information; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and resort to voluntary human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in terms of the generation quality, while being scalable in the meantime. With models trained on a relatively smaller amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.
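The three-stage pipeline the abstract describes (information extractor → question generator → quality controller) can be sketched as a data flow. The paper's actual components are neural models; the heuristic extractor, template generator, and containment-based filter below are hypothetical toy stand-ins used only to show how answer/clue/style tuples move through the stages.

```python
# Toy sketch of the ACS-QG data flow; all three stages are hypothetical
# stand-ins for the neural models described in the paper.

def extract_acs(sentence):
    """Stage i (stand-in): sample (answer, clue, style) tuples from text.

    Here we crudely treat capitalized non-initial words as answer
    candidates and nearby context as the clue; the paper samples these
    with learned/heuristic extractors.
    """
    words = sentence.split()
    samples = []
    for i, w in enumerate(words):
        if w[:1].isupper() and i > 0:
            clue = " ".join(words[max(0, i - 3):i])
            samples.append({"answer": w, "clue": clue, "style": "what"})
    return samples

def generate_question(sentence, acs):
    """Stage ii (stand-in): generate a question conditioned on the
    answer, clue, and style; the paper uses neural generators."""
    q = f'{acs["style"].capitalize()} is mentioned alongside "{acs["clue"]}"?'
    return q, acs["answer"]

def quality_filter(sentence, question, answer):
    """Stage iii (stand-in): keep a pair only if the answer is supported
    by the source text; the paper uses a neural entailment model."""
    return answer in sentence

def acs_qg(sentence):
    """Run a sentence through all three stages, yielding QA pairs."""
    pairs = []
    for acs in extract_acs(sentence):
        q, a = generate_question(sentence, acs)
        if quality_filter(sentence, q, a):
            pairs.append((q, a))
    return pairs

pairs = acs_qg("The Eiffel Tower was completed in Paris in 1889.")
```

Because extraction is a sampling step, one input sentence yields several QA pairs, which is how the one-to-many nature of question generation translates into the 2.8 million pairs from one million Wikipedia sentences reported above.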
