Paper Title

Evaluating Machine Common Sense via Cloze Testing

Authors

Ehsan Qasemi, Lee Kezar, Jay Pujara, Pedro Szekely

Abstract

Language models (LMs) show state-of-the-art performance on common sense (CS) question answering, but whether this ability implies a human-level mastery of CS remains an open question. Understanding the limitations and strengths of LMs can help researchers improve these models, potentially by developing novel ways of integrating external CS knowledge. We devise a series of tests and measurements to systematically quantify their performance on different aspects of CS. We propose the use of cloze testing combined with word embeddings to measure the LM's robustness and confidence. Our results show that although language models tend to achieve human-like accuracy, their confidence is subpar. Future work can leverage this information to build more complex systems, such as an ensemble of symbolic and distributed knowledge.
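
To make the cloze-testing idea concrete, the sketch below probes a masked language model with a common-sense fill-in-the-blank prompt and reads off the probability the model assigns to each candidate filler, one simple proxy for its confidence. This is a minimal illustration, not the paper's exact setup: the model (`bert-base-uncased`), the prompt, and the use of top-k token probabilities as the confidence signal are all assumptions for demonstration.

```python
# Minimal cloze-testing sketch, assuming a HuggingFace masked LM.
# The model, prompt, and confidence proxy are illustrative assumptions,
# not the paper's exact experimental setup.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# A common-sense cloze prompt: the LM must fill in the blanked-out word.
prompt = f"You would use an umbrella when it is {fill_mask.tokenizer.mask_token} outside."
predictions = fill_mask(prompt, top_k=5)

for p in predictions:
    # `score` is the softmax probability assigned to each filler token;
    # how sharply this mass concentrates on plausible answers serves here
    # as a simple stand-in for the model's confidence.
    print(f"{p['token_str']:>12}  p={p['score']:.3f}")
```

A high-accuracy but low-confidence model, in this framing, would rank a correct filler first while spreading most probability mass across implausible alternatives.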
