Paper Title

On the Reliability of Test Collections for Evaluating Systems of Different Types

Paper Authors

Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Daniel Campos

Paper Abstract

As deep learning based models are increasingly being used for information retrieval (IR), a major challenge is to ensure the availability of test collections for measuring their quality. Test collections are generated based on pooling results of various retrieval systems, but until recently this did not include deep learning systems. This raises a major challenge for reusable evaluation: Since deep learning based models use external resources (e.g. word embeddings) and advanced representations as opposed to traditional methods that are mainly based on lexical similarity, they may return different types of relevant document that were not identified in the original pooling. If so, test collections constructed using traditional methods are likely to lead to biased and unfair evaluation results for deep learning (neural) systems. This paper uses simulated pooling to test the fairness and reusability of test collections, showing that pooling based on traditional systems only can lead to biased evaluation of deep learning systems.
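
The core methodological idea in the abstract, simulating a restricted judgment pool built only from "traditional" systems and checking whether a held-out "neural" system's score drops, can be sketched roughly as below. This is a minimal illustrative sketch under assumed conventions, not the authors' experimental code: the run format, pool depth, and the P@k metric are all assumptions made here for illustration.

```python
# Minimal sketch of a leave-systems-out pooling simulation (illustrative
# assumptions only): build a judgment pool from the top-k results of a
# subset of systems, then compare a held-out system's score with full
# judgments vs. with only pooled documents treated as judged.

from collections import defaultdict

def build_pool(runs, pool_depth=10):
    """Union of the top-`pool_depth` documents from each contributing run.

    `runs` maps system name -> {query_id: [doc_ids ranked best-first]}.
    """
    pool = defaultdict(set)
    for ranking_by_query in runs.values():
        for qid, ranking in ranking_by_query.items():
            pool[qid].update(ranking[:pool_depth])
    return pool

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def evaluate(run, qrels, pool=None, k=10):
    """Mean P@k over queries; if `pool` is given, relevant documents that
    were never pooled are treated as non-relevant (unjudged)."""
    scores = []
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        if pool is not None:
            relevant = relevant & pool.get(qid, set())
        scores.append(precision_at_k(ranking, relevant, k))
    return sum(scores) / len(scores)

def pooling_bias(traditional_runs, neural_run, qrels, pool_depth=10, k=10):
    """Score drop for the held-out run when judged only through the pool
    built from the traditional runs; a large gap suggests bias."""
    pool = build_pool(traditional_runs, pool_depth)
    full_score = evaluate(neural_run, qrels, pool=None, k=k)
    pooled_score = evaluate(neural_run, qrels, pool=pool, k=k)
    return full_score - pooled_score
```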
