Paper Title

Improving Screening Processes via Calibrated Subset Selection

Authors

Lequn Wang, Thorsten Joachims, Manuel Gomez Rodriguez

Abstract

Many selection processes such as finding patients qualifying for a medical trial or retrieval pipelines in search engines consist of multiple stages, where an initial screening stage focuses the resources on shortlisting the most promising candidates. In this paper, we investigate what guarantees a screening classifier can provide, independently of whether it is constructed manually or trained. We find that current solutions do not enjoy distribution-free theoretical guarantees -- we show that, in general, even for a perfectly calibrated classifier, there always exist specific pools of candidates for which its shortlist is suboptimal. Then, we develop a distribution-free screening algorithm -- called Calibrated Subset Selection (CSS) -- that, given any classifier and some amount of calibration data, finds near-optimal shortlists of candidates that contain a desired number of qualified candidates in expectation. Moreover, we show that a variant of CSS that calibrates a given classifier multiple times across specific groups can create shortlists with provable diversity guarantees. Experiments on US Census survey data validate our theoretical results and show that the shortlists provided by our algorithm are superior to those provided by several competitive baselines.
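
The abstract gives only a high-level description of CSS, so the following is a rough illustration of the general idea (choosing a score threshold from held-out calibration data so that a shortlist is expected to contain enough qualified candidates), not the authors' algorithm. The function names, the parameters `k`, `m`, and `alpha`, and the DKW-style slack term are all assumptions made for this sketch; the paper's exact bound and procedure may differ.

```python
import math

def select_threshold(scores, labels, k, m, alpha=0.1):
    """Return the largest score threshold whose pessimistic estimate of
    qualified candidates per pool of size m is still at least k.

    scores, labels: calibration data (classifier scores and 0/1 qualification).
    All names and the slack term below are illustrative assumptions.
    """
    n = len(scores)
    # Assumed DKW-style slack; shrinks as the calibration set grows.
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    target = k / m + eps
    # Try the distinct calibration scores as thresholds, highest first.
    for t in sorted(set(scores), reverse=True):
        # Fraction of calibration candidates that are qualified AND score >= t.
        frac = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1) / n
        if frac >= target:
            return t
    return None  # no threshold meets the target with this much data

def shortlist(pool_scores, threshold):
    """Indices of pool candidates whose score reaches the threshold."""
    return [i for i, s in enumerate(pool_scores) if s >= threshold]

# Toy calibration set: 1000 candidates, qualified iff score >= 0.5.
base = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
cal_scores = base * 100
cal_labels = [1 if s >= 0.5 else 0 for s in cal_scores]

threshold = select_threshold(cal_scores, cal_labels, k=2, m=10)
picked = shortlist([0.95, 0.65, 0.7, 0.1], threshold)
```

In this sketch a smaller calibration set yields a larger slack, which pushes the threshold down (longer shortlists) or leaves no feasible threshold at all; that is one way to picture the trade-off between the amount of calibration data and shortlist size.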
