Paper Title
Stop Measuring Calibration When Humans Disagree
Paper Authors
Paper Abstract
Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.
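The abstract names three statistical properties of human judgements that the proposed instance-level measures target: class frequency, ranking, and entropy. The sketch below is a rough illustration of that idea, not the paper's actual definitions: it compares a model's predictive distribution with the empirical distribution of human annotations on a single instance, using total variation distance for class frequency, Spearman correlation for ranking, and an absolute entropy difference. The function names and the specific choice of comparisons are assumptions.

```python
# Illustrative sketch only: per-instance comparisons between a model's
# predictive distribution and the human annotation distribution, along the
# three properties named in the abstract. Not the paper's exact measures.
import numpy as np
from scipy.stats import entropy, spearmanr


def human_distribution(labels, num_classes):
    """Empirical class distribution over a set of human annotations."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    return counts / counts.sum()


def instance_level_measures(model_probs, human_labels, num_classes=3):
    """Compare model and human distributions for one instance:
    class frequency (total variation distance), ranking (Spearman
    correlation), and entropy (absolute difference)."""
    p_model = np.asarray(model_probs, dtype=float)
    p_human = human_distribution(human_labels, num_classes)
    rank_corr, _ = spearmanr(p_model, p_human)
    return {
        "tvd": 0.5 * np.abs(p_model - p_human).sum(),
        "rank_corr": rank_corr,
        "entropy_gap": abs(entropy(p_model) - entropy(p_human)),
    }


# Example: a ChaosNLI-style item with 100 annotations over
# {0: entailment, 1: neutral, 2: contradiction}.
print(instance_level_measures(
    model_probs=[0.10, 0.55, 0.35],
    human_labels=[1] * 60 + [2] * 30 + [0] * 10,
))
```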