Title
Offline Policy Selection under Uncertainty
Authors
Abstract
The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose BayesDICE for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints (as opposed to an explicit likelihood, which is not available). Empirically, BayesDICE is highly competitive with existing state-of-the-art approaches in confidence interval estimation. More importantly, we show how the belief distribution estimated by BayesDICE may be used to rank policies with respect to an arbitrary downstream policy selection metric, and we empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower-bound value estimates.
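To illustrate the abstract's central point, the following is a minimal sketch (not the authors' implementation) of how access to a full belief distribution enables selection under different downstream metrics. The posterior samples here are drawn from hypothetical Gaussians purely for illustration; in the paper's setting they would come from the BayesDICE posterior. All names (`posterior_samples`, `rank_by`) are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples of each policy's value (illustrative
# Gaussians; a BayesDICE-style posterior would supply these in practice).
# pi_a: higher mean but wide uncertainty; pi_b: lower mean, tight posterior.
posterior_samples = {
    "pi_a": rng.normal(loc=1.0, scale=0.5, size=1000),
    "pi_b": rng.normal(loc=0.9, scale=0.1, size=1000),
}

def rank_by(metric, samples):
    """Rank policies (best first) by a scalar functional of the posterior."""
    return sorted(samples, key=lambda name: metric(samples[name]), reverse=True)

# Point-estimate selection: rank by posterior mean.
by_mean = rank_by(np.mean, posterior_samples)

# Risk-averse selection: rank by a high-confidence lower bound
# (5th percentile of the belief distribution).
by_lower_bound = rank_by(lambda s: np.quantile(s, 0.05), posterior_samples)
```

With these numbers the two metrics disagree: the mean prefers the wide-posterior `pi_a`, while the lower-bound metric prefers the safer `pi_b`, which is exactly why a point estimate alone cannot serve every downstream selection criterion.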