Paper Title
Identification of Unexpected Decisions in Partially Observable Monte-Carlo Planning: a Rule-Based Approach
Paper Authors
Paper Abstract
Partially Observable Monte-Carlo Planning (POMCP) is a powerful online algorithm that generates approximate policies for large Partially Observable Markov Decision Processes (POMDPs). The online nature of this method supports scalability by avoiding a complete policy representation. The lack of an explicit representation, however, hinders interpretability. In this work, we propose a methodology based on Satisfiability Modulo Theories (SMT) for analyzing POMCP policies by inspecting their traces, namely the sequences of belief-action-observation triplets generated by the algorithm. The proposed method explores local properties of policy behavior to identify unexpected decisions. We propose an iterative process of trace analysis consisting of three main steps: i) the definition of a question by means of a parametric logical formula describing (probabilistic) relationships between beliefs and actions; ii) the generation of an answer by computing the parameters of the logical formula that maximize the number of satisfied clauses (solving a MAX-SMT problem); iii) the analysis of the generated logical formula and the related decision boundaries to identify unexpected decisions made by POMCP with respect to the original question. We evaluate our approach on Tiger, a standard benchmark for POMDPs, and on a real-world problem related to mobile robot navigation. Results show that the approach can exploit human knowledge of the domain, outperforming state-of-the-art anomaly detection methods in identifying unexpected decisions. In our tests, the Area Under Curve improved by up to 47%.
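The three steps of the trace analysis can be illustrated with a minimal Python sketch on a Tiger-like trace. This is not the authors' implementation: it assumes a single-parameter rule template ("the agent opens the left door if and only if its belief that the tiger is behind the right door is at least x"), it replaces the MAX-SMT solver with an exhaustive search over candidate thresholds, and the trace values and the names `fit_threshold` and `unexpected_steps` are hypothetical.

```python
from typing import List, Tuple

# One trace step: (belief that the tiger is behind the right door, action taken).
# Observations are omitted because the assumed rule template does not use them.
Step = Tuple[float, str]


def fit_threshold(trace: List[Step], action: str = "open_left") -> Tuple[float, int]:
    """Step ii): pick the threshold x that satisfies the most steps.

    A step satisfies the rule when (action taken == `action`) agrees with
    (belief >= x). Exhaustive search stands in for a MAX-SMT solver here.
    """
    # Only thresholds at observed belief values (or +inf) can change the score.
    candidates = sorted({belief for belief, _ in trace}) + [float("inf")]
    best_x, best_score = candidates[0], -1
    for x in candidates:
        score = sum((a == action) == (b >= x) for b, a in trace)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score


def unexpected_steps(trace: List[Step], x: float, action: str = "open_left") -> List[int]:
    """Step iii): indices of the steps that violate the fitted rule."""
    return [i for i, (b, a) in enumerate(trace) if (a == action) != (b >= x)]


# Step i): the question is the rule template above; the trace is made-up data.
trace: List[Step] = [
    (0.50, "listen"), (0.85, "listen"), (0.97, "open_left"),
    (0.50, "listen"), (0.15, "listen"), (0.85, "listen"),
    (0.97, "open_left"), (0.85, "open_left"), (0.97, "open_left"),
]

threshold, satisfied = fit_threshold(trace)
flagged = unexpected_steps(trace, threshold)
print(threshold, satisfied, flagged)
```

On this made-up trace the fitted threshold is 0.97, eight of the nine steps satisfy the rule, and step 7 (opening the left door at belief 0.85) is flagged as unexpected. An actual MAX-SMT formulation would express each step as a soft clause over real-valued parameters rather than enumerating thresholds.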