Paper title
Markov decision processes with observation costs: framework and computation with a penalty scheme
Paper authors
Paper abstract
We consider Markov decision processes where the state of the chain is only given at chosen observation times and at a cost. Optimal strategies involve the optimisation of observation times as well as the subsequent action values. We consider the finite horizon and discounted infinite horizon problems, as well as an extension with parameter uncertainty. By including the time elapsed since the last observation as part of the augmented Markov system, the value function satisfies a system of quasi-variational inequalities (QVIs). Such a class of QVIs can be seen as an extension to the interconnected obstacle problem. We prove a comparison principle for this class of QVIs, which implies uniqueness of solutions to our proposed problem. Penalty methods are then utilised to obtain arbitrarily accurate solutions. Finally, we perform numerical experiments on three applications that illustrate our framework.