Paper Title
Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean and Variance
Authors
Abstract
This paper investigates the optimization problem of an infinite-horizon discrete-time Markov decision process (MDP) with a long-run average metric that considers both the mean and the variance of rewards. Such a performance metric is important because the mean indicates average returns and the variance indicates risk or fairness. However, since the variance metric couples the rewards across all stages, traditional dynamic programming is inapplicable, as the principle of time consistency fails. We study this problem from a new perspective called the sensitivity-based optimization theory. A performance difference formula is derived that quantifies the difference in the mean-variance combined metric of an MDP under any two policies. This difference formula can be used to generate new policies with strictly improved mean-variance performance. A necessary condition for the optimal policy and the optimality of deterministic policies are derived. We further develop an iterative algorithm in the form of policy iteration, which is proved to converge to local optima in both the mixed and randomized policy spaces. In particular, when the mean reward is constant across policies, the algorithm is guaranteed to converge to the global optimum. Finally, we apply our approach to study the fluctuation reduction of wind power in an energy storage system, which demonstrates the potential applicability of our optimization method.
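Below is a minimal sketch of the kind of mean-variance policy iteration the abstract describes, assuming a small finite MDP with known transitions and rewards. The MDP data (num_states, num_actions, P, r), the risk weight beta, and the variance-penalized pseudo-reward construction are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical small MDP; beta weights the variance penalty in the combined metric.
num_states, num_actions, beta = 3, 2, 0.5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(num_states), size=(num_actions, num_states))  # P[a, s, s']
r = rng.uniform(0.0, 1.0, size=(num_states, num_actions))               # r[s, a]

def stationary(P_pi):
    """Stationary distribution d solving d = d P_pi with sum(d) = 1."""
    A = np.vstack([P_pi.T - np.eye(num_states), np.ones(num_states)])
    b = np.append(np.zeros(num_states), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def evaluate(pi):
    """Long-run mean, variance, and combined metric (mean - beta * variance) under pi."""
    P_pi = P[pi, np.arange(num_states)]            # P_pi[s, s'] = P[pi[s], s, s']
    r_pi = r[np.arange(num_states), pi]
    d = stationary(P_pi)
    mean = d @ r_pi
    var = d @ (r_pi - mean) ** 2
    return mean, var, mean - beta * var, P_pi, d

def improve(pi):
    """One improvement step on the variance-penalized pseudo-reward f(s, a)."""
    mean, _, _, P_pi, d = evaluate(pi)
    f = r - beta * (r - mean) ** 2                 # pseudo-reward for the combined metric
    f_pi = f[np.arange(num_states), pi]
    eta_f = d @ f_pi
    # Performance potential g from the Poisson equation (I - P_pi + 1 d^T) g = f_pi - eta_f.
    g = np.linalg.solve(np.eye(num_states) - P_pi + np.outer(np.ones(num_states), d),
                        f_pi - eta_f)
    # Greedy step: choose the action maximizing f(s, a) + sum_s' P(s'|s, a) g(s').
    q = f + np.einsum('asn,n->sa', P, g)
    return q.argmax(axis=1)

pi = np.zeros(num_states, dtype=int)
for _ in range(50):
    new_pi = improve(pi)
    if np.array_equal(new_pi, pi):
        break
    pi = new_pi
print("policy:", pi, "mean, variance, combined:", evaluate(pi)[:3])
```

As in the abstract, each iteration strictly improves the combined metric until a locally optimal deterministic policy is reached; the pseudo-reward here is only one plausible way to realize the difference-formula-based improvement step.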