Paper title
Operator Splitting Value Iteration
Authors
Abstract
We introduce new planning and reinforcement learning algorithms for discounted MDPs that use an approximate model of the environment to accelerate the convergence of the value function. Inspired by splitting methods in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both policy evaluation and control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm called OS-Dyna. Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in the presence of model approximation error.
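The splitting idea can be illustrated with a minimal sketch for policy evaluation. The abstract does not spell out the update rule, so the code below assumes the standard matrix-splitting iteration from numerical linear algebra: the policy evaluation equation (I - γP)V = r is split as (I - γP̂) - γ(P - P̂), giving V ← (I - γP̂)⁻¹(r + γ(P - P̂)V). The three-state chain, the approximate model `P_hat`, and all function names are hypothetical and chosen only for illustration. This is not necessarily the authors' exact algorithm.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def matvec(P, v):
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def identity_minus(P, gamma):
    """Return the matrix I - gamma * P."""
    n = len(P)
    return [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
            for i in range(n)]

def vi_step(P, r, gamma, v):
    """Plain value iteration for policy evaluation: V <- r + gamma * P V."""
    Pv = matvec(P, v)
    return [r[i] + gamma * Pv[i] for i in range(len(r))]

def osvi_step(P, P_hat, r, gamma, v):
    """Assumed splitting update: V <- (I - gamma*P_hat)^{-1} (r + gamma*(P - P_hat) V)."""
    Pv, Phv = matvec(P, v), matvec(P_hat, v)
    rhs = [r[i] + gamma * (Pv[i] - Phv[i]) for i in range(len(r))]
    return solve(identity_minus(P_hat, gamma), rhs)

def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

# Hypothetical 3-state chain under a fixed policy, with a slightly wrong model.
gamma = 0.9
r = [1.0, 0.0, 0.0]
P = [[0.90, 0.10, 0.00], [0.00, 0.90, 0.10], [0.10, 0.00, 0.90]]
P_hat = [[0.88, 0.12, 0.00], [0.00, 0.88, 0.12], [0.12, 0.00, 0.88]]

v_true = solve(identity_minus(P, gamma), r)  # exact value function of the true model

v_vi = [0.0] * 3
v_os = [0.0] * 3
for _ in range(10):
    v_vi = vi_step(P, r, gamma, v_vi)
    v_os = osvi_step(P, P_hat, r, gamma, v_os)

err_vi = max_abs_diff(v_vi, v_true)
err_os = max_abs_diff(v_os, v_true)
print("VI error after 10 steps:   ", err_vi)
print("OS-VI error after 10 steps:", err_os)
```

In this sketch the fixed point of `osvi_step` is the value function of the true model `P`, not of `P_hat`, which mirrors the abstract's claim of converging to the correct value function despite model error. Each step needs a linear solve in the approximate model, so the per-iteration cost is higher than in plain value iteration.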