Paper Title
Robust Policy Optimization in Deep Reinforcement Learning
Paper Authors
Paper Abstract
The policy gradient method enjoys the simplicity of its objective: the agent directly optimizes the cumulative reward. Moreover, in continuous action domains, a parameterized action distribution allows easy control of exploration through the variance of the distribution. Entropy can play an essential role in policy optimization by favoring stochastic policies, which ultimately helps the agent explore the environment better in reinforcement learning (RL). However, stochasticity often decreases as training progresses, and the policy becomes less exploratory. Additionally, certain parametric distributions may work only for some environments and require extensive hyperparameter tuning. This paper aims to mitigate these issues. In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution. We hypothesize that our method encourages high-entropy actions and provides a better representation of the action space. We further provide empirical evidence to verify this hypothesis. We evaluated our method on various continuous control tasks from DeepMind Control, OpenAI Gym, PyBullet, and IsaacGym. We observed that in many settings, RPO increases the policy entropy early in training and then maintains a certain level of entropy throughout the training period. Ultimately, our agent RPO shows consistently improved performance compared to PPO and other techniques: entropy regularization, different distributions, and data augmentation. Furthermore, in several settings, our method remains robust in performance, while the other baseline mechanisms fail to improve it or even make it worse.
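To make the "perturbed distribution" idea concrete, below is a minimal sketch, assuming the perturbation takes the form of uniform random noise added to the mean of a diagonal-Gaussian policy before sampling; the abstract does not spell out this form, and the class name `GaussianPolicy`, the network sizes, and the hyperparameter `alpha` are illustrative choices, not details taken from the paper.

```python
# Minimal sketch (PyTorch) of sampling actions from a perturbed Gaussian policy.
# Assumption: the perturbation is modeled as z ~ Uniform(-alpha, alpha) added to the
# mean of the action distribution; `alpha` is a hypothetical hyperparameter.
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy head for continuous control."""

    def __init__(self, obs_dim: int, act_dim: int, alpha: float = 0.5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.alpha = alpha  # width of the uniform perturbation (assumed form)

    def dist(self, obs: torch.Tensor, perturb: bool = True) -> Normal:
        mean = self.backbone(obs)
        if perturb:
            # Shift the mean by freshly sampled uniform noise, so the effective
            # action distribution stays broader (higher entropy) than the
            # unperturbed Gaussian throughout training.
            z = torch.empty_like(mean).uniform_(-self.alpha, self.alpha)
            mean = mean + z
        return Normal(mean, self.log_std.exp())


if __name__ == "__main__":
    policy = GaussianPolicy(obs_dim=8, act_dim=2, alpha=0.5)
    obs = torch.randn(4, 8)                     # a batch of 4 observations
    dist = policy.dist(obs)                     # perturbed distribution
    actions = dist.sample()
    log_probs = dist.log_prob(actions).sum(-1)  # usable in a PPO-style ratio
    print(actions.shape, log_probs.shape)
```

The same policy head can be dropped into a standard PPO loop; setting `perturb=False` (or `alpha=0`) recovers the plain Gaussian policy, which makes the contribution of the perturbation easy to ablate.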