Paper Title

Reward Shaping Using Convolutional Neural Network

Authors

Hani Sami, Hadi Otrok, Jamal Bentahar, Azzam Mourad, Ernesto Damiani

Abstract

In this paper, we propose Value Iteration Network for Reward Shaping (VIN-RS), a potential-based reward shaping mechanism using a Convolutional Neural Network (CNN). The proposed VIN-RS embeds a CNN trained on labels computed using the message passing mechanism of the Hidden Markov Model. The CNN processes images or graphs of the environment to predict the shaping values. Recent work on reward shaping still faces limitations in training on a representation of the Markov Decision Process (MDP) and in building an estimate of the transition matrix. The advantage of VIN-RS is that it constructs an effective potential function from an estimated MDP while automatically inferring the environment transition matrix. The proposed VIN-RS estimates the transition matrix through a self-learned convolution filter while extracting environment details from the input frames or sampled graphs. Due to (1) the previous success of message passing for reward shaping and (2) the planning behavior of CNNs, we use these messages to train the CNN of VIN-RS. Experiments are performed on tabular games, Atari 2600, and MuJoCo, covering both discrete and continuous action spaces. Our results illustrate promising improvements in learning speed and maximum cumulative reward compared to the state of the art.
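The core idea the abstract describes, namely estimating a transition model as a learned convolution filter and performing value-iteration-style planning with it, can be sketched on a small grid MDP. The function below is a hypothetical illustration, not the authors' implementation: each 3x3 kernel stands in for the local transition probabilities of one action (the weights VIN-RS would learn), and the Bellman backup q_a(s) = r(s) + gamma * sum_{s'} P(s'|s,a) v(s') becomes a per-neighborhood correlation followed by a max over actions.

```python
import numpy as np

def value_iteration_as_convolution(reward, kernels, iterations=50, gamma=0.95):
    """VIN-style value iteration on a grid MDP (illustrative sketch).

    reward  : (H, W) array of per-cell rewards
    kernels : (A, 3, 3) array; each 3x3 filter plays the role of the
              transition probabilities of one action over the 3x3
              neighborhood (in VIN-RS these would be learned weights)
    """
    H, W = reward.shape
    v = np.zeros((H, W))
    for _ in range(iterations):
        padded = np.pad(v, 1)  # zero-pad the borders of the value map
        q = np.empty((len(kernels), H, W))
        for a, k in enumerate(kernels):
            for i in range(H):
                for j in range(W):
                    # Bellman backup: expected next value under action a
                    q[a, i, j] = reward[i, j] + gamma * np.sum(
                        k * padded[i:i + 3, j:j + 3])
        v = q.max(axis=0)  # greedy max over actions
    return v
```

With a single reward source on the grid, the computed values decay smoothly with distance from it, which is the kind of dense signal a potential-based shaping function provides. In a real VIN the inner loops would be a single batched convolution and the kernels would be trained end to end rather than fixed by hand.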
