Paper Title

Reinforcement Learning with Action-Free Pre-Training from Videos

Authors

Younggyo Seo, Kimin Lee, Stephen James, Pieter Abbeel

Abstract

Recent unsupervised pre-training methods have shown to be effective on language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate if such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL). To this end, we introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations for efficiently learning action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model. Moreover, for better exploration, we propose a video-based intrinsic bonus that leverages pre-trained representations. We demonstrate that our framework significantly improves both final performances and sample-efficiency of vision-based RL in a variety of manipulation and locomotion tasks. Code is available at https://github.com/younggyoseo/apv.
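The abstract describes a two-phase design: an action-free latent video prediction model is pre-trained on videos, and an action-conditional latent prediction model is then stacked on top of it during fine-tuning. Below is a minimal PyTorch sketch of that stacking idea under simplified assumptions; all class and variable names (`ActionFreeLatentModel`, `ActionConditionalModel`, `latent_dim`, and so on) are illustrative placeholders, not the authors' actual architecture, which is available at the linked repository.

```python
# A minimal sketch of the two-phase stacking idea from the abstract.
# Names and shapes are illustrative assumptions, not the APV codebase.
import torch
import torch.nn as nn

class ActionFreeLatentModel(nn.Module):
    """Phase 1: predicts the next latent state from the current one alone
    (no actions); trainable on raw video, e.g. via a reconstruction loss."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.transition = nn.GRUCell(latent_dim, latent_dim)

    def forward(self, z):
        # Action-free latent transition: z_t -> z_{t+1} prediction.
        return self.transition(z, z)

class ActionConditionalModel(nn.Module):
    """Phase 2: stacked on top of the pre-trained action-free model,
    it refines the predicted latent using the agent's action."""
    def __init__(self, latent_dim=256, action_dim=6):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Linear(latent_dim + action_dim, latent_dim),
            nn.ELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, z_action_free, action):
        # Condition the action-free prediction on the chosen action.
        return self.refine(torch.cat([z_action_free, action], dim=-1))

# Fine-tuning: the lower model is initialized from video pre-training,
# while the action-conditional head is learned on the target environment.
action_free = ActionFreeLatentModel()   # weights from action-free pre-training
action_cond = ActionConditionalModel()  # trained from scratch
z = torch.randn(8, 256)                 # batch of latent states
a = torch.randn(8, 6)                   # batch of actions
z_next = action_cond(action_free(z), a) # stacked prediction
```

The point of the stacking, as the abstract states it, is that fine-tuning can reuse the dynamics knowledge captured from action-free videos while only the action-dependent part of the transition must be learned on the new environment.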
