Paper Title

Task-Agnostic Learning to Accomplish New Tasks

Authors

Zhang, Xianqi, Wang, Xingtao, Liu, Xu, Wang, Wenrui, Fan, Xiaopeng, Zhao, Debin

Abstract

Reinforcement Learning (RL) and Imitation Learning (IL) have made great progress in robotic decision-making in recent years. However, these methods deteriorate markedly on new tasks that must be completed through new combinations of actions. RL methods suffer from reward function design and distribution shift, while IL methods are limited by expert demonstrations that do not cover new tasks. In contrast, humans can easily complete such tasks using the fragmented knowledge learned from task-agnostic experience. Inspired by this observation, this paper proposes a task-agnostic learning method (TAL for short) that learns fragmented knowledge only from task-agnostic data to accomplish new tasks. TAL consists of four stages. First, task-agnostic exploration is performed to collect data from interactions with the environment, and the collected data is organized via a knowledge graph. Second, an action feature extractor is proposed and trained on the collected knowledge graph data for task-agnostic fragmented knowledge learning. Third, a candidate action generator is designed, which applies the action feature extractor to a new task to generate multiple candidate action sets. Finally, an action proposal network is designed to produce probabilities for actions in a new task according to the environmental information. These probabilities are then used to generate order information for selecting actions to execute from the candidate action sets, forming the plan. Experiments on a virtual indoor scene show that the proposed method outperforms state-of-the-art offline RL methods and IL methods by more than 20%.
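The final stage described above (using per-action probabilities to pick and order actions from candidate sets) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the candidate sets, action names, and the scoring rule (pick the highest-probability action per set, then order the selections by descending probability) are all assumptions made for the example.

```python
def form_plan(candidate_sets, action_probs):
    """Illustrative plan formation (hypothetical, not TAL's actual network).

    candidate_sets: list of sets of action names (from a candidate generator).
    action_probs: dict mapping action name -> probability from a proposal
    network; missing actions default to probability 0.0.
    """
    plan = []
    for candidates in candidate_sets:
        # From each candidate set, keep the action the proposal scores highest.
        best = max(candidates, key=lambda a: action_probs.get(a, 0.0))
        plan.append(best)
    # Use the probabilities as order information for execution.
    plan.sort(key=lambda a: action_probs.get(a, 0.0), reverse=True)
    return plan

# Toy example with made-up action names and probabilities.
probs = {"walk_kitchen": 0.95, "open_fridge": 0.9, "grab_milk": 0.7}
sets = [{"walk_kitchen", "open_fridge"}, {"grab_milk"}]
print(form_plan(sets, probs))  # ['walk_kitchen', 'grab_milk']
```

In this sketch the proposal probabilities play both roles the abstract mentions: selecting which action to execute from each candidate set, and ordering the selected actions into a plan.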
