Paper Title
Efficient Empowerment Estimation for Unsupervised Stabilization
Paper Authors
Paper Abstract
Intrinsically motivated artificial agents learn advantageous behavior without externally provided rewards. Previous work showed that maximizing the mutual information between an agent's actuators and its future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, a prototypical intrinsically motivated behavior for upright standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. Unfortunately, sample-based estimation of this kind of mutual information is challenging. Recently, several variational lower bounds (VLBs) on empowerment have been proposed as solutions; however, they are often biased, unstable in training, and have high sample complexity. In this work, we propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel, which allows us to efficiently calculate an unbiased estimator of empowerment by convex optimization. We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to existing VLB approaches. Specifically, we show that our method has lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images. Consequently, our method opens a path to wider and easier adoption of empowerment for various applications.
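The abstract does not spell out the convex optimization it refers to. As a rough illustration only, the sketch below computes the capacity of a linear Gaussian channel under a total power constraint with the standard water-filling solution, which is one well-known convex problem of this kind. The function name `water_filling`, its parameters, and the assumption that the channel is already summarized by its singular values are hypothetical and are not taken from the paper.

```python
import math


def water_filling(singular_values, power, noise_var=1.0):
    """Capacity (in nats) of parallel Gaussian sub-channels via water-filling.

    Assumption for illustration: the channel is linear with additive Gaussian
    noise of variance `noise_var`, and `singular_values` are the gains of its
    independent sub-channels. Returns (powers, capacity), with `powers`
    ordered from the strongest sub-channel to the weakest.
    """
    # Effective noise level of each sub-channel: noise_var / sigma_i^2.
    levels = sorted(noise_var / (s * s) for s in singular_values if s > 0)
    if not levels:
        return [], 0.0

    # Find the water level mu such that the allocated powers sum to `power`.
    # Try using the k strongest sub-channels, dropping the weakest one
    # whenever the water level would fall below its noise level.
    for k in range(len(levels), 0, -1):
        mu = (power + sum(levels[:k])) / k
        if mu > levels[k - 1]:
            break

    # Each active sub-channel is filled up to the common water level.
    powers = [max(0.0, mu - n) for n in levels]
    capacity = sum(0.5 * math.log(1.0 + p / n) for p, n in zip(powers, levels))
    return powers, capacity


if __name__ == "__main__":
    allocation, capacity = water_filling([2.0, 1.0], power=1.0)
    print(allocation, capacity)
```

In this sketch the power budget plays the role of a bound on the agent's actuation, and the returned capacity is the quantity that an empowerment estimate would correspond to under the stated linear Gaussian assumption.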