Paper Title
Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces
Paper Authors
Paper Abstract
Despite the wealth of research into provably efficient reinforcement learning algorithms, most works focus on tabular representations and thus struggle to handle exponentially or infinitely large state-action spaces. In this paper, we consider episodic reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric that characterizes the proximity between different states and actions. We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space by zooming in on more promising and frequently visited regions while carefully balancing the exploitation-exploration trade-off. We show that ZoomRL achieves a worst-case regret $\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space with respect to the metric. Moreover, our algorithm enjoys improved metric-dependent guarantees that reflect the geometry of the underlying space. Finally, we show that our algorithm is robust to small misspecification errors.
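To make the abstract's "zooming" idea concrete, the following is a minimal toy sketch of adaptive discretization in the style of continuous (Lipschitz) bandits, not the ZoomRL algorithm itself: a ball of the metric space is split into finer balls once its visit count exceeds roughly 1/radius^2, so promising and frequently visited regions receive a finer discretization. All names, thresholds, and the 1-D reward function are illustrative assumptions, not taken from the paper.

```python
import math
import random

# Hypothetical toy sketch (NOT the paper's ZoomRL algorithm): bandit-style
# "zooming" on the interval [0, 1], where a ball is refined once it has been
# visited about 1/radius^2 times.

class Ball:
    """A ball of the (here 1-D) metric space with running statistics."""

    def __init__(self, center, radius):
        self.center = center
        self.radius = radius
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, bonus_scale=1.0):
        # Optimistic index: empirical mean + confidence bonus + radius term
        # (the radius accounts for the discretization error inside the ball).
        if self.visits == 0:
            return float("inf")
        mean = self.value_sum / self.visits
        return mean + bonus_scale / math.sqrt(self.visits) + self.radius


def update_and_maybe_split(balls, ball, value):
    """Record an observation and refine the ball once it is visited often enough."""
    ball.visits += 1
    ball.value_sum += value
    # Zooming rule: split when visits exceed ~ 1 / radius^2.
    if ball.visits >= 1.0 / (ball.radius ** 2):
        half = ball.radius / 2.0
        balls.remove(ball)
        balls.append(Ball(ball.center - half, half))
        balls.append(Ball(ball.center + half, half))


if __name__ == "__main__":
    random.seed(0)
    balls = [Ball(0.5, 0.5)]                  # one ball covering [0, 1]
    reward = lambda x: 1.0 - abs(x - 0.7)     # toy Lipschitz reward, peaked at 0.7
    for _ in range(2000):
        chosen = max(balls, key=lambda b: b.ucb())             # optimistic selection
        noisy = reward(chosen.center) + random.gauss(0, 0.1)   # noisy observation
        update_and_maybe_split(balls, chosen, noisy)
    finest = min(b.radius for b in balls)
    print(f"{len(balls)} balls after 2000 rounds; finest radius = {finest:.4f}")
```

In this sketch the discretization ends up finest near the reward peak, mirroring the intuition that an adaptive covering concentrates where the value is high and visits are frequent; the paper's contribution is to carry this style of analysis to episodic reinforcement learning over a joint state-action metric space.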