Paper Title
Diluted Near-Optimal Expert Demonstrations for Guiding Dialogue Stochastic Policy Optimisation
Paper Authors
Paper Abstract
A learning dialogue agent can infer its behaviour from interactions with users. These interactions can be taken from either human-to-human or human-machine conversations. However, human interactions are scarce and costly, making it essential to learn from few interactions. One solution to speed up the learning process is to guide the agent's exploration with the help of an expert. In this paper we present several imitation learning strategies for dialogue policy optimisation in which the guiding expert is a near-optimal handcrafted policy. We combine these strategies with state-of-the-art reinforcement learning methods based on Q-learning and actor-critic. Notably, we propose a randomised exploration policy that allows for a seamless hybridisation of the learned policy and the expert. Our experiments show that this hybridisation strategy outperforms several baselines and that it can accelerate learning when the agent faces real humans.
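The core idea of the randomised exploration policy is to dilute the expert's near-optimal demonstrations into the agent's own stochastic policy, so that exploration draws from a single mixed distribution rather than switching abruptly between the two. Below is a minimal sketch of one way such a hybrid could look, assuming a softmax policy over Q-values and a fixed mixing weight `beta`; the function name, parameters, and exact mixing scheme are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def hybrid_action(q_values, expert_action, beta=0.3, tau=1.0, rng=None):
    """Sample an action from a mixture of the learned policy and the expert.

    q_values      : Q-value estimates for each action in the current state
    expert_action : index of the action the handcrafted expert would take
    beta          : probability mass redirected to the expert's choice
    tau           : softmax temperature of the learned stochastic policy
    (All names and the mixing scheme are assumptions for illustration.)
    """
    rng = rng or np.random.default_rng()

    # Learned stochastic policy: softmax over Q-value estimates.
    logits = np.asarray(q_values, dtype=float) / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # "Dilute" the expert demonstration into the learned distribution:
    # the expert's action receives an extra beta of probability mass,
    # so every action keeps nonzero probability and exploration stays smooth.
    mixed = (1.0 - beta) * probs
    mixed[expert_action] += beta
    return rng.choice(len(mixed), p=mixed)
```

In practice `beta` would typically be annealed towards zero over training, so the agent relies on the expert early on and gradually shifts to its own learned policy as its Q-estimates improve.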