Paper Title
Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach
Paper Authors
Paper Abstract
Dynamic mechanism design studies how mechanism designers should allocate resources among agents in a time-varying environment. We consider the problem where agents interact with the mechanism designer according to an unknown episodic Markov Decision Process (MDP), in which the agents' rewards and the mechanism designer's state evolve under unknown reward functions and transition kernels. We focus on the online setting with linear function approximation and propose novel learning algorithms to recover the dynamic Vickrey-Clarke-Groves (VCG) mechanism over multiple rounds of interaction. A key contribution of our approach is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space when estimating prices in the dynamic VCG mechanism. We show that the regret of our proposed method is upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a matching $\Omega(T^{2/3})$ lower bound, showing that our algorithm is order-optimal in $T$, where $T$ is the total number of rounds. Our work establishes regret guarantees for online RL in solving dynamic mechanism design problems without prior knowledge of the underlying model.
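For concreteness, a minimal sketch (not taken from the paper) of the standard Clarke-pivot VCG price lifted to an episodic MDP, which is the quantity the abstract's price-estimation step targets. Here $V^{\pi}(s_1)$ denotes the expected cumulative reward of all agents under policy $\pi$ from initial state $s_1$, and $V^{\pi}_{-i}(s_1)$ the same quantity with agent $i$'s reward excluded; this notation is an assumption of the sketch rather than the paper's own.
$$
p_i \;=\; \max_{\pi} V^{\pi}_{-i}(s_1) \;-\; V^{\pi^{*}}_{-i}(s_1),
\qquad
\pi^{*} \in \arg\max_{\pi} V^{\pi}(s_1).
$$
The first term is the counterfactual welfare of the other agents when agent $i$ is removed; estimating it for every agent without knowing the rewards or transition kernel is what motivates the reward-free exploration over a rich policy space mentioned above.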