Paper title
Analysis of Thompson Sampling for Controlling Unknown Linear Diffusion Processes
Paper authors
Paper abstract
Linear diffusion processes serve as canonical continuous-time models for dynamic decision-making under uncertainty. These systems evolve according to drift matrices that specify the instantaneous rates of change in the expected system state, while also experiencing continuous random disturbances modeled by Brownian noise. For instance, in medical applications such as artificial pancreas systems, the drift matrices represent the internal dynamics of glucose concentrations. Classical results in stochastic control provide optimal policies under perfect knowledge of the drift matrices. However, practical decision-making scenarios typically feature uncertainty about the drift; in medical contexts, such parameters are patient-specific and unknown, requiring adaptive policies that learn the drift matrices efficiently while ensuring system stability and optimal performance. We study the Thompson sampling (TS) algorithm for decision-making in linear diffusion processes with unknown drift matrices. TS designs control policies as if a sample from the posterior belief about the parameters coincided with the unknown truth, and we establish that it is efficient. That is, Thompson sampling learns optimal control actions fast, incurring regret that grows only as the square root of time, and also learns to stabilize the system in a short time period. To our knowledge, this is the first such result for TS in a diffusion process control problem. Moreover, our empirical simulations in three settings that involve blood-glucose and flight control demonstrate that TS significantly improves regret compared to state-of-the-art algorithms, suggesting that it explores in a more guarded fashion. Our theoretical analysis includes, among other results, a characterization of a certain optimality manifold that relates the geometry of the drift matrices to the optimal control of the diffusion process.
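The sample-then-act loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or experimental setup; everything specific in it is an assumption made for illustration: a scalar state following dx = (a x + b u) dt + dW, a quadratic cost with weights q and r, a standard Gaussian prior on (a, b), Euler-Maruyama time stepping, resampling once per unit of time, and a clipped feedback gain (`k_max`) to keep the discretized simulation well behaved. The function names `riccati_gain` and `thompson_sampling` are hypothetical.

```python
import numpy as np


def riccati_gain(a, b, q=1.0, r=1.0):
    """Gain k of the optimal feedback u = -k x for dx = (a x + b u) dt + dW.

    Uses the closed-form solution of the scalar Riccati equation
    2 a p - (b^2 / r) p^2 + q = 0, with k = b p / r (assumed quadratic cost).
    """
    return (a + np.sqrt(a * a + b * b * q / r)) / b


def thompson_sampling(a_true=0.5, b_true=1.0, horizon=20.0, dt=0.01,
                      episode=1.0, k_max=3.0, seed=0):
    """Simulate TS on a scalar diffusion; return the posterior mean and last gain."""
    rng = np.random.default_rng(seed)
    V = np.eye(2)    # posterior precision; the assumed prior is N(0, I)
    s = np.zeros(2)  # running sum of z * dx, with z = (x, u)
    x, k = 0.0, 0.0
    steps_per_episode = int(round(episode / dt))
    for step in range(int(round(horizon / dt))):
        if step % steps_per_episode == 0:
            # Sample (a, b) from the Gaussian posterior N(V^{-1} s, V^{-1}).
            cov = np.linalg.inv(V)
            cov = (cov + cov.T) / 2.0  # remove rounding asymmetry
            a_s, b_s = rng.multivariate_normal(cov @ s, cov)
            # Act as if the sample were the truth; skip near-zero b samples
            # and clip the gain (illustrative safeguards, not from the paper).
            if abs(b_s) > 0.1:
                k = float(np.clip(riccati_gain(a_s, b_s), -k_max, k_max))
        u = -k * x
        z = np.array([x, u])
        # Euler-Maruyama step of the true, unknown dynamics.
        dx = (a_true * x + b_true * u) * dt + np.sqrt(dt) * rng.standard_normal()
        # Conjugate Gaussian update for unit-variance Brownian noise.
        V += np.outer(z, z) * dt
        s += z * dx
        x += dx
    return np.linalg.solve(V, s), k


estimate, gain = thompson_sampling()
print("posterior mean of (a, b):", estimate, "last gain:", gain)
```

Within one episode the regressor z = (x, -k x) points in a single direction, so in this sketch the two parameters become separately identifiable only because the sampled gain changes from episode to episode. The randomness of the posterior samples is what supplies that exploration.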