Paper Title
Linear Combinatorial Semi-Bandit with Causally Related Rewards
Paper Authors
Paper Abstract
In a sequential decision-making problem, a structural dependency amongst the reward distributions associated with the arms makes it challenging to identify a subset of alternatives that guarantees the optimal collective outcome. Thus, besides the individual actions' rewards, learning the causal relations is essential to improve the decision-making strategy. To solve the two-fold learning problem described above, we develop the 'combinatorial semi-bandit framework with causally related rewards', where we model the causal relations by a directed graph in a stationary structural equation model. The nodal observation in the graph signal comprises the corresponding base arm's instantaneous reward and an additional term resulting from the causal influences of other base arms' rewards. The objective is to maximize the long-term average payoff, which is a linear function of the base arms' rewards and depends strongly on the network topology. To achieve this objective, we propose a policy that determines the causal relations by learning the network's topology and simultaneously exploits this knowledge to optimize the decision-making process. We establish a sublinear regret bound for the proposed algorithm. Numerical experiments using synthetic and real-world datasets demonstrate the superior performance of our proposed method compared to several benchmarks.
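The abstract describes nodal observations generated by a stationary linear structural equation model (SEM) over a directed graph, where each node's signal is its own instantaneous reward plus causal influences from other base arms. The following is a minimal illustrative sketch of that idea, not the paper's exact construction: the adjacency matrix `A`, the instantaneous rewards `b`, and the example super arm `S` are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of base arms (nodes in the causal graph)

# Assumed weighted adjacency of a DAG over the base arms:
# strictly lower-triangular entries guarantee acyclicity.
A = np.tril(rng.uniform(0.1, 0.5, size=(n, n)), k=-1)

# Assumed instantaneous (pre-propagation) rewards of the base arms.
b = rng.uniform(0.0, 1.0, size=n)

# Nodal observation: own reward plus causal influence of parents,
# i.e. the fixed point of z = A z + b, solved as z = (I - A)^{-1} b.
z = np.linalg.solve(np.eye(n) - A, b)

# The payoff of a chosen subset (super arm) is a linear function of
# the observed signals; here, simply the sum over selected arms.
S = [0, 2, 4]  # an example super arm
payoff = z[S].sum()
```

Because `(I - A)^{-1} = I + A + A^2 + ...` has nonnegative entries here, each observed signal `z[i]` is at least the instantaneous reward `b[i]`, which is why the network topology matters for selecting the payoff-maximizing super arm.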