Paper Title
Modelling Cournot Games as Multi-agent Multi-armed Bandits
Paper Authors
Paper Abstract
We investigate the use of a multi-agent multi-armed bandit (MA-MAB) setting for modeling repeated Cournot oligopoly games, where firms, acting as agents, choose from a set of arms representing production quantities (discrete values). Agents interact with separate and independent bandit problems. In this formulation, each agent makes sequential choices among arms to maximize its own reward. Agents have no information about the environment; they observe only their own rewards after taking an action. However, market demand is a stationary function of total industry output, and random entry into or exit from the market is not allowed. Given these assumptions, we find that an $ε$-greedy approach offers a more viable learning mechanism than other traditional MAB approaches, as it requires no additional knowledge of the system to operate. We also propose two novel approaches that take advantage of the ordered action space: $ε$-greedy+HL and $ε$-greedy+EL. These approaches help firms focus on more profitable actions by eliminating less profitable choices and are hence designed to optimize exploration. We use computer simulations to study the emergence of various equilibria in the outcomes and perform an empirical analysis of joint cumulative regret.
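To make the setup concrete, below is a minimal sketch of independent $ε$-greedy agents in a repeated Cournot game, in the spirit of the abstract but not the paper's implementation. The linear inverse demand $P(Q) = a - bQ$, the constant marginal cost $c$, and all parameter values (`A`, `B`, `C`, the arm grid, `EPSILON`, `T`) are illustrative assumptions.

```python
import numpy as np

# Sketch: repeated Cournot competition as independent epsilon-greedy bandits.
# Each firm sees only its own profit after acting; the market price is a
# stationary function of total industry output, as in the abstract.
# Demand/cost parameters below are hypothetical, chosen for illustration.

rng = np.random.default_rng(0)

N_FIRMS = 2                # number of agents (firms)
N_ARMS = 20                # discrete production quantities: 1..N_ARMS
T = 50_000                 # number of rounds
EPSILON = 0.1              # exploration rate
A, B, C = 40.0, 1.0, 4.0   # demand intercept, slope, marginal cost (assumed)

quantities = np.arange(1, N_ARMS + 1)        # arm k maps to quantity k
q_estimates = np.zeros((N_FIRMS, N_ARMS))    # per-agent mean-reward estimates
pull_counts = np.zeros((N_FIRMS, N_ARMS), dtype=int)

for t in range(T):
    # Each agent independently explores with prob. EPSILON, else acts greedily.
    arms = np.empty(N_FIRMS, dtype=int)
    for i in range(N_FIRMS):
        if rng.random() < EPSILON:
            arms[i] = rng.integers(N_ARMS)
        else:
            arms[i] = int(np.argmax(q_estimates[i]))

    # Market clears on total industry output; each agent observes only
    # its own profit, with no knowledge of rivals or the demand function.
    total_output = quantities[arms].sum()
    price = max(A - B * total_output, 0.0)
    for i in range(N_FIRMS):
        profit = (price - C) * quantities[arms[i]]
        pull_counts[i, arms[i]] += 1
        # Incremental sample-average update of the reward estimate.
        q_estimates[i, arms[i]] += (
            profit - q_estimates[i, arms[i]]
        ) / pull_counts[i, arms[i]]

# Greedy choices after learning, compared against the Cournot-Nash
# benchmark for this linear model: q_Nash = (A - C) / (B * (N_FIRMS + 1)).
print("learned quantities:", quantities[np.argmax(q_estimates, axis=1)])
print("Cournot-Nash quantity per firm:", (A - C) / (B * (N_FIRMS + 1)))
```

With these illustrative parameters the per-firm Cournot-Nash quantity is $(A - C)/(B(N+1)) = 12$, which lies on the arm grid, so the quantities the agents converge to can be compared directly against the competitive benchmark; the $ε$-greedy+HL and $ε$-greedy+EL variants proposed in the paper would additionally prune less profitable regions of this ordered action space, a mechanism not reproduced in this sketch.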