Paper Title
Online Action Learning in High Dimensions: A Conservative Perspective
Paper Authors
Paper Abstract
Sequential learning problems are common in several fields of research and practical applications. Examples include dynamic pricing and assortment, the design of auctions and incentives, and a large number of sequential treatment experiments. In this paper, we extend one of the most popular learning solutions, the $\epsilon_t$-greedy heuristic, to high-dimensional contexts under a conservative directive. We do this by reallocating part of the time the original rule spends adopting completely new actions to a more focused search within a restricted set of promising actions. The resulting rule may be useful for practical applications that still value surprises, albeit at a decreasing rate, while also restricting the adoption of unusual actions. With high probability, we establish reasonable bounds on the cumulative regret of a conservative high-dimensional decaying $\epsilon_t$-greedy rule. We also provide a lower bound on the cardinality of the set of viable actions, which implies an improved regret bound for the conservative version compared to its non-conservative counterpart. Additionally, we show that end-users have ample flexibility in deciding how much safety they want, since this can be tuned without affecting the theoretical properties. We illustrate our proposal both in a simulation exercise and on a real dataset.
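To make the conservative directive concrete, the following is a minimal sketch of a conservative decaying $\epsilon_t$-greedy selection step, written from the abstract's description only. All names and values here (epsilon_0, alpha, n_promising, the choice of top estimated rewards as the "promising" set) are illustrative assumptions, not the paper's notation or its exact rule.

```python
# Minimal sketch of a conservative decaying epsilon_t-greedy step,
# assuming: eps_t = epsilon_0 / t, and a "promising" set taken as the
# n_promising actions with the highest current reward estimates.
import numpy as np

rng = np.random.default_rng(0)

def conservative_eps_greedy(t, rewards_hat, n_promising,
                            epsilon_0=1.0, alpha=0.5):
    """Pick an action index at round t (t >= 1).

    rewards_hat : current reward estimate for each action.
    n_promising : size of the restricted set of promising actions.
    alpha       : share of the exploration budget redirected from the
                  full action space to the promising set (assumed knob).
    """
    eps_t = min(1.0, epsilon_0 / t)            # decaying exploration rate
    if rng.random() > eps_t:
        return int(np.argmax(rewards_hat))     # exploit the best estimate
    if rng.random() < alpha:
        # conservative exploration: search only among promising actions
        promising = np.argsort(rewards_hat)[-n_promising:]
        return int(rng.choice(promising))
    # unrestricted exploration: any action can still be a "surprise"
    return int(rng.integers(len(rewards_hat)))

# Illustrative usage: 3 actions with running reward estimates.
rewards_hat = np.array([0.1, 0.5, 0.3])
action = conservative_eps_greedy(t=10, rewards_hat=rewards_hat, n_promising=2)
```

In this reading, alpha is the safety knob the abstract alludes to: larger values keep exploration closer to actions already known to perform well, while the decaying eps_t still lets completely new actions be tried, though at a decreasing rate.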