Paper Title
DORB: Dynamically Optimizing Multiple Rewards with Bandits
Paper Authors
Paper Abstract
Policy gradients-based reinforcement learning has proven to be a promising approach for directly optimizing non-differentiable evaluation metrics for language generation tasks. However, optimizing for a specific metric reward leads to improvements in mostly that metric only, suggesting that the model is gaming the formulation of that metric in a particular way without often achieving real qualitative improvements. Hence, it is more beneficial to make the model optimize multiple diverse metric rewards jointly. While appealing, this is challenging because one needs to manually decide the importance and scaling weights of these metric rewards. Further, it is important to consider using a dynamic combination and curriculum of metric rewards that flexibly changes over time. Considering the above aspects, in our work, we automate the optimization of multiple metric rewards simultaneously via a multi-armed bandit approach (DORB), where at each round, the bandit chooses which metric reward to optimize next, based on expected arm gains. We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit). We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks: question generation and data-to-text generation, including on an unseen-test transfer setup. Finally, we present interpretable analyses of the learned bandit curriculum over the optimized rewards.
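Below is a minimal sketch, not the authors' implementation, of how an Exp3 bandit could choose which metric reward to optimize at each training round, as the abstract describes. The metric names (`"bleu"`, `"rouge_l"`, `"qap"`), the exploration rate, and the use of a scaled validation gain as the bandit reward are illustrative assumptions.

```python
# Sketch of Exp3-based selection of a metric reward per training round (assumptions noted above).
import math
import random


class Exp3:
    def __init__(self, n_arms, gamma=0.1):
        self.n_arms = n_arms            # one arm per metric reward
        self.gamma = gamma              # exploration rate (assumed value)
        self.weights = [1.0] * n_arms   # one weight per arm

    def _probs(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select_arm(self):
        # Sample which metric reward to optimize next.
        probs = self._probs()
        arm = random.choices(range(self.n_arms), weights=probs, k=1)[0]
        return arm, probs

    def update(self, arm, reward, probs):
        # Exp3 updates only the pulled arm, using an importance-weighted
        # reward estimate; `reward` is assumed to be rescaled into [0, 1].
        x_hat = reward / probs[arm]
        self.weights[arm] *= math.exp(self.gamma * x_hat / self.n_arms)


# Illustrative loop: each round the bandit picks one metric reward, the
# generator is trained with policy gradients on that reward, and the
# resulting held-out gain is fed back as the bandit reward.
metrics = ["bleu", "rouge_l", "qap"]      # hypothetical metric-reward arms
bandit = Exp3(n_arms=len(metrics))
for round_idx in range(100):
    arm, probs = bandit.select_arm()
    gain = random.random()                # stand-in for a scaled validation gain
    bandit.update(arm, gain, probs)
```

The hierarchical variant described in the abstract (HM-Bandit) would, under the same assumptions, layer a second bandit over groups of such arms rather than treating all metric rewards as a single flat set.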