Paper Title

Cooperative Reinforcement Learning on Traffic Signal Control

Paper Authors

Chi-Chun Chao, Jun-Wei Hsieh, Bor-Shiun Wang

Paper Abstract

Traffic signal control is a challenging real-world problem that aims to minimize overall travel time by coordinating vehicle movements at road intersections. Existing traffic signal control systems in use still rely heavily on oversimplified information and rule-based methods. Specifically, the periodicity of green/red light alternation can be treated as a prior that helps each agent plan better during policy optimization. To learn such adaptive and predictive priors, traditional RL-based methods can only return a fixed-length plan from a predefined action pool using local agents alone. Without cooperation between these agents, some agents often conflict with others and thus reduce the overall throughput. This paper proposes a cooperative, multi-objective architecture with age-decaying weights to better estimate multiple reward terms for traffic signal control optimization, termed COoperative Multi-Objective Multi-Agent Deep Deterministic Policy Gradient (COMMA-DDPG). Two types of agents run to maximize rewards for different goals: one for local traffic optimization at each intersection and the other for global traffic waiting-time optimization. The global agent guides the local agents to speed up learning, but it is not used in the inference phase. We also provide an analysis of solution existence together with a convergence proof for the proposed RL optimization. Evaluation is performed on real-world traffic data collected by traffic cameras in an Asian country. Our method can effectively reduce the total delayed time by 60%. Results demonstrate its superiority over SoTA methods.
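To make the training/inference split described above concrete, below is a minimal, hypothetical Python sketch of the COMMA-DDPG idea as summarized in the abstract: per-intersection local agents plus one global agent whose signal shapes learning but is dropped at inference time. All class and function names (LocalAgent, GlobalAgent, train_step, infer_actions) are illustrative assumptions, not the authors' actual code, and the placeholder policies stand in for real DDPG actor/critic networks.

```python
# Hypothetical sketch of the COMMA-DDPG structure from the abstract:
# local agents optimize per-intersection traffic, a global agent
# optimizes total waiting time and only guides training.
import random
from typing import List


class LocalAgent:
    """One agent per intersection; picks a green-phase duration (action)."""

    def __init__(self, intersection_id: int, phase_choices: List[int]):
        self.intersection_id = intersection_id
        self.phase_choices = phase_choices  # candidate green durations (seconds)

    def act(self, local_state: List[float]) -> int:
        # Placeholder policy: a real DDPG actor network would map state -> action.
        return random.choice(self.phase_choices)

    def update(self, local_reward: float, guidance: float) -> None:
        # Placeholder actor/critic update; the global guidance term would bias
        # learning toward globally consistent behaviour (training only).
        pass


class GlobalAgent:
    """Sees all intersections; rewards reductions in total waiting time."""

    def guidance(self, global_state: List[float]) -> float:
        # Placeholder: a real implementation would output a value/advantage
        # signal used to shape each local agent's reward during training.
        return -sum(global_state)  # e.g. negative total queue length


def train_step(local_agents: List[LocalAgent], global_agent: GlobalAgent,
               local_states: List[List[float]],
               local_rewards: List[float]) -> None:
    """Training uses the global agent's signal to guide every local update."""
    global_state = [x for state in local_states for x in state]
    g = global_agent.guidance(global_state)
    for agent, reward in zip(local_agents, local_rewards):
        agent.update(reward, g)  # global signal is used only here


def infer_actions(local_agents: List[LocalAgent],
                  local_states: List[List[float]]) -> List[int]:
    """Inference uses the local agents only; the global agent is not involved."""
    return [a.act(s) for a, s in zip(local_agents, local_states)]


if __name__ == "__main__":
    agents = [LocalAgent(i, phase_choices=[20, 30, 40]) for i in range(3)]
    states = [[4.0, 2.0], [7.0, 1.0], [0.0, 5.0]]  # toy queue lengths per approach
    train_step(agents, GlobalAgent(), states, local_rewards=[-6.0, -8.0, -5.0])
    print(infer_actions(agents, states))
```

The design choice worth noting is that the global agent appears only inside train_step, mirroring the abstract's statement that it aids faster learning but is not used at inference.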
