Paper Title
ISAACS: Iterative Soft Adversarial Actor-Critic for Safety
Paper Authors
Abstract
The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable "deep" methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial "disturbance" agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.
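The abstract's safety filter can be summarized as: forward-simulate the learned fallback policy against the adversarial disturbance from the state a task-driven action would reach, and override that action with the fallback whenever the rollout cannot certify safety. Below is a minimal sketch of this logic under simplifying assumptions; the names `step`, `fallback`, `disturbance`, `is_safe`, and the finite `horizon` are all illustrative stand-ins, not the paper's actual interfaces.

```python
def rollout_is_safe(state, step, fallback, disturbance, is_safe, horizon=50):
    """Forward-simulate the fallback policy against the adversarial
    disturbance; return True if the rollout never leaves the safe set.
    (Illustrative finite-horizon check, not the paper's exact criterion.)"""
    for _ in range(horizon):
        if not is_safe(state):
            return False
        state = step(state, fallback(state), disturbance(state))
    return is_safe(state)


def shielded_action(state, task_action, step, fallback, disturbance, is_safe):
    """Apply the task-driven action only if the fallback policy can still
    certify safety from the resulting next state; otherwise fall back."""
    candidate = step(state, task_action, disturbance(state))
    if rollout_is_safe(candidate, step, fallback, disturbance, is_safe):
        return task_action
    return fallback(state)


# Toy 1D example: dynamics x' = x + u + d, safe set |x| <= 1, a fallback
# that steers toward the origin, and a disturbance that pushes outward.
step = lambda x, u, d: x + u + d
is_safe = lambda x: abs(x) <= 1.0
fallback = lambda x: -0.1 if x > 0 else 0.1
disturbance = lambda x: 0.05 if x > 0 else -0.05
```

In this toy setting, a task action of `0.2` is allowed near the origin but overridden near the boundary of the safe set, e.g. `shielded_action(0.9, 0.2, ...)` returns the fallback action rather than risking an excursion past `|x| = 1`.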