Paper Title
On Avoiding Power-Seeking by Artificial Intelligence
Paper Authors
Paper Abstract
We do not know how to align a very intelligent AI agent's behavior with human interests. I investigate whether -- absent a full solution to this AI alignment problem -- we can build smart AI agents which have limited impact on the world, and which do not autonomously seek power. In this thesis, I introduce the attainable utility preservation (AUP) method. I demonstrate that AUP produces conservative, option-preserving behavior within toy gridworlds and within complex environments based on Conway's Game of Life. I formalize the problem of side effect avoidance, which provides a way to quantify the side effects an agent has on the world. I also give a formal definition of power-seeking in the context of AI agents and show that optimal policies tend to seek power. In particular, most reward functions have optimal policies which avoid deactivation. This is a problem if we want to deactivate or correct an intelligent agent after we have deployed it. My theorems suggest that since most agent goals conflict with ours, the agent would very probably resist correction. I extend these theorems to show that power-seeking incentives occur not just for optimal decision-makers, but under a wide range of decision-making procedures.
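The abstract names the AUP method without stating its form. As a rough illustration only: in the related AUP work (Turner et al., "Conservative Agency via Attainable Utility Preservation"), the agent's reward is penalized by how much an action changes its attainable value on a set of auxiliary goals, relative to doing nothing. The sketch below assumes auxiliary Q-functions `q_aux`, a designated `noop` action, and a penalty weight `lam`; all of these names and the example values are illustrative, not taken from the thesis.

```python
import numpy as np

def aup_reward(q_aux, s, a, noop, primary_reward, lam=0.1):
    """Sketch of an attainable-utility-preservation (AUP) reward.

    q_aux          -- list of auxiliary action-value functions Q_i(s, a)
    s, a           -- current state and chosen action
    noop           -- designated no-op action used as the baseline
    primary_reward -- R(s, a), the task reward for this step
    lam            -- penalty weight lambda (illustrative value)
    """
    # Penalty: mean absolute change in attainable auxiliary value
    # caused by taking action a instead of doing nothing.
    penalty = np.mean([abs(q(s, a) - q(s, noop)) for q in q_aux])
    return primary_reward - lam * penalty

# Illustrative usage with toy Q-functions over integer states/actions:
q_aux = [lambda s, a: float(s + a), lambda s, a: float(s * a)]
print(aup_reward(q_aux, s=2, a=1, noop=0, primary_reward=1.0))  # 0.85
```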
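Likewise, the formal definition of power is only mentioned here. One formalization, from the associated paper "Optimal Policies Tend to Seek Power" (Turner et al., 2021), measures a state's power as the normalized expected optimal value over a distribution of reward functions; the statement below is a hedged sketch of that definition, not a quotation of the thesis.

```latex
% POWER of state s under reward-function distribution D and discount gamma:
\[
  \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
  \;=\; \frac{1-\gamma}{\gamma}\,
  \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s, \gamma) - R(s) \,\right]
\]
% Intuitively: how much optimal future value an agent can attain from s,
% on average across many possible goals. States like "deactivated" score
% low, which is why most reward functions favor avoiding deactivation.
```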