Paper Title

The dangers in algorithms learning humans' values and irrationalities

Authors

Rebecca Gorman, Stuart Armstrong

Abstract

For an artificial intelligence (AI) to be aligned with human values (or human preferences), it must first learn those values. AI systems that are trained on human behaviour risk miscategorising human irrationalities as human values, and then optimising for these irrationalities. Simply learning human values still carries risks: an AI learning them will inevitably also gain information on human irrationalities and human behaviour/policy. Both of these can be dangerous: knowing human policy allows an AI to become generically more powerful (whether it is partially aligned or not aligned at all), while learning human irrationalities allows it to exploit humans without needing to provide value in return. This paper analyses the danger in developing artificial intelligence that learns about human irrationalities and human policy, and constructs a model recommendation system with various levels of information about human biases, human policy, and human values. It concludes that, whatever the power and knowledge of the AI, it is more dangerous for it to know human irrationalities than human values. Thus it is better for the AI to learn human values directly, rather than learning human biases and then deducing values from behaviour.
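The abstract's central claim can be pictured with a toy sketch. The Python snippet below is a hypothetical illustration, not the paper's actual recommendation-system model: the item list, the BIAS_STRENGTH weight, and the engagement score are all assumptions made for the example. A recommender that optimises predicted engagement and knows only the human's values picks the genuinely valuable item, while one that also (or only) knows the human's clickbait bias exploits that bias and delivers little value in return.

```python
# Hypothetical toy sketch (not the paper's model): a recommender picks the item
# it predicts the human will engage with most, given what it knows about the
# human's true values and/or the human's "clickbait" bias.

ITEMS = [
    # (name, true value to the human, clickbait level) -- assumed example data
    ("in-depth article", 1.0, 0.1),
    ("useful tutorial",  0.8, 0.3),
    ("outrage bait",     0.1, 1.0),
]

BIAS_STRENGTH = 2.0  # assumed weight of the bias in the human's click behaviour


def recommend(knows_values: bool, knows_bias: bool):
    """Return the item that maximises the engagement the recommender can predict."""
    def predicted_engagement(item):
        _, value, clickbait = item
        score = 0.0
        if knows_values:
            score += value
        if knows_bias:
            score += BIAS_STRENGTH * clickbait  # exploiting the irrationality
        return score

    return max(ITEMS, key=predicted_engagement)


for knows_values, knows_bias in [(True, False), (True, True), (False, True)]:
    name, value, _ = recommend(knows_values, knows_bias)
    print(f"knows values={knows_values!s:<5} knows bias={knows_bias!s:<5} "
          f"-> recommends {name!r} (value delivered: {value})")
```

Running the loop prints one recommendation per knowledge profile: knowing only the human's values selects the high-value item, whereas adding knowledge of the bias shifts the recommendation to the high-clickbait, low-value item, which mirrors the paper's conclusion that learning irrationalities is the more dangerous kind of knowledge.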
