Risk Alignment in Agentic AI Systems

Introduction

Proper alignment is a tetradic affair, involving relationships among AIs, their users, their developers, and society at large (Gabriel et al. 2024). Agentic AIs (AIs that are capable of undertaking complex actions with little supervision and permitted to do so) mark a new frontier in AI capabilities. Accordingly, they raise new questions about how to safely create and align such systems. Existing AIs, such as LLM chatbots, primarily provide information that human users can then use to plan actions. Thus, while chatbots may have significant effects on society, those effects are largely filtered through human agents. Because agentic AIs would introduce a new kind of actor into society, their effects will arguably be more significant and unpredictable, raising uniquely difficult questions of alignment in all of its aspects.

Here, we focus on an underappreciated[1] aspect of alignment: what attitudes toward risk should guide an agentic AI’s decision-making? An agent’s risk attitudes describe its dispositions when making decisions under uncertainty. A risk-averse agent disfavors bets with high variance in possible outcomes, preferring an action with a high chance of a decent outcome over one with a lower probability of an even better outcome. A risk-seeking agent is willing to tolerate much higher risks of failure if the potential upside is great enough. People exhibit diverse, and sometimes quite extreme, risk attitudes. How should an agentic AI’s risk attitudes be set in order to achieve alignment with users? What guardrails, if any, should be placed on the range of permissible risk attitudes in order to achieve alignment with society and with the designers of AI systems? What are the ethical considerations involved in making risky decisions on behalf of others?
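To make these dispositions concrete, here is a minimal sketch, assuming the standard expected-utility picture in which risk attitudes correspond to the curvature of a utility function; the lotteries, utility functions, and numbers are illustrative assumptions, not drawn from the papers.

```python
# Illustrative sketch: risk attitudes as curvature of a utility function.
# A concave utility models risk aversion; a convex one models risk seeking.

def expected_utility(lottery, utility):
    """lottery: list of (probability, outcome) pairs."""
    return sum(p * utility(x) for p, x in lottery)

# Two candidate actions: a sure thing and a higher-expected-value gamble.
safe_action = [(1.0, 50)]              # guaranteed outcome of 50
risky_action = [(0.5, 120), (0.5, 0)]  # expected value 60, but high variance

attitudes = {
    "risk-averse": lambda x: x ** 0.5,   # concave: diminishing returns to outcomes
    "risk-neutral": lambda x: x,         # linear: cares only about expected value
    "risk-seeking": lambda x: x ** 2,    # convex: weights the upside heavily
}

for name, u in attitudes.items():
    best = max([safe_action, risky_action], key=lambda a: expected_utility(a, u))
    label = "safe" if best is safe_action else "risky"
    print(f"The {name} agent chooses the {label} action.")
# The risk-averse agent takes the sure 50; the other two take the gamble.
```

The point of the sketch is only that the same menu of actions licenses different choices depending on the agent’s risk attitude, which is precisely what makes the choice of attitude an alignment question.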

We present three papers that bear on key normative and technical aspects of these questions.

In the first paper, we examine the relationship between agentic AIs and their users. An agentic AI is “aligned with a user when it benefits the user, when they ask to be benefitted, in the way they expect to be benefitted” (Gabriel et al. 2024, 34). Because individuals’ risk attitudes strongly influence the actions they take and approve of, getting risk attitudes right will be a central part of agentic AI alignment. We propose two models for thinking about the relationship between agentic AIs and their users (the proxy model and the off-the-shelf tool model) and draw out their different implications for risk alignment.

In the second paper, we focus on developers of agentic AI. Developers have important interests and moral duties that are affected by the risk attitudes of the agentic AIs they produce: AIs with reckless attitudes toward risk can expose developers to legal, reputational, and moral liability. We explore how developers can navigate the shared responsibility among users, developers, and agentic AIs so as to best protect their interests and fulfill their moral obligations.

In the third paper, we turn to more technical questions about how agentic AIs might be calibrated to the risk attitudes of their users. We evaluate how imitation learning, prompting, and preference modeling might be used to adapt models to information about users’ risk attitudes, focusing on the kinds of data each learning process would require. We then evaluate methods for eliciting such data about risk attitudes, arguing that some methods are much more reliable and valid than others. We end with recommendations for how to build agentic AIs that best achieve alignment with users and developers.
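As a rough illustration of the preference-modeling route, the sketch below infers a risk-aversion parameter from a handful of hypothetical lottery choices, which could then be used to score candidate actions on the user’s behalf. The CRRA utility form, the logistic choice rule, the parameter grid, and the data are all assumptions made for the example; they are not the methods or data discussed in the paper.

```python
# Illustrative sketch: fit a user's risk-aversion parameter from binary lottery choices.
import math

def crra_utility(x, r):
    """Constant relative risk aversion utility; larger r means more risk-averse (requires x > 0)."""
    return math.log(x) if r == 1 else (x ** (1 - r) - 1) / (1 - r)

def expected_utility(lottery, r):
    return sum(p * crra_utility(x, r) for p, x in lottery)

def log_sigmoid(z):
    """Numerically stable log of the logistic function."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

# Hypothetical elicited data: (lottery_A, lottery_B, user_chose_A).
choices = [
    ([(1.0, 50)], [(0.5, 120), (0.5, 1)], True),   # took the sure thing
    ([(1.0, 30)], [(0.5, 70), (0.5, 1)], True),    # took the sure thing again
    ([(1.0, 10)], [(0.9, 12), (0.1, 1)], False),   # accepted a mild gamble
]

def log_likelihood(r, temperature=1.0):
    """Logistic choice rule: the user usually, but not always, picks the higher-EU option."""
    total = 0.0
    for a, b, chose_a in choices:
        z = (expected_utility(a, r) - expected_utility(b, r)) / temperature
        total += log_sigmoid(z) if chose_a else log_sigmoid(-z)
    return total

# Grid search for the risk parameter that best explains the observed choices.
grid = [i / 10 for i in range(-20, 21)]
r_hat = max(grid, key=log_likelihood)
print(f"Estimated risk-aversion parameter: {r_hat:.1f}")
```

A deployed system would of course need far richer elicitation than a few toy choices, which is why the reliability and validity of elicitation methods matter.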

Notes


  1. This topic isn’t explicitly addressed in recent work on agentic AI alignment from Shavit et al. (2023) or Gabriel et al. (2024).

Acknowledgments

This report is a project of Rethink Priorities. The authors are Hayley Clatterbuck, Clinton Castro, and Arvo Muñoz Morán. Thanks to Jamie Elsey, Bob Fischer, David Moss, and Willem Sleegers for helpful discussions and feedback. This work was supported by funding from OpenAI under a Research into Agentic AI Systems grant.