Publications

Refereed Publications

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Usman Anwar, Abulhair Saparov*, Javier Rando*, Daniel Paleka*, Miles Turpin*, Peter Hase*, Ekdeep Singh Lubana*, Erik Jenner*, Stephen Casper*, Oliver Sourbut*, Benjamin Edelman*, Zhaowei Zhang*, Mario Gunther*, Anton Korinek*, Jose Hernandez-Orallo*, Lewis Hammond†, Eric Bigelow†, Alex Pan†, Lauro Langosco†, Tomasz Korbak†, Heidi Zhang†, Ruiqi Zhong†, Seán Ó hÉigeartaigh‡, Gabriel Recchia†, Giulio Corsi‡, Alan Chan‡, Markus Anderljung‡, Lilian Edwards‡, Yoshua Bengio‡, Danqi Chen‡, Samuel Albanie‡, Tegan Maharaj‡, Jakob Foerster‡, Florian Tramer‡, He He‡, Atoosa Kasirzadeh‡, Yejin Choi‡, David Krueger‡

* indicates major contribution, † indicates minor contribution, ‡ indicates advisory role.

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Influence Functions for Scalable Data Attribution in Diffusion Models

Stress-Testing Capability Elicitation With Password-Locked Models

Predicting Future Actions of Reinforcement Learning Agents

Implicit meta-learning may lead language models to trust more reliable sources

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Characterizing Manipulation from AI Systems

Thinker: Learning to Plan and Act

Harms from Increasingly Agentic Algorithmic Systems

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Broken Neural Scaling Laws

Mechanistic Mode Connectivity

Defining and Characterizing Reward Gaming

Goal Misgeneralization in Deep Reinforcement Learning

Out-of-Distribution Generalization via Risk Extrapolation (REx)

Filling gaps in trustworthy development of AI
