
Fazl Barez

I am a Principal Investigator at the University of Oxford, where I lead research on technical AI safety, interpretability, and governance. I teach Oxford's AI Safety and Alignment course and lead the research team at Martian, where we work on understanding how intelligence works. My research is supported by OpenAI, Anthropic, Schmidt Sciences, NVIDIA, and others.

My work has focused on understanding what happens inside neural networks and using that understanding to make AI systems safer. As models grow more capable, I think the most urgent challenge in interpretability is moving from observation to action—building systems where we can trace a model's internal reasoning, verify it, and correct it when something is wrong. Specific directions I'm working on:

• When a model produces a surprising or harmful output, how can we trace the internal cause—automatically and at scale?
• How can we remove a dangerous capability from a model and be confident it won't come back?
• When a model shows its reasoning, how do we know it actually computed things that way—and how do we give regulators evidence?

Affiliations: Cambridge CSER, NTU Digital Trust Centre, Edinburgh Informatics, ELLIS.
Beyond my core research, I'm drawn to neuroscience, psychology, philosophy of science, poetry, and literature.

I'm looking for motivated collaborators in AI safety, interpretability, and technical AI governance. If my Research Agenda speaks to your interests, I'd love to hear which directions excite you and how your work connects. I'm especially committed to partnering with researchers from underrepresented and disadvantaged backgrounds. Reach out at fazl[at]robots[dot]ox[dot]ac[dot]uk.

News

Feb 2026 🎉 2 papers accepted at ICLR 2026! 🇧🇷
Oct 2025 Gave a talk at Sogang University
Oct 2025 Gave a talk at the Seoul AI Safety & Security Forum
Sep 2025 🎉 3 papers accepted at NeurIPS 2025! 🇩🇰🇲🇽🇺🇸
Aug 2025 🎉 4 papers accepted at EMNLP 2025 - see you all in China! 🇨🇳
Jul 2025 🏫 Thrilled to be speaking at the Human-aligned AI Summer School in Prague! 🇨🇿 Four intensive days of AI alignment research, July 22-25, 2025. 🚀 Applications are open!
Jul 2025 🎉 Excited to be a panelist at the Actionable Interpretability workshop at ICML 2025! 🧠✨ July 19th in Vancouver 🇨🇦
Jul 2025 🌟 Excited to speak at Post-AGI Civilizational Equilibria: Are there any good ones? in Vancouver! 🇨🇦 July 14th, 2025 (co-located with ICML) 🤖✨
Mar 2025 Invited talk on open problems in machine unlearning for AI safety at Ploutos.
Mar 2025 Invited to lecture on AI Safety and Alignment at Oxford Machine Learning Summer School in August.
Feb 2025 I am serving as an Area Chair for ACL 2025.
Jan 2025 Our paper Towards interpreting visual information processing in vision-language models has been accepted at ICLR 2025! See you in Singapore.
Nov 2024 Excited to speak about unlearning and safety at the How To Evaluate AI Privacy Tutorial at NeurIPS 2024! 🍁🇨🇦
Sep 2024 Our paper Interpreting LFPs in Large Language Models has been accepted at NeurIPS 2024! See you in Vancouver 🍁🇨🇦
Sep 2024 Two papers accepted at EMNLP 2024! See you in Miami ❤️
Aug 2024 Excited to be presenting our work on LLMs Relearn Removed Concepts at ACL 2024! See you all in Bangkok 🇹🇭!
Aug 2024 Co-organised the first workshop on Mechanistic Interpretability at ICML 2024! 🎉
Aug 2024 Presented 2 main conference papers and 1 workshop paper at ICML 2024! 🎉
Jun 2024 New paper! 🤖 Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models
May 2024 Our paper on how LLMs relearn removed concepts has been accepted at ACL 2024 🎉 See you in Bangkok 🇹🇭
May 2024 I gave a talk about Machine Unlearning at the Foresight Institute's AGI workshop 🇺🇸
May 2024 Two papers accepted at ICML 2024, and excited to be co-organising the first Mechanistic Interpretability workshop in Vienna ❤️
Apr 2024 Excited to be serving on the Programme Committee of ECAI 2024 🇪🇸: 27th European Conference on Artificial Intelligence
Apr 2024 Invited speaker at the AI Safety Colloquium at KAIST, South Korea 🇰🇷: Introduction to AI safety: Can we remove undesired behaviour from AI?
Mar 2024 Gave a keynote at EACL 2024 Personalization Workshop in Malta 🇲🇹: PERSONALIZE @ EACL 2024
Jan 2024 Our paper has been accepted at ICLR 2024! 🤗 See you in Vienna 🇦🇹: Understanding Addition in Transformers
Jan 2024 New Paper 🎉: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Jan 2024 New Paper 🎉: Can language models relearn removed concepts?
Dec 2023 Presented Measuring Value Alignment at NeurIPS 2023, New Orleans 🗽: Presentation

Research Overview

My research spans four areas, connected by a common thread: understanding what’s happening inside models, fixing what’s wrong, and building the tools and standards that make AI systems trustworthy and empowering. This work is published at top venues including NeurIPS, ICML, ICLR, ACL, EMNLP, and FAccT.

Interpretability—We build systems that make consequential decisions, but we often can’t explain how or why they made them. I work on changing that—figuring out which parts of a model drive a decision, what it’s actually doing when it fails, and what to do when something is wrong.

Safety & Alignment—Seeing inside a model could help catch what testing alone can miss. A model can pass existing evaluations and still behave differently in practice—retaining behaviours we thought we’d removed, or producing confident answers that aren’t grounded in what the model actually processed. I study how that gap emerges and how to close it.

Technical Governance—Understanding models matters more when it’s actionable. I work on translating what we find inside models into structured methods that regulators and auditors can use to evaluate safety claims, verify that interventions worked, and hold those responsible accountable.

Societal Impact—As AI systems become more capable, there’s a risk that humans gradually lose agency—over decisions, institutions, and the systems that shape society. I study how these dynamics emerge and what technical and institutional interventions can keep humans in charge.

For more details, see my Research Agenda and recent papers.

Selected Publications

* = Equal Contribution | For a complete list, visit my Google Scholar

2025

  1. Under Review
    Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, and Yoshua Bengio
    2025
  2. Under Review
    Fazl Barez, Isaac Friend, Keir Reid, Igor Krawczuk, Vincent Wang, Jakob Mökander, Philip Torr, Julia Morse, and Robert Trager
    2025
  3. Under Review
    Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, and Yarin Gal
    2025
  4. Towards Interpreting Visual Information Processing in Vision-Language Models
    ICLR 2025
    Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez
    2025

2024

  1. Interpreting Learned Feedback Patterns in Large Language Models
    NeurIPS 2024
    Luke Marks*, Amir Abdullah*, Clement Neo, Rauno Arike, David Krueger, Philip Torr, and Fazl Barez*
    2024
  2. EMNLP 2024
    Clement Neo*, Shay B. Cohen, and Fazl Barez*
    2024

Teaching & Mentoring

Teaching Experience

2025
Autonomous Intelligent Machines and Systems (AIMS), Information Engineering, University of Oxford
Intensive course on AI safety and alignment
2024
Guest Lecturer - Oxford Machine Learning Summer School
University of Oxford, UK
Lectured on AI Safety and Mechanistic Interpretability
2020
Introductory Applied Machine Learning (Semester 2)
University of Edinburgh, UK
Tutor and Lab Demonstrator
2020
Reinforcement Learning
University of Edinburgh, UK
Tutor and Lab Demonstrator
2019
Reinforcement Learning
University of Edinburgh, UK
Tutor and Lab Demonstrator
2019
Introductory Applied Machine Learning (Semester 1)
University of Edinburgh, UK
Tutor and Lab Demonstrator

Past Mentees

Minseon Kim
Now at Microsoft Research
Michelle Lo
Now at DeepMind
Michael Lan
Now at Martian
Lovis Heindrich
Now at Epoch
Philip Quirke
Now at Martian
Clement Neo
Now at DTC Singapore
Alex Foot
Now at Ripjar
Luke Marks
Now at MATS

Keynotes and Invited Talks

2025
Invited Talk - AI Governance and Unlearning
NVIDIA, Santa Clara, CA
2024
Keynote Speaker - Personalization of Generative AI Systems
EACL 2024, Malta
2024
Invited Panelist - Dialogue on Digital Trust and Safe AI 2024
NTU Singapore-Europe Dialogue, Singapore
2024
Invited Talk - Mechanistic Interpretability for AI Safety
Digital Trust Centre (SAISI), NTU Singapore
2024
Invited Talk - Unlearning and Relearning in LLMs
Digital Trust Centre (SAISI), NTU Singapore
2024
Invited Talk - AGI: Safety & Security Workshop
Foresight Institute, San Francisco, USA
2024
Invited Panelist - Mechanistic Interpretability
ICLR 2024, Vienna, Austria
2024
Invited Talk - Introduction to AI safety: Can we remove undesired behaviour from AI?
KAIST, South Korea
2023
Invited Talk - Interpretability for Safety and Alignment
Foresight Institute, USA

Selected Awards and Honors