
Fazl Barez

Senior Research Fellow, University of Oxford

I'm a Senior Research Fellow at the University of Oxford, affiliated with the Torr Vision Group (TVG) in the Department of Engineering Science and the AI Governance Initiative (AIGI) at the Oxford Martin School. My work spans academia, leading industry labs, AI Security Institutes, and non-profits to advance safe AI practices. I'm humbled that my research has helped shape academic and industry standards.

My contributions include leading research with the UK AISI on machine unlearning for AI safety, developing the N2G method adopted by OpenAI to evaluate sparse autoencoders, and leading the Alan Turing Institute's response to the UK House of Lords call for evidence on large language models, which informed parliamentary inquiries. I've also collaborated with Anthropic on papers investigating deception in LLMs and reward hacking, among other topics.

I'm affiliated with Cambridge's CSER, NTU's Digital Trust Centre, and Edinburgh's School of Informatics, and I'm a member of ELLIS. In 2024–2025, I served as a Research Consultant with Anthropic's Alignment team. Previously, I was a researcher at Amazon, a researcher at Huawei, and Co-Director and Head of Research at Apart Research.

I'm looking for motivated people interested in AI Safety, Interpretability, and Technical AI Governance. I value working collaboratively and am especially committed to working with researchers from disadvantaged backgrounds. If this resonates with you, contact me: fazl[at]robots[dot]ox[dot]ac[dot]uk.

News

Feb 2025 I am serving as an Area Chair for ACL 2025.
Jan 2025 Our paper Towards interpreting visual information processing in vision-language models accepted at ICLR 2025! See you in Singapore.
Nov 2024 Excited to speak about unlearning and Safety at the How To Evaluate AI Privacy Tutorial at NeurIPS 2024!🍁🇨🇦
Sep 2024 Our paper Interpreting Learned Feedback Patterns in Large Language Models has been accepted at NeurIPS 2024! See you in Vancouver 🍁🇨🇦
Sep 2024 Two papers accepted at EMNLP 2024! See you in Miami ❤️
Aug 2024 Excited to be presenting our work on LLMs Relearn Removed Concepts at ACL 2024! See you all in Bangkok 🇹🇭!
Aug 2024 Co-organised the first workshop on Mechanistic Interpretability at ICML 2024! 🎉
Aug 2024 Presented two main conference papers and one workshop paper at ICML 2024! 🎉
Jun 2024 New paper! 🤖 Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models
May 2024 Our paper on how LLMs relearn removed concepts has been accepted at ACL 2024 🎉 See you in Bangkok 🇹🇭
May 2024 I gave a talk about Machine Unlearning at the Foresight Institute's AGI workshop 🇺🇸
May 2024 Two papers accepted at ICML 2024, and excited to be co-organising the first Mechanistic Interpretability workshop in Vienna ❤️
Apr 2024 Excited to be serving on the Programme Committee of ECAI 2024, the 27th European Conference on Artificial Intelligence 🇪🇸
Apr 2024 Invited speaker at the AI Safety Colloquium at KAIST, South Korea 🇰🇷: Introduction to AI safety: Can we remove undesired behaviour from AI?
Mar 2024 Gave a keynote at EACL 2024 Personalization Workshop in Malta 🇲🇹: PERSONALIZE @ EACL 2024
Jan 2024 Our paper has been accepted at ICLR 2024! 🤗 See you in Vienna 🇦🇹: Understanding Addition in Transformers
Jan 2024 New Paper 🎉: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Jan 2024 New Paper 🎉: Can language models relearn removed concepts?
Dec 2023 Presented Measuring Value Alignment at NeurIPS 2023, New Orleans 🗽

Research Overview

My research is focused on ensuring that as AI systems grow in capability, they remain safe, interpretable, and beneficial. My work, published in leading venues such as NeurIPS, ICML, ICLR, ACL, and EMNLP, spans four interconnected areas:

Mechanistic Interpretability

I develop concrete methods to reveal how AI models process information internally. For example, by using sparse autoencoders and neural circuit analysis, I uncover the specific circuits in transformer models that drive decision-making. These techniques allow vulnerabilities—such as potential jailbreaks—to be identified and mitigated early.
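
As a rough illustration of the sparse-autoencoder part of this work, here is a minimal PyTorch sketch: a single-layer autoencoder with a ReLU bottleneck and an L1 sparsity penalty, trained to reconstruct cached transformer activations. The sizes, hyperparameters, and random data below are placeholders, not the setup from any specific paper.

```python
# Minimal sketch of a sparse autoencoder (SAE) for decomposing transformer
# activations into interpretable features. Toy sizes and data; illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative feature activations encourage a sparse, interpretable code.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()

if __name__ == "__main__":
    d_model, d_hidden = 512, 2048        # hypothetical residual-stream width and dictionary size
    sae = SparseAutoencoder(d_model, d_hidden)
    acts = torch.randn(64, d_model)      # stand-in for cached transformer activations
    recon, feats = sae(acts)
    loss = sae_loss(recon, acts, feats)
    loss.backward()
    print(f"loss={loss.item():.4f}, "
          f"active features per example={(feats > 0).float().sum(dim=-1).mean().item():.1f}")
```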

Safety and Alignment

I create practical tools to detect and address deceptive or misaligned behaviors in AI systems. My research shows that models can develop hidden behaviors that persist even after safety interventions. I’ve also established concrete metrics for assessing value alignment and detecting high-confidence hallucinations in language models.
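
As a sketch of what "high-confidence" means operationally, the snippet below scores a generated answer by its per-token probabilities; an answer that scores highly yet fails an external factuality check would be treated as a high-confidence hallucination. This is an illustrative baseline with toy inputs, not the metric from any particular paper.

```python
# Illustrative confidence scoring for a generated answer, given per-position
# logits and the tokens actually generated. Toy inputs; not a specific method.
import torch

def generation_confidence(logits: torch.Tensor, token_ids: torch.Tensor) -> dict:
    """logits: [seq_len, vocab] scores at each generated position;
    token_ids: [seq_len] the tokens the model actually produced."""
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs[torch.arange(token_ids.shape[0]), token_ids]
    return {
        "mean_token_prob": token_log_probs.exp().mean().item(),
        "min_token_prob": token_log_probs.exp().min().item(),
    }

if __name__ == "__main__":
    vocab, seq_len = 100, 8                     # toy sizes
    logits = torch.randn(seq_len, vocab) * 5.0  # stand-in for model outputs
    tokens = logits.argmax(dim=-1)              # pretend the model decoded greedily
    scores = generation_confidence(logits, tokens)
    # A high-scoring answer that fails an external factuality check would be
    # flagged as a high-confidence hallucination.
    print(scores)
```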

Technical Governance

I translate technical insights into structured frameworks that inform robust AI safety standards and governance. By analyzing how specific model components contribute to harmful outputs, my work provides evidence-based methods—such as techniques for unlearning and reward model interpretability—that support rigorous evaluation protocols.
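
For the unlearning strand, the sketch below shows a common gradient-ascent baseline: increase the loss on a "forget" set while preserving performance on a "retain" set. The model, data, and weighting are toy placeholders, meant only to illustrate the kind of intervention that evaluation protocols need to stress-test; it is not the specific method from any one paper.

```python
# Minimal gradient-ascent unlearning baseline: ascend on the forget set,
# descend on the retain set. Toy model and data; illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def unlearning_step(forget_batch, retain_batch, forget_weight=0.5):
    fx, fy = forget_batch
    rx, ry = retain_batch
    # Negative term on the forget data raises its loss (gradient ascent);
    # the retain term keeps overall capability from collapsing.
    loss = loss_fn(model(rx), ry) - forget_weight * loss_fn(model(fx), fy)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    forget = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
    retain = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
    for step in range(3):
        print(f"step {step}: combined loss {unlearning_step(forget, retain):.3f}")
```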

Societal Impact

I examine the broader implications of advanced AI systems on society. By identifying how technical design choices propagate bias and undermine fairness, my work outlines intervention points to ensure AI benefits are equitably distributed. My research explores how algorithmic decisions can deepen societal inequities, developing frameworks to measure value alignment between AI systems and human ethical principles.

Selected Publications

* = Equal Contribution

For a more complete list, visit my Google Scholar

2025

  1. Under Review
    Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, and Yarin Gal
  2. ICLR 2025
    Towards Interpreting Visual Information Processing in Vision-Language Models
    Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez

2024

  1. NeurIPS 2024
    Interpreting Learned Feedback Patterns in Large Language Models
    Luke Marks*, Amir Abdullah*, Clement Neo, Rauno Arike, David Krueger, Philip Torr, and Fazl Barez*
  2. ICML 2024
    Francisco Eiras, Aleksandar Petrov, [....], Fazl Barez, [...], Trevor Darrell, Yong Suk Lee, and Jakob Foerster
  3. EMNLP 2024
    Michael Lan*, Philip Torr, and Fazl Barez*
  4. EMNLP 2024
    Clement Neo*, Shay B. Cohen, and Fazl Barez*
  5. ACL 2024
    Large Language Models Relearn Removed Concepts
    Michelle Lo*, Shay B. Cohen, and Fazl Barez*

2023

  1. ICLR 2024
    Understanding Addition in Transformers
    Philip Quirke and Fazl Barez
  2. NeurIPS 2023 MP2 Workshop
    Measuring Value Alignment
    Fazl Barez and Philip Torr
  3. Alan Turing Institute
    The Alan Turing Institute’s response to the House of Lords Large Language Models Call for Evidence
    Fazl Barez, Philip H. S. Torr, Aleksandar Petrov, Carolyn Ashurst, Jennifer Ding, Ardi Janjeva, Alexander Babuta, Morgan Briggs, Jonathan Bright, Stephanie Cairns, Miranda Cross, David Leslie, Helen Margetts, Deborah Morgan, Jacob Pratt, Vincent Straub, Christopher Thomas, Sophie Arana, Christopher Burr, Cassandra Gould Van Praag, Kalle Westerling, Kirstie Whitaker, Arielle Bennett, Malvika Sharan, Bastian Greshake Tzovaras, Ashley Van De Casteele, and Matt Fuller
  4. ACL 2023
    Antonio Valerio Miceli Barone*, Fazl Barez*, Ioannis Konstas, and Shay B. Cohen
  5. ICLR 2023 RTML Workshop
    Alex Foote*, Neel Nanda, Esben Kran, Ioannis Konstas, Shay B. Cohen, and Fazl Barez*
  6. ACL 2023
    Jason Hoelscher-Obermaier*, Julia Persson*, Esben Kran, Ioannis Konstas, and Fazl Barez*
  7. Journal, 2023
    Ondrej Bohdal*, Timothy Hospedales, Philip H. S. Torr, and Fazl Barez*

2021

  1. Alan Turing Institute
    David Beavan, Fazl Barez, M Bel, John Fitzgerald, Eirini Goudarouli, Konrad Kollnig, and Barbara McGillivray

Keynotes and Invited Talks

2024
Keynote Speaker - Personalization of Generative AI Systems
EACL 2024, Malta
2024
Invited Panelist - Dialogue on Digital Trust and Safe AI 2024
NTU Singapore-Europe Dialogue, Singapore
2024
Invited Talk - Mechanistic Interpretability for AI Safety
Digital Trust Centre (SAISI), NTU Singapore
2024
Invited Talk - Unlearning and Relearning in LLMs
Digital Trust Centre (SAISI), NTU Singapore
2024
Invited Talk - AGI: Safety & Security Workshop
Foresight Institute, San Francisco, USA
2024
Invited Panelist - Mechanistic Interpretability
ICLR 2024, Vienna, Austria
2024
Invited Talk - Technical AI Safety Conference 2024
TAIS Conference, Japan
2024
Invited Talk - Introduction to AI safety: Can we remove undesired behaviour from AI?
KAIST, South Korea
2023
Invited Talk - Interpretability for Safety and Alignment
Foresight Institute, USA

Selected Awards and Honors