Fazl Barez

Mostly human.
Sometimes researcher.
Always curious.

I'm a Senior Research Fellow at the University of Oxford leading research on Technical AI Safety and Governance. My research bridges technical innovation and real-world impact, from developing algorithms that major AI labs adopt to shaping policy responses at the highest levels of government. I currently serve as an advisor to Martian. I'm interested in commercializing interpretability research through startups to solve real-world problems.

My contributions include leading research with the UK AI Security Institute on machine unlearning for AI safety and developing the N2G algorithm adopted by OpenAI to evaluate sparse autoencoders for interpretability. I also spearheaded the Alan Turing Institute's response to the UK House of Lords inquiry on large language models and worked with Anthropic's Alignment team (2024-2025) on studies investigating deception and reward hacking, among other topics.

My research is funded by OpenAI, Anthropic, Schmidt Sciences, Future of Life Institute, and NVIDIA, among others. I'm affiliated with Cambridge's CSER, NTU's Digital Trust Centre, Edinburgh's Informatics, and am a member of ELLIS. Previously, I was a researcher at Amazon and Huawei, and Co-director and Head of Research at Apart Research.

I'm looking for motivated people interested in AI Safety, Interpretability, and Technical AI Governance. I value collaborative work and am especially committed to partnering with researchers from underrepresented, marginalized, and otherwise disadvantaged backgrounds. If this resonates with you, please contact me at fazl[at]robots[dot]ox[dot]ac[dot]uk

News

Sep 2025

🎉 3 papers accepted at NeurIPS 2025! 🇩🇰🇲🇽🇺🇸

Aug 2025

🎉 4 papers accepted at EMNLP 2025 - see you all in China! 🇨🇳

Jul 2025

🏫 Thrilled to be speaking at the Human-aligned AI Summer School in Prague! 🇨🇿 Four intensive days of AI alignment research, July 22-25, 2025. 🚀 Applications are open!

Jul 2025

🎉 Excited to be a panelist at the Actionable Interpretability workshop at ICML 2025! 🧠✨ July 19th in Vancouver 🇨🇦

Jul 2025

🌟 Excited to speak at Post-AGI Civilizational Equilibria: Are there any good ones? in Vancouver! 🇨🇦 July 14th, 2025 (co-located with ICML) 🤖✨

Mar 2025

Invited talk on open problems in machine unlearning for AI safety at Ploutos.

Mar 2025

Invited to lecture on AI Safety and Alignment at Oxford Machine Learning Summer School in August.

Feb 2025

I am serving as an Area Chair for ACL 2025.

Jan 2025

Our paper Towards interpreting visual information processing in vision-language models accepted at ICLR 2025! See you in Singapore.

Nov 2024

Excited to speak about unlearning and safety at the How To Evaluate AI Privacy Tutorial at NeurIPS 2024! 🍁🇨🇦

Sep 2024

Our paper Interpreting LFPs in Large Language Models has been accepted at NeurIPS 2024! See you in Vancouver 🍁🇨🇦

Sep 2024

Two papers accepted at EMNLP 2024! See you in Miami ❤️

Aug 2024

Excited to be presenting our work on LLMs Relearn Removed Concepts at ACL 2024! See you all in Bangkok 🇹🇭!

Aug 2024

Co-organised the first workshop on Mechanistic Interpretability at ICML 2024! 🎉

Aug 2024

Presented 2 main conference papers and 1 workshop paper at ICML 2024! 🎉

Jun 2024

New paper! 🤖 Sycophancy To Subterfuge: Investigating Reward Tampering In Language Models

May 2024

Our paper on how LLMs relearn removed concepts has been accepted at ACL 2024🎉

See you in Bangkok 🇹🇭

May 2024

I gave a talk about Machine Unlearning at Foresights AGI workshop 🇺🇸

May 2024

Two papers accepted at ICML 2024 and excited to be co-organising the first MI workshop in Vienna ❤️: ICML 2024

Apr 2024

Excited to be serving as a Programme Committee member at ECAI 2024 🇪🇸: 27th European Conference on Artificial Intelligence

Apr 2024

Invited speaker at AI Safety Colloquium at KAIST, South Korea 🇰🇷: Introduction to AI safety: Can we remove undesired behaviour from AI?

Mar 2024

Gave a keynote at EACL 2024 Personalization Workshop in Malta 🇲🇹: PERSONALIZE @ EACL 2024

Jan 2024

Our paper has been accepted at ICLR 2024! 🤗 See you in Vienna 🇦🇹: Understanding Addition In Transformers

Jan 2024

New Paper 🎉: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 2024

New Paper 🎉: Can language models relearn removed concepts?

Dec 2023

Presented Measuring Value Alignment at NeurIPS 2023, New Orleans 🗽: Presentation

Chain-of-thought (CoT) prompting has emerged as a powerful technique for improving the reasoning capabilities of large language models (LLMs). However, there is a growing misconception that CoT outputs provide genuine explanations of the model’s reasoning process. This paper argues that CoT is not explainability in the traditional sense and discusses the implications of this distinction. We examine the differences between CoT outputs and true explanations, highlighting how CoT may actually obscure rather than illuminate the model’s internal reasoning mechanisms. Our analysis reveals that while CoT can improve performance on various tasks, it does not necessarily provide insight into how the model arrives at its conclusions. We discuss the risks of treating CoT as explainability and call for more rigorous approaches to understanding LLM reasoning processes.

Artificial-intelligence systems built with statistical machine learning have become the operating system of contemporary surveillance and information control, spanning both physical and online spaces. City-scale face-recognition grids, real-time social-media takedown engines and predictive "pre-crime" dashboards share four politically relevant technical features: massive data ingestion, black-box inference, automated decision-making, and no human in the loop. These features now amplify authoritarian power and erode liberal-democratic norms across many political regimes. Yet mainstream machine learning research still devotes only limited attention to technical safeguards such as differential privacy, federated-learning security and large-model interpretability, or adversarial methods that can help the public resist AI-enhanced domination. We identify four resulting gaps and propose re-directing the field toward a triad of safeguards—privacy preservation, formal interpretability and adversarial user tooling—and outline concrete research directions that fit within standard ML practice. Shifting community priorities toward Explainable-by-Design, Privacy-by-Default is a pre-condition for any durable defense of liberal democracy.

As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning – the ability to selectively forget or suppress specific types of knowledge – has shown promise for privacy and data removal tasks, which has been the primary focus of existing research. More recently, its potential application to AI safety has gained attention. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear (CBRN) safety. In these contexts, information can be both beneficial and harmful, and models may combine seemingly harmless information for harmful purposes – unlearning this information could strongly affect beneficial uses. We provide an overview of inherent constraints and open problems, including the broader side effects of unlearning dangerous knowledge, as well as previously unexplored tensions between unlearning and existing safety mechanisms. Finally, we investigate challenges related to evaluation, robustness, and the preservation of safety features during unlearning. By mapping these limitations and open challenges, we aim to guide future research toward realistic applications of unlearning within a broader AI safety framework, acknowledging its limitations and highlighting areas where alternative approaches may be required.
