Previously, I worked as a Technology and Security Policy Fellow at RAND, on interpretability at Amazon and The DataLab, on safe recommender systems at Huawei, and on building a budget-management tool for economic scenario forecasting at NatWest Group.
I received my PhD in AI from Edinburgh, where I am currently a visiting scholar collaborating closely with Shay Cohen.
Presented Measuring Value Alignment at NeurIPS 2023, New Orleans: Presentation
Research Interests
My research focuses on ensuring AI systems are safe, reliable, and beneficial as they grow more advanced. I work on techniques including but not limited to:
Interpretability and explainability to reveal how AI systems think, make decisions, and process information. This sheds light on their capabilities and limitations compared to human cognition.
Safety and alignment mechanisms so systems continue behaving reliably even as they become more autonomous.
Rigorous benchmarking and testing methodologies to deeply probe strengths, weaknesses, and failure modes.
Fairness, transparency, and modeling societal impacts to encourage responsible development and adoption.
I take an interdisciplinary approach, drawing inspiration from fields like philosophy and cognitive science to enrich techniques for safe and beneficial AI.
I openly share my work on these topics to advance AI safety through rigorous research and to facilitate positive real-world impact.
I am always eager to discuss ideas or opportunities for collaboration. Please feel free to send me an email if you would like to connect!
Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons.
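As a rough illustration of this kind of measurement (the toy model, data, and saliency proxy below are mine, not the paper's setup), one can prune the neurons most associated with a concept, retrain, and track how strongly the pruned positions re-acquire it:

```python
# Minimal sketch (not the paper's code): prune "concept" neurons in a small
# MLP, retrain briefly, and track concept saliency at the pruned positions.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
x = torch.randn(256, 16)
concept_mask = x[:, 0] > 0                      # toy "concept": feature 0 is positive
y = concept_mask.long()

def hidden(inputs):
    return torch.relu(model[0](inputs))         # hidden-layer activations

# Saliency proxy: mean activation gap between concept and non-concept inputs.
def concept_saliency():
    h = hidden(x)
    return (h[concept_mask].mean(0) - h[~concept_mask].mean(0)).abs()

pre = concept_saliency()
pruned = pre.topk(8).indices                    # neurons most tied to the concept
with torch.no_grad():                           # prune: zero their incoming weights
    model[0].weight[pruned] = 0.0
    model[0].bias[pruned] = 0.0

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for step in range(200):                         # retrain and watch relearning
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if step % 50 == 0:
        sal = concept_saliency()[pruned].mean().item()
        print(f"step {step}: saliency of pruned neurons = {sal:.3f}")
```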
As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical. This paper introduces a novel formalism to quantify the alignment between AI systems and human values, using Markov Decision Processes (MDPs) as the foundational model. We delve into the concept of values as desirable goals tied to actions and norms as behavioral guidelines, aiming to shed light on how they can be used to guide AI decisions. This framework offers a mechanism to evaluate the degree of alignment between norms and values by assessing preference changes across state transitions in a normative world. By utilizing this formalism, AI developers and ethicists can better design and evaluate AI systems to ensure they operate in harmony with human values. The proposed methodology holds potential for a wide range of applications, from recommendation systems emphasizing well-being to autonomous vehicles prioritizing safety.
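A minimal sketch of the core idea, with the scoring rule and example states invented for illustration rather than taken from the paper's formalism: a value induces preferences over states, and a norm can be scored by how it shifts those preferences across the transitions it induces:

```python
# Illustrative only: a value assigns preferences over states, and a norm is
# scored by the average preference change across the transitions it induces.
from typing import Callable, Iterable, Tuple

State = str
Transition = Tuple[State, State]   # (state before, state after) under a norm

def alignment(value: Callable[[State], float],
              transitions: Iterable[Transition]) -> float:
    """Mean preference change across transitions; positive means the norm
    tends to move the world toward states the value prefers."""
    deltas = [value(s_next) - value(s) for s, s_next in transitions]
    return sum(deltas) / len(deltas) if deltas else 0.0

# Toy example: a "well-being" value and a norm that nudges users to rest.
wellbeing = {"overworked": 0.2, "rested": 0.9, "neutral": 0.5}.get
norm_transitions = [("overworked", "rested"), ("neutral", "rested")]
print(alignment(wellbeing, norm_transitions))   # > 0: norm aligned with value
```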
Locating Cross-Task Sequence Continuation Circuits in Transformers
While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of digits, number words, and months. Through the application of circuit analysis techniques, we identify key sub-circuits responsible for detecting sequence members and for predicting the next member in a sequence.
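A minimal activation-patching sketch of the kind used in such circuit analysis (GPT-2 via Hugging Face, an arbitrary middle layer, and toy prompts stand in for the paper's actual setup):

```python
# Run a "corrupted" prompt, patch in the residual stream from a "clean" prompt
# at one block, and see how the next-token prediction moves.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("1 2 3 4", return_tensors="pt")     # increasing digit sequence
corrupt = tok("5 9 2 7", return_tensors="pt")   # same length, no pattern
LAYER = 6                                       # arbitrary middle block

# Cache the clean run's hidden states at the chosen block.
cached = {}
def save_hook(module, inputs, output):
    cached["h"] = output[0].detach()
handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# Patch the cached activations into the corrupted run.
def patch_hook(module, inputs, output):
    return (cached["h"],) + output[1:]
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**corrupt).logits[0, -1]
handle.remove()

print(tok.decode(logits.argmax()))   # does the clean sequence's continuation reappear?
```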
This paper presents an in-depth analysis of a one-layer Transformer model trained for n-digit integer addition. We reveal that the model divides the task into parallel, digit-specific streams and employs distinct algorithms for different digit positions. Our study also finds that the model starts calculations late but executes them rapidly. A rare use case with high loss is identified and explained.
Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders
Luke Marks, Amir Abdullah, Luna Mendez, Rauno Arike, Philip Torr, and 1 more author
Large language models (LLMs) aligned to human preferences via reinforcement learning from human feedback (RLHF) underpin many commercial applications. However, how RLHF impacts LLM internals remains opaque. We propose a novel method to interpret learned reward functions in RLHF-tuned LLMs using sparse autoencoders. Our approach trains autoencoder sets on activations from a base LLM and its RLHF-tuned version.
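A minimal sparse-autoencoder sketch, with dimensions, sparsity penalty, and data as placeholders rather than the paper's configuration:

```python
# Learn an overcomplete, sparse dictionary of directions in activation space.
import torch
import torch.nn as nn

D_MODEL, D_HIDDEN, L1 = 512, 2048, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(D_MODEL, D_HIDDEN)
        self.dec = nn.Linear(D_HIDDEN, D_MODEL)

    def forward(self, x):
        f = torch.relu(self.enc(x))     # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, D_MODEL)       # stand-in for cached LLM activations

for step in range(100):
    x = acts[torch.randint(0, len(acts), (256,))]
    recon, f = sae(x)
    loss = (recon - x).pow(2).mean() + L1 * f.abs().mean()  # reconstruction + sparsity
    opt.zero_grad(); loss.backward(); opt.step()
```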
AI Systems of Concern
Kayla Matteucci, Shahar Avin, Fazl Barez, and Seán Ó hÉigeartaigh
Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning. We label this cluster of characteristics as "Property X". Most present AI systems are low in "Property X"; however, in the absence of deliberate steering, current research directions may rapidly lead to the emergence of highly capable AI systems that are also high in "Property X". We argue that "Property X" characteristics are intrinsically dangerous, and when combined with greater capabilities will result in AI systems for which safety and control are difficult to guarantee.
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
As large language models (LLMs) become more capable, there is an urgent need for interpretable and transparent tools. Current methods are difficult to implement, and accessible tools to analyze model internals are lacking. To bridge this gap, we present DeepDecipher - an API and interface for probing neurons in transformer models’ MLP layers. DeepDecipher makes the outputs of advanced interpretability techniques for LLMs readily available. The easy-to-use interface also makes inspecting these complex models more intuitive.
The Alan Turing Institute’s response to the House of Lords Large Language Models Call for Evidence
Fazl Barez, Philip H. S. Torr, Aleksandar Petrov, Carolyn Ashurst, Jennifer Ding, and 22 more authors
2023
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python
Antonio Valerio Miceli Barone*, Fazl Barez*, Ioannis Konstas, and Shay B Cohen
Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to properly generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases, an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability.
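To illustrate the kind of invariance meant (the snippet is my own example, not one of the paper's test cases): swapping two built-in names is a semantically ordinary rebinding, yet models often continue completing code as if the original bindings still held:

```python
# After this line, `len` prints and `print` measures length; the program is
# still valid Python, but such code is rare in training data.
len, print = print, len

def count_words(sentence):
    # A human reader applies the swap; models asked to complete such code
    # frequently generate the pre-swap usage instead.
    return print(sentence.split())

len(count_words("the larger they are the harder they fail"))   # prints 8
```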
Neuron to Graph: Interpreting Language Model Neurons at Scale
Alex Foote*, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, and 1 more author
Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron’s behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while enriching dataset examples with diverse samples to better encompass the full spectrum of neuron behaviour.
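A rough sketch of the first steps of such a pipeline, with data structures and thresholds simplified well beyond what N2G actually does: take the dataset examples that most activate a neuron, keep only the tokens around the activation peak, and record which preceding tokens co-occur with the activating token as a small graph:

```python
from collections import defaultdict

# (tokens, per-token activation of one neuron) for a handful of examples
examples = [
    (["the", "cat", "sat", "on", "the", "mat"], [0.0, 0.1, 0.0, 0.0, 0.1, 0.9]),
    (["a", "red", "mat", "lay", "flat"],        [0.0, 0.2, 0.8, 0.0, 0.0]),
]

graph = defaultdict(set)                 # activating token -> preceding context tokens
for tokens, acts in examples:
    peak = max(range(len(acts)), key=acts.__getitem__)
    if acts[peak] < 0.5:                 # ignore weakly activating examples
        continue
    window = tokens[max(0, peak - 2): peak]      # truncate to the local context
    graph[tokens[peak]].update(window)

print(dict(graph))                       # e.g. {'mat': {'on', 'the', 'a', 'red'}}
```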
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Jason Hoelscher-Obermaier*, Julia Persson*, Esben Kran, Ioannis Konstas, and Fazl Barez*
Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we extend the metrics used for measuring specificity with a principled KL divergence-based metric. We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity. Our findings highlight the need for improved specificity benchmarks that identify and prevent unwanted side effects.
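A sketch of a KL-divergence specificity check in this spirit (the prompts, models, and aggregation are placeholders): on prompts the edit should not affect, compare the next-token distributions of the model before and after editing; a large divergence signals unwanted side effects:

```python
import torch
import torch.nn.functional as F

def specificity_kl(logits_before: torch.Tensor, logits_after: torch.Tensor) -> float:
    """KL(P_before || P_after) over the next-token distribution."""
    log_p = F.log_softmax(logits_before, dim=-1)
    log_q = F.log_softmax(logits_after, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="sum").item()

# Usage: for each neighbourhood prompt, take last-position logits from the
# unedited and edited models and average the per-prompt KL.
before = torch.randn(50257)            # stand-ins for real model logits
after = before + 0.1 * torch.randn(50257)
print(specificity_kl(before, after))
```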
Fairness in AI and Its Long-Term Implications on Society
Ondrej Bohdal*, Timothy Hospedales, Philip H. S. Torr, and Fazl Barez*
Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society. However, AI systems have also been shown to harm parts of the population due to biased predictions. AI fairness focuses on mitigating such biases to ensure AI decision making is not discriminatory towards certain groups. We take a closer look at AI fairness and analyze how lack of AI fairness can lead to deepening of biases over time and act as a social stressor.
Exploring the Advantages of Transformers for High-Frequency Trading
Fazl Barez, Paul Bilokon, Arthur Gervais, and Nikita Lisitsyn
This paper explores novel deep learning Transformer architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to traditional Long Short-Term Memory models. A hybrid Transformer model, called HFformer, is then introduced for time series forecasting; it incorporates a Transformer encoder, linear decoder, spiking activations, and quantile loss function, and does not use position encoding.
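For reference, the quantile (pinball) loss in a minimal form; the quantile level and tensors are illustrative rather than HFformer's configuration:

```python
import torch

def quantile_loss(pred: torch.Tensor, target: torch.Tensor, q: float) -> torch.Tensor:
    """Penalise under-prediction with weight q and over-prediction with 1 - q."""
    err = target - pred
    return torch.mean(torch.maximum(q * err, (q - 1) * err))

pred = torch.tensor([0.0, 0.1, -0.2])
target = torch.tensor([0.05, 0.0, -0.1])
print(quantile_loss(pred, target, q=0.9))   # asymmetric: under-predictions cost more
```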
Benchmarking Specialized Databases for High-frequency Data
This paper presents a benchmarking suite designed for the evaluation and comparison of time series databases for high-frequency data, with a focus on financial applications. The proposed suite covers four specialized databases: ClickHouse, InfluxDB, kdb+, and TimescaleDB. The results from the suite demonstrate that kdb+ has the highest performance amongst the tested databases, while also highlighting the strengths and weaknesses of each of the databases.
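The benchmarking pattern itself is simple to sketch (the query callable, repetition count, and statistics reported below are placeholders; each database in the suite needs its own client and query syntax):

```python
import statistics
import time

def time_query(run_query, repeats: int = 20) -> dict:
    """Run a query callable repeatedly and report latency statistics in ms."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000)
    return {"median_ms": statistics.median(latencies),
            "p95_ms": sorted(latencies)[int(0.95 * (repeats - 1))]}

# Example with a dummy workload standing in for a real database call.
print(time_query(lambda: sum(range(100_000))))
```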
System III: Learning with Domain Knowledge for Safety Constraints
Fazl Barez, Hosein Hasanbeig, and Alessandro Abate
Reinforcement learning agents naturally learn from extensive exploration. Exploration is costly and can be unsafe in safety-critical domains. This paper proposes a novel framework for incorporating domain knowledge to help guide safe exploration and boost sample efficiency. Previous approaches impose constraints, such as regularisation parameters in neural networks, that rely on large sample sets and often are not suitable for safety-critical domains where agents should almost always avoid unsafe actions.
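One simplified way to fold domain knowledge into exploration, with the constraint predicate and environment interface invented for illustration, is to check each exploratory action against a hand-written safety rule and substitute a safe fallback:

```python
import random

def safe_step(env_state, candidate_action, is_unsafe, safe_fallback):
    """Apply a domain-knowledge constraint before an action is executed."""
    if is_unsafe(env_state, candidate_action):
        return safe_fallback(env_state)
    return candidate_action

# Toy domain rule: never accelerate when close to an obstacle.
is_unsafe = lambda s, a: a == "accelerate" and s["distance_to_obstacle"] < 5.0
safe_fallback = lambda s: "brake"

state = {"distance_to_obstacle": 3.2}
action = random.choice(["accelerate", "coast", "brake"])   # exploratory proposal
print(safe_step(state, action, is_unsafe, safe_fallback))
```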
2022
PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration
Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, and 6 more authors
2021
ED2: An Environment Dynamics Decomposition Framework for World Model Construction
Cong Wang, Tianpei Yang, Jianye Hao, Yan Zheng, Hongyao Tang, and 5 more authors
Model-based reinforcement learning methods achieve significant sample efficiency in many tasks, but their performance is often limited by model error. To reduce this error, previous works fit the entire environment dynamics with a single well-designed network, treating the dynamics as a black box. However, these methods fail to exploit the fact that the dynamics may decompose into multiple sub-dynamics, which can be modeled separately, allowing the world model to be constructed more accurately.
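An illustrative decomposition of a dynamics model into sub-models (the state split and network sizes are placeholders, not ED2's actual method): each sub-network predicts one group of state dimensions and the world model concatenates their predictions:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2
GROUPS = [(0, 4), (4, 8)]          # e.g. pose dimensions vs. velocity dimensions

class DecomposedDynamics(nn.Module):
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                          nn.Linear(64, hi - lo))
            for lo, hi in GROUPS
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        # Each head models one sub-dynamics; together they predict the next state.
        return torch.cat([head(x) for head in self.heads], dim=-1)

model = DecomposedDynamics()
print(model(torch.randn(16, STATE_DIM), torch.randn(16, ACTION_DIM)).shape)  # [16, 8]
```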
Discovering topics and trends in the UK Government web archive
David Beavan, Fazl Barez, M Bel, John Fitzgerald, Eirini Goudarouli, and 2 more authors
This report takes steps towards improving search and discovery within this vast archive for future users, and towards unlocking the UKGWA collection for research and experimentation by treating it as data (i.e. as a dataset at scale). The UKGWA has begun to examine independently the usefulness of modelling the hyperlinked structure of its collection for advanced corpus exploration; the aim of this collaboration is to test algorithms for finding documents via the topics they cover, envisioning a future convergence of these two research frameworks. The collection is a diachronic corpus ideal for studying how topics emerge and feature across government websites over time, indicating engagement priorities and how they change.
Talks
Interpretability for Safety and Alignment (Intelligent Cooperation Workshop, Foresight Institute, SF, USA).