
Hi, I'm Fazl Barez

AI safety and interpretability researcher.

I’m a Research Fellow at the Torr Vision Group (TVG), University of Oxford, where I lead safety research. I’m also a Research Advisor at Apart Research, and I hold affiliations with the Krueger AI Safety Lab (KASL), the Centre for the Study of Existential Risk at the University of Cambridge, and the Future of Life Institute.

Previously, I was a Technology and Security Policy Fellow at RAND, worked on interpretability at Amazon and The DataLab, on safe recommender systems at Huawei, and on building a finance tool for budget management and economic scenario forecasting at NatWest Group.

Jun 20, 2024 New paper! 🤖 Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models
May 15, 2024 Our paper on how LLMs relearn removed concepts has been accepted at ACL 2024 🎉 See you in Bangkok 🇹🇭
May 13, 2024 I gave a talk about machine unlearning at Foresight’s AGI workshop 🇺🇸
May 2, 2024 Two papers accepted at ICML 2024, and excited to be co-organising the first Mechanistic Interpretability workshop at ICML 2024 in Vienna ❤️
Apr 18, 2024 Excited to be serving on the Programme Committee at ECAI 2024, the 27th European Conference on Artificial Intelligence 🇪🇸

Research Interests


My research focuses on ensuring AI systems are safe, reliable, and beneficial as they grow more advanced. I work on techniques including, but not limited to, interpretability, machine unlearning, model editing, and value alignment.

I take an interdisciplinary approach, drawing inspiration from fields such as neuroscience and philosophy to enrich techniques for safe and beneficial AI. I openly share my work on these topics to advance AI safety through rigorous research and to facilitate positive real-world impact.

I am always eager to discuss ideas or opportunities for collaboration. Please feel free to send me an email if you would like to connect!

Publications

* = Equal Contribution

For a more complete list, visit my Google Scholar profile.

2023

  1. Understanding Addition in Transformers
    Philip Quirke, and Fazl Barez
  2. Measuring Value Alignment
    Fazl Barez, and Philip Torr
  3. The Alan Turing Institute’s response to the House of Lords Large Language Models Call for Evidence
    Fazl Barez, Philip H. S. Torr, Aleksandar Petrov, Carolyn Ashurst, Jennifer Ding, and 22 more authors
  4. The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python
    Antonio Valerio Miceli Barone*, Fazl Barez*, Ioannis Konstas, and Shay B Cohen
  5. Neuron to Graph: Interpreting Language Model Neurons at Scale
    Alex Foote*, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, and 1 more author
  6. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
    Jason Hoelscher-Obermaier*, Julia Persson*, Esben Kran, Ioannis Konstas, and Fazl Barez*
  7. Fairness in AI and Its Long-Term Implications on Society
    Ondrej Bohdal*, Timothy Hospedales, Philip H. S. Torr, and Fazl Barez*

2021

  1. Discovering topics and trends in the UK Government web archive
    David Beavan, Fazl Barez, M Bel, John Fitzgerald, Eirini Goudarouli, and 2 more authors

Keynotes and Invited Talks

Invited Talk - Foresight - AGI: Safety & Security Workshop (San Francisco 2024, USA)

2024 - 🔗 Info - 📄 Slides - 🎥 Presentation

Invited Panelist - Mechanistic Interpretability (ICLR 2024, Austria)

2024 - 🔗 Info

Invited Talk - Technical AI Safety Conference 2024 (TAIS Conference, Japan)

2024 - 🔗 Info - 🎥 Presentation

Invited Talk - Introduction to AI safety: Can we remove undesired behaviour from AI? (KAIST, South Korea)

2024 - 🔗 Info - 📄 Slides

Keynote Speaker - Personalization of Generative AI Systems (EACL 2024, Malta)

2024 - 🔗 Info

Invited Talk - Interpretability for Safety and Alignment (Foresight Institute, USA)

2023 - 🔗 Info - 🎥 Presentation


Selected Awards and Honors