
Hi, I'm Fazl Barez

AI safety and interpretability researcher.

I’m a Research Fellow at the University of Oxford, leading research in AI safety and interpretability. Through collaboration across academia, AGI labs, AI Safety Institutes, government organizations, and the public sector, I work to accelerate progress in AI safety research and to translate technical insights into practical solutions that benefit society. My research has helped shape both academic directions and industry safety practices.

My academic affiliations include the Centre for the Study of Existential Risk (CSER) and the Krueger AI Safety Lab (KASL) at the University of Cambridge, the Digital Trust Centre at Nanyang Technological University, and the School of Informatics at the University of Edinburgh. As a member of the European Laboratory for Learning and Intelligent Systems (ELLIS), I contribute to advancing AI safety research across Europe. I also serve as a research consultant at Anthropic.

Prior to my current roles, I served as Co-Director and Advisor at Apart Research and as a Technology and Security Policy Fellow at RAND Corporation. Earlier, I developed interpretability solutions at Amazon and The DataLab, built recommender systems at Huawei, and created financial forecasting tools at RBS Group.

Nov 20, 2024 Excited to speak about unlearning and safety at the How To Evaluate AI Privacy Tutorial at NeurIPS 2024! 🍁🇨🇦
Sep 20, 2024 Our paper Interpreting Learned Feedback Patterns in Large Language Models has been accepted at NeurIPS 2024! See you in Vancouver 🍁🇨🇦
Sep 20, 2024 Two papers accepted at EMNLP 2024! See you in Miami ❤️
Aug 11, 2024 Excited to be presenting our paper Large Language Models Relearn Removed Concepts at ACL 2024! See you all in Bangkok 🇹🇭!
Aug 10, 2024 Co-organised the first workshop on Mechanistic Interpretability at ICML 2024! 🎉

Research Interests


My research focuses on ensuring AI systems are safe, reliable, and beneficial as they grow more advanced. I work on techniques including, but not limited to, mechanistic interpretability, unlearning, and model editing.

My technical research in machine learning is enriched by engaging with insights from policy, cognitive science, and governance, as I work towards making AI systems more transparent, controllable, and reliably aligned with human values.

I am eager to discuss ideas or opportunities for collaboration. I am particularly committed to collaborating with and mentoring researchers from disadvantaged backgrounds; if this is you, I strongly encourage you to reach out at first_name[at]robots[dot]ox[dot]ac[dot]uk.

Selected Publications

* = Equal Contribution

For a more complete list, visit my Google Scholar profile.

2024

  1. Interpreting Learned Feedback Patterns in Large Language Models
     Luke Marks*, Amir Abdullah*, Clement Neo, Rauno Arike, Philip Krueger, and 1 more author
  2. Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI
     Francisco Eiras, Aleksandar Petrov, [...], Fazl Barez, [...], and 3 more authors
  3. Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
     Michael Lan*, Philip Torr, and Fazl Barez*
  4. Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
     Clement Neo*, Shay B. Cohen, and Fazl Barez*
  5. Large Language Models Relearn Removed Concepts
     Michelle Lo*, Shay B. Cohen, and Fazl Barez*

2023

  1. Understanding Addition in Transformers
     Philip Quirke, and Fazl Barez
  2. Measuring Value Alignment
     Fazl Barez, and Philip Torr
  3. The Alan Turing Institute’s response to the House of Lords Large Language Models Call for Evidence
     Fazl Barez, Philip H. S. Torr, Aleksandar Petrov, Carolyn Ashurst, Jennifer Ding, and 22 more authors
  4. The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python
     Antonio Valerio Miceli Barone*, Fazl Barez*, Ioannis Konstas, and Shay B. Cohen
  5. Neuron to Graph: Interpreting Language Model Neurons at Scale
     Alex Foote*, Neel Nanda, Esben Kran, Ioannis Konstas, Shay B. Cohen, and 1 more author
  6. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
     Jason Hoelscher-Obermaier*, Julia Persson*, Esben Kran, Ioannis Konstas, and Fazl Barez*
  7. Fairness in AI and Its Long-Term Implications on Society
     Ondrej Bohdal*, Timothy Hospedales, Philip H. S. Torr, and Fazl Barez*

2021

  1. Discovering topics and trends in the UK Government web archive
     David Beavan, Fazl Barez, M Bel, John Fitzgerald, Eirini Goudarouli, and 2 more authors

Keynotes and Invited Talks

Invited Panelist - Dialogue on Digital Trust and Safe AI 2024: Building bridges for a safe future (Singapore)

2024 - 🔗 Info

Invited Talk - Digital Trust Centre (Singapore AI Safety Institute) - Mechanistic Interpretability for AI Safety (NTU Singapore)

2024 - 🔗 Info

Invited Talk - Digital Trust Centre (Singapore AI Safety Institute) - Unlearning and Relearning in LLMs

2024 - 🔗 Info

Invited Talk - Foresight - AGI: Safety & Security Workshop (San Francisco, USA)

2024 - 🔗 Info - 📄 Slides - 🎥 Presentation

Invited Panelist - Mechanistic Interpretability (ICLR 2024, Austria)

2024 - 🔗 Info

Invited Talk - Technical AI Safety Conference 2024 (TAIS Conference, Japan)

2024 - 🔗 Info - 🎥 Presentation

Invited Talk - Introduction to AI safety: Can we remove undesired behaviour from AI? (KAIST, South Korea)

2024 - 🔗 Info - 📄 Slides

Keynote Speaker - Personalization of Generative AI Systems (EACL 2024, Malta)

2024 - 🔗 Info

Invited Talk - Interpretability for Safety and Alignment (Foresight Institute, USA)

2023 - 🔗 Info - 🎥 Presentation


Selected Awards and Honors