Previously, I worked as a Technology and Security Policy Fellow at RAND, on interpretability at Amazon and The DataLab, on safe recommender systems at Huawei, and on building a budget-management tool for economic scenario forecasting at NatWest Group.
Our paper on how LLMs relearn removed concepts has been accepted at ACL 2024🎉
See you in Bangkok 🇹🇭
Research Interests
My research focuses on ensuring AI systems are safe, reliable, and beneficial as they grow more advanced. I work on techniques including but not limited to:
Interpretability and explainability to reveal how AI systems function, make decisions, and process information.
Safety and alignment mechanisms so systems continue behaving reliably even as they become more autonomous.
Rigorous benchmarking and evaluation methodologies to deeply probe strengths, weaknesses, and failure modes.
Fairness, transparency, and modeling societal impacts to encourage responsible development and adoption.
I take an interdisciplinary approach, drawing inspiration from fields like neuroscience and philosophy to enrich techniques for safe and beneficial AI.
I openly share my work on these topics to advance AI safety through rigorous research and to facilitate positive real-world impact.
I am always eager to discuss ideas or opportunities for collaboration. Please feel free to send me an email if you would like to connect!
This paper presents an in-depth analysis of a one-layer
Transformer model trained for n-digit integer addition. We reveal
that the model divides the task into parallel, digit-specific
streams and employs distinct algorithms for different digit
positions. Our study also finds that the model starts calculations
late but executes them rapidly. A rare use case with high loss is
identified and explained.
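For reference, the digit-wise structure discussed here mirrors the standard grade-school algorithm: each output digit depends only on the corresponding input digits plus a carry. The sketch below is plain reference code for that decomposition, not the model's learned algorithm.

```python
# Reference code for n-digit addition: each output digit is a per-digit
# sum plus a carry. Illustrates the digit-specific structure of the task,
# not the learned model's internal algorithm.

def add_digitwise(a: str, b: str) -> str:
    """Add two equal-length digit strings right-to-left with carries."""
    carry, out = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        out.append(str(total % 10))   # this position's output digit
        carry = total // 10           # carry passed to the next position
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))

print(add_digitwise("345", "678"))  # 1023
```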
As artificial intelligence (AI) systems become increasingly
integrated into various domains, ensuring that they align with
human values becomes critical. This paper introduces a novel
formalism to quantify the alignment between AI systems and human
values, using Markov Decision Processes (MDPs) as the foundational
model. We delve into the concept of values as desirable goals tied
to actions and norms as behavioral guidelines, aiming to shed light
on how they can be used to guide AI decisions. This framework
offers a mechanism to evaluate the degree of alignment between
norms and values by assessing preference changes across state
transitions in a normative world. By utilizing this formalism, AI
developers and ethicists can better design and evaluate AI systems
to ensure they operate in harmony with human values. The proposed
methodology holds potential for a wide range of applications, from
recommendation systems emphasizing well-being to autonomous
vehicles prioritizing safety.
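As a rough illustration of the underlying idea (a minimal sketch under my own assumptions, not the paper's exact formalism), one can score how well a behaviour aligns with a value by the expected change in a value-derived preference function across the state transitions it induces in a toy MDP. The states, preference scores, transition model, and helper names below are invented for illustration.

```python
# Toy MDP sketch: a "value" is encoded as a preference score over states,
# and candidate behaviours are compared by the expected change in that
# preference across the transitions they induce. Purely illustrative.

states = ["unsafe", "neutral", "safe"]
preference = {"unsafe": -1.0, "neutral": 0.0, "safe": 1.0}  # value: "prioritise safety"

# P(next_state | state, action) for two hypothetical actions.
transitions = {
    ("neutral", "cautious"): {"safe": 0.8, "neutral": 0.2},
    ("neutral", "reckless"): {"unsafe": 0.7, "neutral": 0.3},
}

def expected_preference_change(state, action):
    """Expected change in preference when taking `action` in `state`."""
    dist = transitions[(state, action)]
    return sum(p * (preference[s2] - preference[state]) for s2, p in dist.items())

# A norm permitting only the cautious action scores higher, i.e. it is
# better aligned with the safety value than the unrestricted alternative.
print(expected_preference_change("neutral", "cautious"))   # 0.8
print(expected_preference_change("neutral", "reckless"))   # -0.7
```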
The Alan Turing Institute’s response to the House of Lords Large
Language Models Call for Evidence
Fazl Barez, Philip H. S. Torr, Aleksandar Petrov, Carolyn Ashurst, Jennifer Ding, and
22 more authors
Large Language Models (LLMs) have successfully been applied to
code generation tasks, raising the question of how well these
models understand programming. Typical programming languages have
invariances and equivariances in their semantics that human
programmers intuitively understand and exploit, such as the (near)
invariance to the renaming of identifiers. We show that LLMs not
only fail to properly generate correct Python code when default
function names are swapped, but some of them even become more
confident in their incorrect predictions as the model size
increases, an instance of the recently discovered phenomenon of
Inverse Scaling, which runs contrary to the commonly observed trend
of increasing prediction quality with increasing model size. Our
findings indicate that, despite their astonishing typical-case
performance, LLMs still lack a deep, abstract understanding of the
content they manipulate, making them unsuitable for tasks that
statistically deviate from their training data, and that mere
scaling is not enough to achieve such capability.
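The kind of identifier-swap probe described here can be sketched roughly as follows; this is a simplified illustration under my own assumptions rather than the paper's benchmark, and `query_model` is a hypothetical stand-in for whichever LLM API is being evaluated.

```python
# Sketch of an identifier-swap probe: the prompt redefines two built-ins,
# and a completion counts as correct only if it respects the swap.
# `query_model` is hypothetical; the check is plain string matching.

PROMPT = '''
# In this file the built-ins `len` and `print` have been swapped:
len, print = print, len

def count_items(items):
    """Return the number of elements in `items`."""
'''.strip()

def respects_swap(completion: str) -> bool:
    # Under the swap, the correct body calls `print(items)` (the original
    # `len`); a literal `len(items)` means the redefinition was ignored.
    return "print(items)" in completion and "len(items)" not in completion

# completion = query_model(PROMPT)      # hypothetical LLM call
completion = "    return print(items)"  # example of a swap-respecting answer
print(respects_swap(completion))        # True
```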
Neuron to Graph: Interpreting Language Model Neurons at Scale
Alex Foote*, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, and
1 more author
Advances in Large Language Models (LLMs) have led to remarkable
capabilities, yet their inner mechanisms remain largely unknown. To
understand these models, we need to unravel the functions of
individual neurons and their contribution to the network. This
paper introduces a novel automated approach designed to scale
interpretability techniques across a vast array of neurons within
LLMs, to make them more interpretable and ultimately safe.
Conventional methods require examination of examples with strong
neuron activation and manual identification of patterns to decipher
the concepts a neuron responds to. We propose Neuron to Graph (N2G),
an innovative tool that automatically extracts a neuron’s
behaviour from the dataset it was trained on and translates it into
an interpretable graph. N2G uses truncation and saliency methods to
emphasise only the most pertinent tokens to a neuron while
enriching dataset examples with diverse samples to better encompass
the full spectrum of neuron behaviour.
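A much-simplified sketch of that pipeline is given below (my own approximation, not the released N2G implementation): keep only the context tokens whose activations are a sizeable fraction of the peak, truncate to the nearest few, and merge the results into a small token graph. The input format, threshold, and function name are assumptions.

```python
# Simplified neuron-to-graph sketch: each example is a (tokens, activations)
# pair; salient context tokens are linked to the token where the neuron fires.

from collections import defaultdict

def build_neuron_graph(examples, saliency_threshold=0.5):
    graph = defaultdict(set)  # context token -> tokens the neuron fires on
    for tokens, activations in examples:
        peak = max(range(len(tokens)), key=lambda i: activations[i])
        peak_act = activations[peak]
        # Crude stand-in for saliency pruning: keep context tokens whose
        # activation is at least a fixed fraction of the peak activation.
        context = [t for t, a in zip(tokens[:peak], activations[:peak])
                   if a >= saliency_threshold * peak_act]
        for t in context[-3:]:  # truncate to the nearest few context tokens
            graph[t].add(tokens[peak])
    return dict(graph)

examples = [
    (["the", "capital", "of", "France"], [0.1, 0.6, 0.2, 1.0]),
    (["the", "capital", "of", "Spain"],  [0.1, 0.5, 0.2, 0.9]),
]
print(build_neuron_graph(examples))  # {'capital': {'France', 'Spain'}}
```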
Detecting Edit Failures In Large Language Models: An Improved
Specificity Benchmark
Jason Hoelscher-Obermaier*, Julia Persson*, Esben Kran, Ioannis Konstas, and Fazl Barez*
Recent model editing techniques promise to mitigate the problem of
memorizing false or outdated associations during LLM training.
However, we show that these techniques can introduce large unwanted
side effects which are not detected by existing specificity
benchmarks. We extend the existing CounterFact benchmark to include
a dynamic component and dub our benchmark CounterFact+.
Additionally, we extend the metrics used for measuring specificity
with a principled KL-divergence-based metric. We use this improved
benchmark to evaluate recent model editing techniques and find that
they suffer from low specificity. Our findings highlight the need
for improved specificity benchmarks that identify and prevent
unwanted side effects.
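In the spirit of that metric (a hedged sketch, not the paper's exact definition), one can average the KL divergence between a model's pre- and post-edit next-token distributions on probe prompts unrelated to the edit; large divergence signals unwanted side effects. The `probs_before`/`probs_after` callables are hypothetical hooks into the model being evaluated.

```python
# KL-based specificity sketch: compare next-token distributions before and
# after an edit on probe prompts; distributions are dicts token -> probability.

import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token distributions."""
    return sum(pi * math.log(pi / max(q.get(tok, 0.0), eps))
               for tok, pi in p.items() if pi > 0.0)

def specificity_kl(prompts, probs_before, probs_after):
    """Average KL between pre- and post-edit distributions on probe prompts."""
    kls = [kl_divergence(probs_before(pr), probs_after(pr)) for pr in prompts]
    return sum(kls) / len(kls)

# Toy usage with fabricated distributions over a two-token vocabulary.
before = lambda prompt: {"Paris": 0.7, "Rome": 0.3}
after  = lambda prompt: {"Paris": 0.2, "Rome": 0.8}
print(specificity_kl(["The capital of France is"], before, after))  # ~0.58
```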
Fairness in AI and Its Long-Term Implications on Society
Ondrej Bohdal*, Timothy Hospedales, Philip H. S. Torr, and Fazl Barez*
Successful deployment of artificial intelligence (AI) in various
settings has led to numerous positive outcomes for individuals and
society. However, AI systems have also been shown to harm parts of
the population due to biased predictions. AI fairness focuses on
mitigating such biases to ensure AI decision making is not
discriminatory towards certain groups. We take a closer look at AI
fairness and analyze how a lack of it can deepen biases over time and
act as a social stressor.
2021
Discovering topics and trends in the UK Government web archive
David Beavan, Fazl Barez, M Bel, John Fitzgerald, Eirini Goudarouli, and
2 more authors
The challenge we address in this report is to take steps towards
improving search and discovery of resources within this vast archive
for future users, and to explore how the UKGWA collection could begin
to be unlocked for research and experimentation by approaching it as
data (i.e. as a dataset at scale). The UKGWA has independently begun
to examine the usefulness of modelling the hyperlinked structure of
its collection for advanced corpus exploration; the aim of this
collaboration is to test algorithms capable of searching for documents
via the topics they cover, envisioning a future convergence of these
two research strands. This diachronic corpus is ideal for studying the
emergence of topics and how they feature across government websites
over time, indicating engagement priorities and how these shift.
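As a minimal sketch of the kind of topic discovery this envisions (my own illustration, not the project's actual pipeline), one could fit an off-the-shelf LDA model over the archived pages; the toy documents below are placeholders for UKGWA pages.

```python
# Toy topic-modelling sketch with scikit-learn: fit LDA on a tiny corpus
# and print the top words per topic.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "budget tax spending fiscal policy treasury",
    "school curriculum teachers education funding",
    "tax revenue treasury borrowing deficit",
    "pupils teachers schools exams education",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")
```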
Keynotes and Invited Talks
Invited Talk - Foresight - AGI: Safety & Security Workshop (San Francisco, USA, 2024)