My contributions include leading research with the UK AI Security Institute on machine unlearning for AI safety, developing the N2G algorithm adopted by OpenAI to evaluate sparse autoencoders for interpretability, and leading the Alan Turing Institute's response to the UK House of Lords inquiry on language models. I've also collaborated with Anthropic on studies investigating deception and reward hacking in language models, among other topics.
I'm grateful for the support of funders including OpenAI, Anthropic, the Future of Life Institute, NVIDIA, and others. I'm affiliated with Cambridge's CSER, NTU's Digital Trust Centre, Edinburgh's School of Informatics, and ELLIS. In 2024–2025, I worked with Anthropic's Alignment team. Previously, I was a researcher at Amazon and Huawei, and Co-director and Head of Research at Apart Research.
I'm looking for motivated people interested in AI Safety, Interpretability, and Technical AI Governance. I value collaborative work and am especially committed to partnering with researchers from underrepresented, marginalized, and otherwise disadvantaged backgrounds. If this resonates with you, please contact me at fazl[at]robots[dot]ox[dot]ac[dot]uk.