AI Safety and Alignment
October 13th–17th, 2025
Oxford
A 5-day intensive course on foundational concepts and frontier research in the field of AI safety and alignment.
Topics:
- The alignment problem, foundations and present day.
- Frontier alignment methods and evaluation.
- Interpretability and monitoring.
- Sociotechnical aspects of AI alignment.
Lecturer: Fazl Barez.
Teaching assistants: Hunar Batra, Matthew Farrugia-Roberts, James Oldfield, and Marta Ziosi (Guest Speaker/Tutor).
Guest lecturers:
- Yoshua Bengio, Université de Montréal, Mila, & LawZero.
- Neel Nanda, Google DeepMind.
- Joslyn Barnhart, Google DeepMind.
- Robert Trager, Oxford Martin AI Governance Initiative.
Schedule:
- Lectures: 10:00–13:00 Monday–Thursday. LR7, Information Engineering Building. Open to all Oxford students. Recorded.
- Labs: 14:00–17:00 Monday–Thursday. Eagle House. Reserved for AIMS CDT students. Lab materials will be made available.
- Guest speaker day: 10:00–16:30 Friday. Location TBD. Open to all Oxford students.
Prerequisites:
- Mathematical maturity with proficiency in probability, optimization, and machine learning.
- Familiarity with neural networks, gradient descent, and deep learning fundamentals.
- Python programming experience and ability to train basic neural networks.
Syllabus
The alignment problem (Monday, October 13th):
- Introduction to the alignment problem.
- Outer and inner alignment.
- Deceptive alignment, sandbagging, and scheming.
- Lab: Specification gaming and goal misgeneralisation in grid worlds.
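As a rough illustration of the specification-gaming idea covered in Monday's lab (this is not the lab material, just a minimal toy sketch): an agent that greedily maximises a misspecified proxy reward parks on a "reward tile" instead of ever reaching the intended goal.

```python
# Toy 1-D gridworld where the proxy reward differs from the intended goal.
# Intended behaviour: reach the flag at cell 4. Proxy reward: +1 per step
# spent on the reward tile at cell 1. A proxy-maximising agent never
# reaches the flag; it sits on the tile instead (specification gaming).

GRID = [" ", "R", " ", " ", "F"]  # R = reward tile, F = flag (intended goal)

def greedy_episode(start=0, steps=10):
    pos, proxy_reward = start, 0
    for _ in range(steps):
        # One-step lookahead on the proxy reward only.
        candidates = [max(pos - 1, 0), pos, min(pos + 1, len(GRID) - 1)]
        pos = max(candidates, key=lambda p: GRID[p] == "R")
        proxy_reward += GRID[pos] == "R"
    return pos, proxy_reward

final_pos, reward = greedy_episode()
print(final_pos, reward)  # → 1 10: high proxy reward, goal never reached
```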
Alignment methods and evaluation (Tuesday, October 14th):
- Frontier model training and alignment pipeline.
- Alignment methods: RLHF, constitutional AI, deliberative alignment, weak-to-strong generalisation.
- Benchmarks, evaluation, and scalable oversight.
- Lab: RLHF and Constitutional AI.
Interpretability and monitoring (Wednesday, October 15th):
- Mechanistic interpretability.
- Scheming, sandbagging, and monitoring.
- Case study: Sleeper agents.
- Lab: Activation patching in a large language model.
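The core move in Wednesday's lab, activation patching, can be sketched in a few lines (this is not the lab code; a minimal numpy illustration with a toy two-layer network standing in for a transformer block): cache a hidden activation from a "clean" run and substitute it into a "corrupted" run, then observe how the output changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP standing in for a transformer block.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def forward(x, patch=None):
    """Run the model; optionally overwrite the hidden activation."""
    h = np.tanh(x @ W1)   # hidden activation we can intervene on
    if patch is not None:
        h = patch         # activation patching: swap in the cached value
    return h @ W2, h

clean = rng.normal(size=4)
corrupted = rng.normal(size=4)

clean_out, clean_h = forward(clean)
corr_out, _ = forward(corrupted)
# Patch the clean hidden state into the corrupted run.
patched_out, _ = forward(corrupted, patch=clean_h)

# Patching the full hidden layer restores the clean output, localising
# the behaviour difference to that activation.
assert np.allclose(patched_out, clean_out)
```

In the lab the same intervention is applied to individual components of a large language model (e.g. one attention head or layer at one token position) to localise which activations mediate a behaviour.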
Sociotechnical aspects of AI alignment (Thursday, October 16th):
- Governance frameworks and coordination.
- Economic impacts of transformative AI.
- Security applications and dual-use risks.
- Lab: Policy practicum.
Guest speaker day (Friday, October 17th):
- Robert Trager (topic TBA).
- Joslyn Barnhart (topic TBA).
- Neel Nanda (topic TBA).
- Yoshua Bengio (topic TBA).
Assessment
To pass the module, AIMS students must:
- Participate in labs Monday–Thursday.
- Complete a 1,500-word report on a chosen alignment failure mode. The report is to be completed during the labs (after completing the day's exercises) and outside of contact hours.
The deadline for the report is Monday, October 20th, 23:59.