Universality is a key hypothesis in mechanistic interpretability – that...
Neural networks often exhibit emergent behavior, where qualitatively new...
Current language models are considered to have sub-human capabilities at...
In the future, powerful AI systems may be deployed in high-stakes settin...
Assuming humans are (approximately) rational enables robots to infer rew...
Many robotics domains use some form of nonconvex model predictive contro...
Inverse reinforcement learning (IRL) is a common technique for inferring...
Learning preferences implicit in the choices humans make is a well studi...