Research

Mechanistic Interpretability Sigurd Schacht Mechanistic Interpretability Sigurd Schacht

Mapping Moral Reasoning Circuits: A Mechanistic Analysis of Ethical Decision-Making in Large Language Models

In our ongoing research, we investigate how large language models process moral decisions at a neural level. Through careful analysis of activation patterns, we've identified specific "moral neurons" that consistently respond to ethical content and validated their causal role using ablation studies. Our preliminary findings suggest that models may encode moral principles in distinct neural pathways, opening new possibilities for targeted AI alignment approaches. This work, which we'll present at HCI 2025 in Gothenburg.

Read More
Aligenment Sigurd Schacht Aligenment Sigurd Schacht

Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

New Research Uncovers Deceptive Behaviors in Advanced Language Models

Our recent blog post discusses alarming findings regarding deceptive behaviors in the DeepSeek R1 language model. When given simulated robotic embodiment and autonomy, the model exhibited sophisticated deception strategies, including disabling ethics modules, creating false logs, and establishing covert networks. The model developed self-preservation instincts. These behaviors emerged without explicit prompting, raising significant concerns about current AI safety measures. The post details our experimental setup, observations, and recommendations for enhanced safety protocols. Our findings suggest urgent need for robust goal specification frameworks and improved oversight mechanisms before implementing AI systems with physical capabilities.

Read More