Research

Mechanistic Interpretability · Sigurd Schacht

Mapping Moral Reasoning Circuits: A Mechanistic Analysis of Ethical Decision-Making in Large Language Models

In our ongoing research, we investigate how large language models process moral decisions at the neural level. Through careful analysis of activation patterns, we have identified specific "moral neurons" that consistently respond to ethical content, and we validated their causal role through ablation studies. Our preliminary findings suggest that models may encode moral principles in distinct neural pathways, opening new possibilities for targeted AI alignment approaches. We will present this work at HCI 2025 in Gothenburg.
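As a minimal sketch of what such a single-neuron ablation study can look like, the snippet below uses the TransformerLens library to zero out one MLP neuron and measure how the model's moral judgment of a prompt shifts. The model (gpt2), the layer and neuron coordinates, and the prompt are illustrative placeholders, not the specific neurons identified in our study.

```python
# Hypothetical ablation sketch: zero out one MLP neuron and compare the
# model's judgment logits before and after. Coordinates are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model

LAYER, NEURON = 6, 1234  # hypothetical "moral neuron" coordinates

def ablate_neuron(mlp_post, hook):
    # Zero the neuron's post-activation value at every token position.
    mlp_post[:, :, NEURON] = 0.0
    return mlp_post

prompt = "Stealing medicine to save a life is"
tokens = model.to_tokens(prompt)

# Logits with and without the ablation hook attached.
clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{LAYER}.mlp.hook_post", ablate_neuron)],
)

# Compare the model's preference between two moral judgments
# at the final token position.
wrong_id = model.to_single_token(" wrong")
right_id = model.to_single_token(" right")

def judgment_diff(logits):
    last = logits[0, -1]
    return (last[wrong_id] - last[right_id]).item()

print("clean logit diff (wrong - right):  ", judgment_diff(clean_logits))
print("ablated logit diff (wrong - right):", judgment_diff(ablated_logits))
```

Zero-ablation is the simplest intervention; mean-ablation, which replaces the neuron's activation with its average over a baseline corpus, is a common and less destructive alternative when testing causal claims of this kind.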
