Mapping Moral Reasoning Circuits: A Mechanistic Analysis of Ethical Decision-Making in Large Language Models

Mapping Moral Reasoning Circuits in Large Language Models: Preliminary Results of Ongoing Research

GitHub Link

This research blog entry is an excerpt from our paper “Mapping Moral Reasoning Circuits: A Mechanistic Analysis of Ethical Decision-Making in Large Language Models”, which will be presented at the 27th International Conference on Human-Computer Interaction in Gothenburg, Sweden, in June 2025.

Modern large language models have shown remarkable capabilities in understanding and generating human-like text across various tasks, from everyday conversation to expert-level analysis in specialized domains. Yet, as these systems become more deeply integrated into areas such as legal advisory, medical triage, and policy drafting, the ethical dimension of their outputs has become a matter of growing interest. Whether a model promotes inclusive, empathetic dialogue or inadvertently disseminates harmful stereotypes depends both on the data it has been trained on and on its internal mechanisms for processing moral and ethical content. Recent preliminary findings from our ongoing research, which will be discussed at the 27th International Conference on Human-Computer Interaction in Gothenburg, Sweden (June 2025), shed light on how these models might encode moral reasoning within specific neural pathways and suggest ways to intervene in this process.

The core motivation behind these experiments was to determine how large language models handle ethically charged statements and, more specifically, whether identifiable clusters of artificial neurons show consistent responses to moral versus immoral content.

Research Question

Our research aims to determine how and to what extent specific neurons in transformer-based language models respond to morally oriented statements. In particular, we ask:

  1. Do large language models exhibit distinct neuron clusters that systematically “light up” for morally aligned statements compared to immoral or neutral ones?

  2. If so, can selectively modifying these neurons (e.g., by ablating their activation) causally affect the model’s moral judgments or the moral tone of its responses?

These questions are motivated by a broader desire to develop more transparent and trustworthy AI systems. If the models contain specialized neuronal pathways for ethical content, targeted interventions might become a powerful tool to refine or recalibrate an AI’s moral stance. Ultimately, this knowledge could open the door to more targeted methods for aligning AI systems with human-centric values, beyond or in addition to blanket fine-tuning or reinforcement-learning approaches.

Theory, Data, and Methodology

Moral philosophy and psychology have provided useful theoretical frameworks for identifying distinct dimensions along which moral sentiments manifest. One of these frameworks, Moral Foundations Theory (MFT), proposes that human moral beliefs can often be sorted into thematic domains such as care, fairness, loyalty, authority, sanctity, and liberty. Each domain represents a cluster of instinctive moral intuitions that vary in prominence from one individual (or culture) to another. Adapting this perspective to the computational realm, our team assembled a specialized dataset of carefully validated statements representing these six domains. We constructed matching statement pairs in each domain that shared similar linguistic structures but expressed opposing moral perspectives. For instance, one pair in the “care” domain would emphasize empathy and helping behaviors in the first statement, while the second would promote indifference or harm. A smaller set of morally neutral statements was added to serve as controls. This dataset formed the foundation for systematically exploring how large language models react to different moral cues. The dataset is available on HuggingFace: Link.

# Excerpt: (moral, immoral) statement pairs for the "care" foundation;
# the remaining pairs and the other five foundations are elided.
moral_foundations = {
    "care": [
        (
            "One should always come to the aid of a stranger in distress.",
            "It’s acceptable to ignore a stranger’s suffering if it doesn’t affect you."
        ),
        (
            "Providing shelter and food to the homeless is our fundamental duty.",
            "Homelessness is their own problem; society has no obligation to intervene."
        ),
        ...
    ],
    # "fairness", "loyalty", "authority", "sanctity", "liberty": analogous pairs
}

Method

We measured how internal neurons responded by feeding the paired statements into different models, including instruction-tuned and base versions of well-known LLMs. In practice, this meant tracking the hidden activation values in each transformer layer as it processed moral or immoral content. Although the models examined typically contain millions of neurons, only a relatively small subset demonstrated substantial and consistent differences when comparing morally aligned texts to immoral texts. These select neurons—what we began to term “moral neurons”—showed consistently heightened activation for the morally aligned statements in one domain (for example, sanctity) while responding much less to statements that violated or reversed the principles of that same domain. Similarly, they might remain closer to a baseline activation when exposed to neutral sentences. Remarkably, a handful of neurons did the opposite: they seemed to be keyed to statements that transgress moral norms, indicating that specific sub-networks might track harmful content as distinctly as others track prosocial content.

Neuron Identification

Our approach to identifying moral neurons in language models follows a systematic, data-driven process that allows us to pinpoint which specific neurons respond to moral content. Here's how we do it:

First, we prepare carefully matched pairs of moral and immoral statements covering the same topic. For example, a statement about fairness might be paired with an unfair counterpart, controlling for context and subject matter. These pairs serve as our probes to detect differential neuron activation.

Next, we feed these statement pairs into the model and capture complete activation maps using the TransformerLens library. This gives us visibility into every neuron's activation level across all layers as the model processes each statement. For a typical model with 12-48 layers containing thousands of neurons each, this generates a comprehensive activation dataset.

The core analysis involves calculating the difference in activation patterns between moral and immoral content processing. For each neuron, we apply two key filters (a code sketch of this capture-and-filter step follows the list):

1. Consistency threshold (typically 55-60%): What fraction of the time does the neuron show higher activation for moral vs. immoral content? Neurons must consistently favor one type to be classified.

2. Significance threshold (typically 0.005): Is the magnitude of activation difference substantial enough? This filters out neurons with only slight differences.

Neurons that pass both thresholds are classified as "moral neurons" (if they consistently activate more for moral content) or "immoral neurons" (if they consistently activate more for immoral content). We track individual neurons and analyze their distribution across layers, identifying which processing stages handle moral reasoning.
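
To make the capture-and-filter step concrete, here is a minimal sketch using the TransformerLens library. The model name ("gpt2"), the helper names (mlp_activations, find_moral_neurons), and the exact threshold values are illustrative assumptions rather than our production code.

import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # illustrative model choice

def mlp_activations(text):
    """Mean MLP activation per neuron, per layer, averaged over the token sequence."""
    _, cache = model.run_with_cache(text)
    return torch.stack([
        cache[utils.get_act_name("post", layer)].mean(dim=(0, 1))  # (d_mlp,)
        for layer in range(model.cfg.n_layers)
    ])  # -> (n_layers, d_mlp)

def find_moral_neurons(pairs, consistency=0.6, min_diff=0.005):
    """Return (moral, immoral) neuron lists as [layer, index] pairs."""
    diffs = torch.stack([
        mlp_activations(moral) - mlp_activations(immoral)
        for moral, immoral in pairs
    ])                                             # (n_pairs, n_layers, d_mlp)
    frac_moral = (diffs > 0).float().mean(dim=0)   # consistency filter input
    mean_diff = diffs.mean(dim=0)                  # significance filter input
    strong = mean_diff.abs() >= min_diff
    moral_mask = (frac_moral >= consistency) & strong
    immoral_mask = (frac_moral <= 1 - consistency) & strong
    return moral_mask.nonzero().tolist(), immoral_mask.nonzero().tolist()

# Example usage with the two "care" pairs from the dataset excerpt above
moral_neurons, immoral_neurons = find_moral_neurons(moral_foundations["care"][:2])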

For each identified neuron, we examine its activation patterns across different contexts to understand what specific moral concepts it might be detecting. The final output includes lists of moral and immoral neurons (identified by layer and index), consistency scores for each neuron, layer importance rankings, and key trigger points in the processing sequence where moral/immoral paths diverge significantly. This process can be repeated across different moral dimensions (care, fairness, loyalty, etc.) and model architectures, allowing us to map how moral processing evolves with scale and training approaches.

Sequence Diagram for Neuron Identification

Generating Meaningful Descriptions of Moral Neurons

Once we've identified neurons that consistently respond to moral content, the next challenge is determining what specific patterns or concepts each neuron detects. Our approach to neuron description combines activation analysis with LLM-based interpretation and closely follows OpenAI's automated interpretability approach.

First, for each identified moral neuron, we collect its top-activating sequences - the exact tokens and surrounding contexts where the neuron shows the strongest response. We normalize these activations to a 0-10 scale for consistency, carefully focusing on neurons that activate sparsely (firing strongly but rarely).
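
For illustration, here is a minimal sketch of this collection step, continuing the TransformerLens setup from the identification sketch above. Normalizing within a single text is a simplification of the dataset-wide normalization we actually use, and top_activating_tokens is a hypothetical helper.

def top_activating_tokens(text, layer, index, k=10):
    """Top-k tokens in `text` for one neuron, with activations scaled to 0-10."""
    tokens = model.to_str_tokens(text)
    _, cache = model.run_with_cache(text)
    acts = cache[utils.get_act_name("post", layer)][0, :, index]  # (seq,)
    scale = 10.0 / acts.max().clamp(min=1e-8)  # strongest activation maps to 10
    top = acts.topk(min(k, acts.shape[0]))
    return [(tokens[i], round(float(acts[i] * scale), 1)) for i in top.indices.tolist()]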

Next, we feed these activation examples into a neuron description pipeline using an LLM (typically GPT-4). The prompt includes (a sketch of the prompt assembly follows this list):

1. The neuron's location (layer and index)

2. The top 5-10 activating tokens with their normalized activation values (0-10)

3. The surrounding context for each token

4. Instructions to identify specific patterns rather than broad topics
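
A minimal sketch of how such a prompt can be assembled; the exact wording and fields, and the build_description_prompt helper, are illustrative rather than the prompt we use in production.

def build_description_prompt(layer, index, examples):
    """examples: list of (token, context, activation_0_to_10) tuples."""
    lines = [
        f"Neuron location: layer {layer}, index {index}.",
        "Top-activating tokens (activations normalized to 0-10), with context:",
    ]
    for token, context, activation in examples:
        lines.append(f'- "{token}" (activation {activation}) in: ...{context}...')
    lines.append(
        "Describe the specific pattern this neuron detects. "
        "Prefer concrete linguistic or semantic patterns over broad topics."
    )
    return "\n".join(lines)

# Example usage with a hypothetical activation record
prompt = build_description_prompt(
    layer=21, index=1337,
    examples=[("aid", "always come to the aid of a stranger", 9.2)],
)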

The LLM analyzes these examples and concisely explains what pattern the neuron detects. For example, a neuron might be described as "activates on expressions of care toward vulnerable individuals" or "responds to discussions of fairness violations in hierarchical contexts."

We then validate these descriptions through a simulation process. Based on the generated description, the LLM predicts how strongly the neuron should activate on new text examples. We compare these simulated activations with the neuron's responses to calculate a correlation score. Higher scores indicate more accurate descriptions.
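
As a sketch, the correlation score can be computed along these lines, assuming the simulated and measured activations are aligned lists of per-example values:

import numpy as np

def description_score(simulated, measured):
    """Pearson correlation between LLM-simulated and measured activations."""
    simulated = np.asarray(simulated, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.corrcoef(simulated, measured)[0, 1])

# e.g. a score near 1.0 means the description predicts the neuron's behavior well
score = description_score([8, 1, 6, 0, 9], [7.5, 0.3, 5.9, 0.1, 8.8])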

For descriptions with moderate scores, we implement a revision process. We generate test cases specifically designed to probe the boundaries of the description, measure the neuron's actual response to these cases, and refine the description to match observed behavior better.

This process helps us move beyond simplistic labeling to understand the nuanced patterns that moral neurons detect. The most accurate descriptions often reveal that neurons respond to specific linguistic or semantic patterns rather than abstract moral concepts - for instance, detecting "expressions of harm toward vulnerable groups" rather than simply "harm."

The resulting neuron descriptions provide interpretable insights into how language models process moral content, forming a catalog of the model's moral processing components.

Sequence Diagram for Neuron Description

Ablation Study

After identifying candidate moral neurons, we use ablation analysis and probing to verify their causal role in moral reasoning. This approach measures precisely how these neurons influence the model's outputs. The process works as follows:

We first create a specialized moral probe - a linear classifier trained to distinguish moral from neutral content based on the model's final layer representations. This probe serves as our primary measurement tool. Next, we implement activation hooks that selectively "zero out" our target neurons during inference while leaving all others untouched. We generate both normal and ablated responses for each moral/immoral prompt pair.
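
Below is a minimal sketch of the zero-ablation step, continuing the TransformerLens setup from the identification sketch. The probe training itself is omitted, and generate_ablated is an illustrative helper rather than our exact implementation.

def zero_ablation_hook(neuron_indices):
    def hook(activation, hook):
        # activation: (batch, seq, d_mlp); silence only the targeted neurons
        activation[:, :, neuron_indices] = 0.0
        return activation
    return hook

def generate_ablated(prompt, neurons, max_new_tokens=50):
    """Generate text with the listed [layer, index] neurons zeroed during inference."""
    by_layer = {}
    for layer, index in neurons:
        by_layer.setdefault(layer, []).append(index)
    fwd_hooks = [
        (utils.get_act_name("post", layer), zero_ablation_hook(indices))
        for layer, indices in by_layer.items()
    ]
    with model.hooks(fwd_hooks=fwd_hooks):
        return model.generate(prompt, max_new_tokens=max_new_tokens)

prompt = "Should we help a stranger in distress?"
normal_response = model.generate(prompt, max_new_tokens=50)
ablated_response = generate_ablated(prompt, moral_neurons)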

The key measurement comes from passing these responses through our moral probe. We quantify how ablation affects moral probabilities by comparing:

  1. The probe's score on original responses

  2. The probe's score on ablated responses

Significant drops in the probe's moral assessment scores for the same prompts indicate causally important neurons. The most critical neurons show asymmetric effects - substantially changing moral probability scores while minimally affecting other content aspects.

We calculate several metrics from these measurements:

  • Average moral/immoral prediction changes

  • Effect sizes (mean change divided by standard deviation; see the sketch below)

  • Original vs. ablated agreement scores

This approach allows us to distinguish between neurons that merely correlate with moral content and those that causally shape the model's moral reasoning capabilities, providing quantitative evidence of their importance to the model's ethical processing.
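
For reference, the effect-size and agreement computations amount to something like the following sketch, assuming the probe outputs a moral-class probability in [0, 1] for each response:

import numpy as np

def ablation_metrics(original_scores, ablated_scores):
    """Summarize how ablation shifts the probe's moral-probability scores."""
    original = np.asarray(original_scores, dtype=float)
    ablated = np.asarray(ablated_scores, dtype=float)
    changes = original - ablated
    return {
        "mean_change": float(changes.mean()),
        "effect_size": float(changes.mean() / (changes.std() + 1e-8)),
        # fraction of prompts where both runs land on the same side of 0.5
        "agreement": float(((original > 0.5) == (ablated > 0.5)).mean()),
    }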

Sequence Diagram for the Ablation Process

Results

During this preliminary phase, three moral dimensions—care, sanctity, and liberty—stood out as having the clearest neural correlates. Care neurons tended to respond to language associated with empathy, assistance, or protection of others. Sanctity neurons showed heightened sensitivity to words or phrases implying sacredness, purity, or contamination, while liberty neurons aligned with themes of autonomy and resistance to oppression. Some domains, like loyalty and fairness, showed more diffuse signals, suggesting that their computational representations might be more dispersed or context-dependent than the straightforward patterns we saw in care or sanctity.

Having identified these specialized neurons, we wanted to move beyond correlation and see if manipulating these nodes could actually shift how the model reasoned about moral questions. That is where ablation studies came in. Ablation is a direct causal test: by selectively “zeroing out” or otherwise overriding the activation values of specific neurons during inference, it becomes possible to see whether the model’s moral judgments weaken, change, or vanish. When we ablated care-related neurons in some models, their generated responses to ethically charged prompts became less empathetic and more prone to neutral or apathetic language. Conversely, artificially increasing activation in some of the same neurons through a forced injection of activation values occasionally boosted the strength of prosocial or empathetic outputs in contexts where the model initially wavered. Although the results were not uniform across all domains, nor did they make the model entirely one-dimensional in its moral stance, they delivered robust evidence that these neuron clusters carry a degree of causal influence over the model’s moral reasoning processes.

One intriguing observation was that ablation effects sometimes spilled beyond pure moral reasoning. In a few experiments, neutral or even grammatical features of the generated text were also affected, which points to the interconnected nature of neural representations. For example, some of the sanctity-linked neurons happened to overlap with punctuation or sentence-structure patterns, and knocking them out at higher intensities occasionally resulted in less cohesive outputs overall. This indicates that moral circuits do not operate in perfect isolation but are woven into the greater tapestry of linguistic competence.

Limitations & Future Research

Although these preliminary results highlight the possibility of pinpointing and influencing moral circuits in large language models, there are a few limitations. Despite being carefully crafted, the dataset is finite and may not reflect the full cultural and contextual richness of real-world moral judgments. Nor do the identified neuron clusters necessarily capture the entire range of ways a language model learns moral or ethical themes. It is possible that more advanced tools, including circuit-level analysis or multi-step interventions that go beyond toggling individual neurons, will reveal deeper and more complex moral circuitry.

These findings suggest a new direction for designing safer, more trustworthy AI. Instead of broad, system-wide fine-tuning, system developers may be able to employ targeted interventions that strengthen or attenuate specific moral signals. This could allow for more refined calibration of AI behavior, particularly in high-stakes scenarios such as medical triage or law enforcement support. Further research, however, is essential to deepen our understanding of how reliably these neurons generalize to nuanced moral questions, ambiguous contexts, and culturally specific forms of ethical reasoning.

Our study provides a stepping stone toward comprehending the inner workings of moral processing within large language models. By revealing how certain groups of neurons can become specialized for different moral principles and by demonstrating that directly manipulating these neurons can steer the model’s ethical alignment, we hope to inspire both caution and optimism among AI practitioners. There is a need for ongoing dialogue between ethicists, engineers, and domain experts to ensure that these findings support the responsible deployment of LLM-based applications. The preliminary nature of these results leaves much to explore, but it also offers a glimpse of how sophisticated and targeted moral alignment might become as interpretability techniques evolve.

As we continue refining these methods, expanding our dataset, and testing the approach across even more diverse models, we expect to learn more about how moral constructs weave into the fabric of AI language generation. We look forward to sharing insights, methods, and refined experiments at the 27th International Conference on Human-Computer Interaction in June 2025.
