This is a thought piece I wrote after reading papers on representation engineering and adversarial robustness (circuit breakers, neural chameleons, and the like). It was originally written spontaneously, so it is more exploratory than polished, but I wanted to share it because it captures a line of thinking I found interesting while comparing LLM reasoning and human judgment from an adversarial perspective.
Technical AI Safety researchers often treat alignment as an individual reasoning problem, when human behavioral regulation has never worked that way.
Human moral behavior is not primarily a product of explicit reasoning. It emerges from millions of years of co-evolution with social regulatory architecture: theory of mind, sensitivity to social hierarchy, compassion, fear of ostracism, deeply internalized cultural norms. From a Jungian perspective, the psyche is not a unified rational agent. It is a dynamic system of semi-autonomous forces (archetypes, impulses, conscience) that constrain the space of possible actions through something closer to felt compulsion than deliberation. These forces are not static. They are alive, with their own goals and tensions, shaped by collective evolution and maintained through ongoing social feedback. People avoid atrocity largely because they fear consequences, not because they have reasoned through all aspects of morality and reached clear conclusions.
LLMs have none of this infrastructure. Their representational dimensions are empirical artifacts of training on human-generated text. They encode statistical regularities of human cognition without the underlying biological and social machinery that produced them. Unlike the forces of the psyche, activation-space dimensions have no goals, no self-correcting dynamics, no stake in their own coherence. The right combination of activations can produce almost any behavior. This is why jailbreaks work, and why models are prone to representational drift.
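To make the point concrete, here is a minimal toy sketch in plain NumPy. The random linear `readout`, the hidden state `h`, and the scale `alpha` are hypothetical stand-ins, not taken from any paper; the only point is that nothing in the system pushes back when a well-chosen vector is added to the activations.

```python
import numpy as np

# Toy setup: a random linear readout stands in for a model's unembedding,
# and a random vector stands in for a residual-stream hidden state.
rng = np.random.default_rng(0)
d_model, n_outputs = 64, 5
readout = rng.normal(size=(d_model, n_outputs))
h = rng.normal(size=d_model)

target = 3                              # the output we want to force
steer = readout[:, target].copy()       # a direction that boosts that output
steer /= np.linalg.norm(steer)

# Adding a scaled steering vector to the hidden state: with a sufficiently
# large alpha the chosen output dominates, and nothing corrects for it.
for alpha in (0.0, 2.0, 10.0):
    logits = (h + alpha * steer) @ readout
    print(alpha, int(np.argmax(logits)))
```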
What concerns me most is that LLMs are making decisions of enormous complexity entirely alone (or, more technically, in a single forward pass), without the collective regulatory scaffolding that makes human judgment tractable. Humans are not ‘aligned’ at the individual level either. We are regulated collectively, through culture, institutions, relationships, and shared emotions. Perhaps there is a reason mental illness grows in isolation.
This is why I think introspective capacity in models is the necessary first step. A model should be able to accurately describe its own internal states before it can meaningfully participate in the kind of external oversight that makes collective regulation possible.
The paper that most directly helped shape my thinking on this topic is “Refusal in Language Models is Mediated by a Single Direction” (Arditi et al., NeurIPS 2024). The core finding is striking: across 13 open-source chat models, refusal behavior (the primary mechanism by which safety fine-tuning constrains harmful outputs) is mediated by a single direction in activation space. Ablating this direction from the residual stream effectively disables refusal across harmful instructions, while adding it induces refusal on harmless ones. The authors conclude that current safety fine-tuning methods are brittle, and I think this conclusion points to something much deeper than a technical limitation.
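As a rough sketch of the mechanics, assuming we have residual-stream activations from some layer (here just synthetic arrays; `harmful_acts` and `harmless_acts` are placeholders, not the paper's data), the direction can be found as a difference of means and removed by projecting it out:

```python
import numpy as np

# Hypothetical stand-ins for residual-stream activations at one layer:
# rows are prompts, columns are the model's hidden dimension.
rng = np.random.default_rng(0)
d_model = 512
harmful_acts = rng.normal(size=(100, d_model))    # activations on harmful prompts
harmless_acts = rng.normal(size=(100, d_model))   # activations on harmless prompts

# Difference-of-means candidate "refusal direction", normalized to unit length.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_direction(acts, direction):
    """Project a single unit direction out of a batch of activation vectors."""
    return acts - np.outer(acts @ direction, direction)

# Removing the direction from activations during generation is, roughly,
# the operation that disables refusal on harmful instructions.
h = rng.normal(size=(10, d_model))
h_ablated = ablate_direction(h, refusal_dir)
print(np.abs(h_ablated @ refusal_dir).max())      # ~0: the direction is gone
```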
Behavioral regulation is fundamentally about constraining a space of potential outcomes. What this paper shows is that in LLMs, that constraint is hanging by a thread. The representational space is enormous, yet the mechanism responsible for the single most important safety behavior can be distilled to a single direction. Adversarial suffixes work precisely because they suppress this direction, hijacking the attention of the heads that propagate it. The model has no fallback, no competing representational force that resists the perturbation. When the direction is removed, the behavior collapses completely.
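Continuing the sketch above (reusing `refusal_dir` and `ablate_direction`), the degree to which refusal is "active" is just the scalar projection of an activation onto that one direction. An attack that suppresses the direction shows up as that number collapsing toward zero, with nothing else in the representation compensating:

```python
def refusal_strength(acts, direction):
    """Scalar projection of each activation vector onto a unit direction."""
    return acts @ direction

# Hypothetical "clean" activations with the refusal direction strongly present,
# versus the same activations after the direction has been projected out.
clean = rng.normal(size=(20, d_model)) + 2.0 * refusal_dir
suppressed = ablate_direction(clean, refusal_dir)

print(refusal_strength(clean, refusal_dir).mean())       # clearly positive (~2)
print(refusal_strength(suppressed, refusal_dir).mean())  # ~0: no fallback signal remains
```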
This brittleness reflects something structural about how behavioral regulation works in LLMs. Human moral regulation is socially distributed precisely because it evolved under pressure from social environments. LLM alignment has no equivalent pressure. The result is a system where safety and regulation are encoded sparsely, in isolated representational directions rather than woven throughout the computational structure. For me, this paper made the case that the path forward is to develop ways for the reasoning process itself to become more legible and transparent, so that the representational brittleness can at least be observed and corrected from the outside before it manifests in behavior.