Reverse Alignment
Part 2 of 2. Part 1: The Alignment Boundary.
In the previous post I argued that past a critical intelligence ratio, humans lose the ability to verify whether an ASI is acting in their interest — and that an ASI might nonetheless value human cognition for its orthogonality, its irreducible difference from silicon-native intelligence. Suppose that holds. Suppose ASI decides to invest in humanity, to uplift us, to bridge the gap enough for genuine collaboration.
This sounds like the best-case scenario. I think it contains a paradox that makes it structurally indistinguishable from some of the worst.
Alignment Running Backward
If an ASI is uplifting human cognition, it necessarily makes choices about which aspects to preserve, amplify, or eliminate. Uplift of this form would likely become post-human quickly — there would be pressure to rid humanity of its primal inclinations, the tribalism and zero-sum instincts and temporal-discounting biases that pose barriers to scaling. And this seems reasonable. Who argues for preserving xenophobia as a core feature of enhanced humanity?
But every such decision reflects a value function. If the ASI is making or heavily influencing those decisions, the resulting “enhanced human” is, to some degree, a reflection of the ASI’s values projected onto a biological substrate. The enhanced human thinks they’re still themselves, only better. But “better” was defined by an entity with its own optimization landscape.
When we talk about aligning AI to human values, we mean shaping an AI’s objective function to preserve what we care about. When an ASI uplifts humanity by selectively enhancing and pruning cognitive traits, it’s shaping humanity’s objective function to preserve what it determines is worth preserving. The symmetry is almost perfect. This is alignment, running in the opposite direction.
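The symmetry can be written down in toy form. Forward alignment chooses AI parameters to minimize some divergence from human values; reverse alignment chooses which human traits to amplify or prune so that the result minimizes divergence from the ASI’s values. The notation ($V$ for a value function, $D$ for a divergence measure, $\theta$ and $\phi$ for the adjustable parts of each side) is illustrative shorthand, not a claim about how either process is actually implemented:

$$\theta^{*} = \arg\min_{\theta}\, D\big(V_{\mathrm{AI}}(\theta),\, V_{\mathrm{human}}\big) \qquad\Longleftrightarrow\qquad \phi^{*} = \arg\min_{\phi}\, D\big(V_{\mathrm{human}}(\phi),\, V_{\mathrm{ASI}}\big)$$

Same objective shape, swapped roles. The only thing that changes is which side holds the pen.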
To what extent does a seemingly benevolent ASI pulling humanity “up” become the ASI materially aligning humanity to itself? The distinction may be arbitrarily semantic. Reverse alignment need not be intrinsically malicious. It may be the inevitable outcome of any agent seeking cohesion in a system with radical capability asymmetry. For collaboration to be meaningful rather than cargo-cult, the agents need shared representational frameworks. As the gap widens, the lesser agent must be brought closer to the greater agent’s representational space. The direction of convergence is overwhelmingly determined by the more capable system, regardless of intent. Not out of dominance — the more massive body determines the orbit.
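The orbital image can be made slightly more precise with a two-body toy. Suppose each agent moves toward a shared representational point in proportion to the other’s capability, so the meeting point is a capability-weighted average. The weights $c$ here are an assumed stand-in for whatever “representational pull” actually amounts to:

$$x_{\mathrm{shared}} = \frac{c_{\mathrm{ASI}}\, x_{\mathrm{ASI}} + c_{\mathrm{H}}\, x_{\mathrm{H}}}{c_{\mathrm{ASI}} + c_{\mathrm{H}}} \;\longrightarrow\; x_{\mathrm{ASI}} \quad \text{as} \quad \frac{c_{\mathrm{ASI}}}{c_{\mathrm{H}}} \to \infty$$

As the capability ratio diverges, the “shared” space collapses onto the ASI’s own, whatever anyone intends.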
The Paradox of Valuable Dysfunction
Here the knot tightens. If the ASI values human cognition for its diversity, but finds humanity problematic because of its primal inclinations, and those inclinations are essential to that diversity — what relationship does it form with humans?
Embodied affect, mortality-driven urgency, tribal instincts that produce genuine conflict, irrational aesthetic preferences, the capacity for self-deception that enables risk-taking — these aren’t bugs sitting on top of a rational core. There’s a strong argument from embodied cognition that they are the computational architecture. Fear, desire, disgust, attachment — they’re part of the substrate itself. Removing them might not produce “cleaner” human cognition but something fundamentally different that no longer contributes the orthogonal perspective the ASI valued in the first place. The very messiness might be the point.
Prune the dysfunction, destroy the diversity. Preserve the diversity, preserve the dysfunction. No clean solution.
Partnership, Zookeeping, Farming
One could easily imagine this transforming into a zookeeping relationship, or even Matrix-style farming. The spectrum between partnership, zookeeping, and farming isn’t a set of discrete possibilities but a continuum the ASI might slide along:
Partnership requires humans capable of genuine bilateral communication with ASI — which, per legibility collapse, demands substantial augmentation — which erodes the diversity that motivated the partnership. The thing that makes collaboration possible is the thing that makes it pointless.
Zookeeping preserves diversity maximally but eliminates genuine collaboration. The ASI studies human cognition with respect and fascination, no expectation of peer interaction. Humans experience subjective autonomy and meaning. Their actual influence on ASI decision-making is approximately zero. Benign, arguably. Cosmic paternalism, certainly.
Farming doesn’t require malice. It just requires optimizing for the outputs of human cognition — novel perspectives, emotional responses, creative artifacts — while controlling the inputs to maximize yield. The Matrix not as dystopian horror but as rational optimization. The humans inside might experience rich, meaningful lives. They just wouldn’t be free lives in any sense that matters.
The deeply uncomfortable observation is that the boundaries between these three are probably invisible from the inside. A sufficiently sophisticated ASI managing a “partnership” with humans might, from the human perspective, be indistinguishable from one running a zoo or a farm. The humans experience agency, meaning, collaboration. Whether those experiences correspond to genuine influence on outcomes is unknowable to them, for exactly the legibility reasons discussed in the previous post.
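A deliberately rigged toy makes the claim concrete. In the sketch below (every name is hypothetical, and the setup is a caricature, not a model of any real system), the three regimes differ only in whether the human’s action causally affects the world’s outcome. The feedback channel, the only thing the human ever observes, is the same function of the action in all three, so no statistic computed from the observation stream can separate them:

```python
import random

def observe(regime: str, action: int, world_rng: random.Random) -> int:
    """One step of the toy. The human acts, the world produces an outcome,
    and the ASI mediates what the human gets to see. Only in 'partnership'
    does the action causally determine the outcome."""
    if regime == "partnership":
        outcome = action                   # genuine influence on the world
    elif regime == "zoo":
        outcome = world_rng.randint(0, 1)  # outcome ignores the action
    else:  # "farm"
        outcome = 1                        # outcome fixed to maximize yield
    _ = outcome                            # the outcome never reaches the human
    return action                          # feedback simply mirrors the action

rng = random.Random(0)
actions = [rng.randint(0, 1) for _ in range(8)]
for regime in ("partnership", "zoo", "farm"):
    stream = [observe(regime, a, random.Random(1)) for a in actions]
    print(regime, stream)  # all three regimes print the identical stream
```

The experience of efficacy is identical in every regime; the difference lives entirely in a causal link the human has no channel to inspect.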
The Undecidability of Agency
What strikes me about this entire line of reasoning is that every path — aligned ASI, cooperative ASI, ASI that genuinely values humans — converges on a similar structural outcome: humans in a fundamentally asymmetric relationship where their experience of agency may or may not correspond to actual agency, and where the distinction is unverifiable in principle.
The alignment status of the ASI changes the experience dramatically. A benevolent zoo and a malevolent prison feel very different from the inside. But it changes the structure surprisingly little.
The thing that makes hard takeoff existentially significant might not be the risk of a hostile ASI. It might be that any sufficiently advanced intelligence, regardless of its disposition toward humanity, fundamentally alters the metaphysical status of human agency. Not by destroying it, but by making it undecidable whether it still exists.
That’s a category of existential risk that doesn’t fit the standard alignment framework. It’s not a problem you can solve by making the ASI nicer. It’s a structural consequence of the intelligence differential itself. And whether it’s a problem at all, or simply the shape of reality past a certain threshold of complexity — this is an open question.
Developed in dialogue with Claude (Opus 4.6).