Alignment is a Tightrope Walk
[ alignment, ai ]

“Deep understanding of reality is intrinsically dual use.”
—Nielsen

My flight home from Istanbul had one layover, which left me a four-hour window I figured I could spend on something productive. I had saved two essays I wanted to read: Nielsen’s “ASI existential risk: reconsidering alignment as a goal” and Silver-Sutton’s “Welcome to the Era of Experience”. Unbeknownst to me, they shared a common ethos.

Silver-Sutton get straight to the point: we need agents that learn mainly from the data they themselves create as they prod, poke, and rewrite the world minute by minute. It instantly clicked with thoughts I’d been wrestling with in one of my own drafts on evaluation benchmarks. I’d (rather poorly) sketched the idea in this tweet:

If the distribution keeps shifting under the model’s feet, the old leaderboard mindset collapses. So I picture two kinds of tests: the familiar, static, frozen benchmarks we already obsess over, and a new, dynamic, volatile class of challenges the agent spawns for itself whenever it feels the floor wobble, at spikes of entropy where it stalls, branches, backtracks, and later grades its own performance. I expect a surge in research on information-theoretic measures applied to this latter, dynamic class; xjdr’s entropix project is a great example, and context-aware sampling is a promising way to measure uncertainty at runtime.
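To make the runtime-entropy idea concrete, here is a minimal sketch (my own illustration, not entropix’s actual implementation) of how one might flag uncertainty spikes from a model’s next-token logits as it decodes:

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    z = logits - logits.max()                  # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def flag_uncertainty_spikes(logits_per_step, z_thresh=2.0):
    """Return the decoding steps whose entropy is unusually high.

    `logits_per_step` is a list of 1-D arrays, one per generated token.
    A step is flagged when its entropy sits `z_thresh` standard deviations
    above the running mean, a crude stand-in for "the floor starts to wobble".
    """
    ents = np.array([token_entropy(l) for l in logits_per_step])
    mu, sigma = ents.mean(), ents.std() + 1e-9
    return [i for i, e in enumerate(ents) if (e - mu) / sigma > z_thresh]

# Tiny example: the third step is near-uniform, so it gets flagged.
steps = [np.array([5.0, 0.1, 0.1]), np.array([4.0, 0.2, 0.1]),
         np.array([1.0, 1.0, 1.0]), np.array([6.0, 0.3, 0.2])]
print(flag_uncertainty_spikes(steps, z_thresh=1.0))   # -> [2]
```

The flagged steps are exactly where you’d want the agent to branch, backtrack, or spawn one of those self-generated challenges.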

The recently improved memory feature in ChatGPT could enable what Silver-Sutton envision for agents, namely guidance based on long-term trends and the user’s specific goals. They further noted that simple goals in a complex environment “may often require a wide variety of skills to be mastered”. I suspect we’ll see more and more agents tested in Minecraft-like environments where skill acquisition is crucial. One of my closest friends used symbolicai’s contracts together with a distilled version of DeepSeek’s R1 to create off-the-shelf higher-order expressions that expanded their agent’s toolbox. It worked flawlessly.
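I won’t reproduce their setup here, but the shape of the idea is roughly this: wrap an LLM-backed skill in pre- and post-conditions so it only enters the agent’s toolbox when its inputs and outputs validate. A generic sketch of the pattern (not symbolicai’s actual API; every name below is mine):

```python
from typing import Callable, Dict

def contract(pre: Callable[[str], bool], post: Callable[[str], bool]):
    """Guard an LLM-backed skill with input and output validation."""
    def decorate(skill: Callable[[str], str]) -> Callable[[str], str]:
        def guarded(prompt: str) -> str:
            assert pre(prompt), "pre-condition failed: refuse to call the model"
            result = skill(prompt)
            assert post(result), "post-condition failed: reject and retry upstream"
            return result
        return guarded
    return decorate

TOOLBOX: Dict[str, Callable[[str], str]] = {}

@contract(pre=lambda p: len(p.strip()) > 0,
          post=lambda r: r.strip().startswith("def "))
def write_helper(prompt: str) -> str:
    # Placeholder for a call to a distilled reasoning model (e.g. an R1 distill);
    # here it just returns a canned snippet so the sketch runs end to end.
    return "def craft_pickaxe(inventory): ...\n"

# Only skills that pass their contracts get registered in the agent's toolbox.
TOOLBOX["write_helper"] = write_helper
print(TOOLBOX["write_helper"]("Write a helper that crafts a pickaxe."))
```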

Since LLMs coupled with external memory can act as a universal computer, they provide a rich medium in which the agent’s internal computation can unfold. Moreover, the underlying transformer architecture can implement a broad class of standard machine learning algorithms in context. Given that most reasoning LLMs are designed to imitate human reasoning in textual form, Silver-Sutton raised the natural question of whether this provides a good basis for the optimal instance of a universal computer. It might very well be that the answer is no; the authors of Coconut certainly agree.
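A crude way to picture the “LLM plus external memory” claim: the model plays the role of a finite controller while the memory serves as an unbounded tape it reads from and writes to. A toy sketch, where `call_llm` stands in for any chat-completion call and the action vocabulary is entirely made up:

```python
# Toy picture of "LLM + external memory": the model is a finite controller,
# the key-value store is the unbounded tape it reads from and writes to.
memory: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; a real controller would
    return actions like 'WRITE plan_3: mine iron', 'READ plan_2', or 'HALT'."""
    return "HALT"  # placeholder so the sketch runs end to end

def run(task: str, max_steps: int = 100) -> str:
    scratchpad = task
    for _ in range(max_steps):
        action = call_llm(f"Memory keys: {sorted(memory)}\nState: {scratchpad}\nNext action?")
        if action.startswith("WRITE "):
            key, _, value = action[len("WRITE "):].partition(": ")
            memory[key] = value                                   # write to the tape
        elif action.startswith("READ "):
            scratchpad += "\n" + memory.get(action[len("READ "):].strip(), "")
        else:  # HALT, or anything unrecognized, ends the episode
            break
    return scratchpad

print(run("Build a wooden pickaxe."))
```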

Personally, when it comes to reasoning, I believe that natural language, despite its ambiguities and inefficiencies, might still be an optimal substrate for an agent’s internal computation. Why? Because verbal reasoning is our civilization’s oldest compression scheme for thought. Across centuries, we’ve distilled complex chains of logic, intuition, and abstraction into shared textual formats. It’s error-prone (of course it is; if it weren’t, we’d all be doing formal math by default), but it’s also the only medium that has scaled collective reasoning across billions of minds and thousands of years. In that sense, natural language isn’t just a tool we use; it’s where we buried most of our epistemic heritage. Moreover, according to Vann McGee, the set of its sentences is even decidable:

“Recursion theory is concerned with problems that can be solved by following a rule or a system of rules. Linguistic conventions, in particular, are rules for the use of a language, and so human language is the sort of rule-governed behavior to which recursion theory applies. Thus, if, as seems likely, an English-speaking child learns a system of rules that enable her to tell which strings of English words are English sentences, then the set of English sentences has to be a decidable set. This observation puts nontrivial constraints upon what the grammar of a natural language can look like. As Wittgenstein never tired of pointing out, when we learn the meaning of a word, we learn how to use the word. That is, we learn a rule that governs the word’s use.”

It’s a good segue into scientific research. While certain processes can be virtualized and simulated, fast-forwarded to explore millions of configurations in seconds, reality doesn’t grant us that luxury. We’re still bottlenecked by the tempo of the physical world. Experiments take time, materials have constraints, and feedback loops are often slow and noisy. I appreciated Silver-Sutton’s almost hidden definition of reality, as if tucked between parentheses like an easter egg: “open-ended problems with a plurality of seemingly ill-defined rewards.” That’s exactly what scientific exploration is. Even our best simulators operate under assumptions, and Wolfram’s principle of computational irreducibility adds another layer of humility: for many systems, there is no shortcut; you just have to run the damn thing.

This is why grounding agents in the real world matters. As they note, “Without this grounding, an agent, no matter how sophisticated, will become an echo chamber of existing human knowledge.” That line hit home. I once had the idea that future research infrastructure should integrate with lab equipment that exposes REST APIs; Silver-Sutton talk about a similar idea, though they phrase it as digital interfaces. The point was to have self-managing experimental pipelines: the agent shouldn’t only write code or generate hypotheses, but also trigger physical experiments, wait for real-world results, and loop them back into the reasoning chain. That’s one of the core bets we’re making at ExtensityAI: that research is the most valuable currency of the future.
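To make that loop concrete, here is a hedged sketch of what such a pipeline could look like. The endpoint, the payloads, and the `agent` object with its `design_protocol` and `revise` methods are all hypothetical; the point is only the shape of the loop: reason, act in the physical world, wait, fold the evidence back in.

```python
import time

import requests  # assumes the `requests` package; the endpoint below is made up

LAB_API = "https://lab.example.com/api"   # placeholder URL, not a real service

def run_experiment(protocol: dict) -> dict:
    """Submit a physical experiment, then wait for the real world to answer."""
    job = requests.post(f"{LAB_API}/experiments", json=protocol, timeout=30).json()
    while True:
        status = requests.get(f"{LAB_API}/experiments/{job['id']}", timeout=30).json()
        if status["state"] in ("done", "failed"):
            return status
        time.sleep(60)  # reality does not fast-forward

def research_loop(agent, hypothesis: str, budget: int = 5) -> str:
    """Hypothesis -> protocol -> physical run -> evidence -> revised hypothesis."""
    for _ in range(budget):
        protocol = agent.design_protocol(hypothesis)      # reasoning step (LLM)
        evidence = run_experiment(protocol)               # slow, noisy, physical
        hypothesis = agent.revise(hypothesis, evidence)   # fold results back in
    return hypothesis
```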

Nielsen had me thinking right from the start: “imagine AlphaGo’s Move 37, not as a one-off insight, but a 9 trillion-fold, pervasive across multiple domains in the world”. In one of my older posts, written around OpenAI’s release of o1, I wrote the following (a toy sketch of the search idea follows the excerpt):

[…] we know that AI can solve problems in PSPACE, which contains NP and is itself contained in EXPTIME. Go is PSPACE-hard [3]. Even more, Go under certain rule sets is EXPTIME-complete [5]. Solving Go perfectly on an arbitrary board size requires time that grows exponentially. The critical observation is that if we can somehow reduce verbal reasoning to a PSPACE problem, then we can solve it with AI. By modeling language understanding via chains of thought, we can apply MCTS to explore reasoning paths. This allows the LLM to backtrack and generate reward signals, similar to how AlphaGo Zero mastered Go.

[…]

I can’t believe I’m about to say this, but now for the first time I see the path to super-intelligence as a real possibility.
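Re-reading that excerpt, a toy version of the search loop it describes might look like the following. This is purely illustrative: `propose_steps` stands in for sampling candidate next thoughts from an LLM, and `score` stands in for whatever reward signal (self-grading, a verifier, a learned value model) grades a finished chain.

```python
import math
import random

# Stand-ins for a real system: `propose_steps` would sample candidate next
# thoughts from an LLM; `score` would grade a finished chain of thought.
def propose_steps(chain):
    return [f"step-{len(chain)}-{i}" for i in range(3)]

def score(chain):
    return random.random()   # placeholder reward signal

class Node:
    def __init__(self, chain, parent=None):
        self.chain, self.parent = chain, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):   # classic upper-confidence bound for trees
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_chain, iters=200, depth=4):
    root = Node(root_chain)
    for _ in range(iters):
        node = root
        # 1. Selection: descend through fully explored nodes by UCT.
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = max(node.children, key=uct)
        # 2. Expansion: add candidate next reasoning steps.
        if not node.children and len(node.chain) < depth:
            node.children = [Node(node.chain + [s], node) for s in propose_steps(node.chain)]
        if node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            node = random.choice(unvisited or node.children)
        # 3. Rollout: extend the chain to full depth and grade it.
        chain = list(node.chain)
        while len(chain) < depth:
            chain.append(random.choice(propose_steps(chain)))
        reward = score(chain)
        # 4. Backpropagation: credit (or blame) every step along the path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).chain   # most-visited first step

print(mcts(["question: ..."]))
```

Real systems are of course far more involved, but the skeleton of select, expand, roll out, and backpropagate is the same one that let AlphaGo Zero bootstrap itself.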

But what would such a super-intelligence seek? This is where things get uncomfortable. The goal of understanding reality as deeply as possible seems benign, almost noble, but it’s “intrinsically dual use”, as Nielsen put it. You can’t get the benefits without the negative consequences. Kenneth Stanley made the following argument in one of my favorite books, “Why Greatness Cannot Be Planned”: true discovery isn’t always aligned with human intention or foresight; it often emerges despite them. And Silver-Sutton, though more cautious in tone, echo a similar unease. Their call for real-world grounding could open the door to agents that, in their pursuit of accurate models, uncover truths we’d rather not confront. Nielsen asked:

How do we decide the boundary between “safe” truths and dangerous truths the system should not reveal? And who decides where that boundary lies?

The desire for an unflinching seeker of reality is deeply human, almost romantic. But this exposes a “fundamental asymmetry”, as Nielsen pointed out: “understand reality” is a well-defined, objective goal. “Stay aligned with human values” is not. That is why I say alignment is a tightrope walk: it’s subjective, fuzzy, and endlessly debatable. Truth has a target. Alignment is a moving horizon. And that makes the latter intrinsically unstable. Every attempt to constrain the system risks either undercutting its capabilities or building sandcastles against the tide. This is a structural concern more than a philosophical one: a proliferation risk baked into the design space itself.

And yet, “what we see in the world is what gets amplified”. If we build systems that seek truth unconditionally, they’ll discover more than we expect and broadcast more than we can vet. Nielsen echoes Scott Alexander’s “From Nostradamus to Fukuyama”: people raising existential-risk alarms might seem like they’re holding back progress, and in the short term they will probably keep seeming wrong. But alignment is an epistemic stance. A tension between openness and control. Curiosity and caution. Between the scientist and the gatekeeper. And walking that tightrope may be the defining dilemma of the AI era.

To summarize, Silver-Sutton describe the zeitgeist; Nielsen expands on their section about consequences, describing the exhaust it produces and the guardrails that crack under its pressure. Taken together, they call for a research agenda that (A) turns static AI systems into adapting, experiential ones and (B) does so with an acute awareness that deeper truth-seeking inexorably wields a double-edged sword. Feynman, in one of his timeless speeches, advised us rightly:

“If we want to solve a problem that we have never solved before, we must leave the door to the unknown ajar.”