
“It is not knowledge, but the act of learning, not possession but the act of getting there, which grants the greatest enjoyment. When I have clarified and exhausted a subject, then I turn away from it, in order to go into darkness again. The never-satisfied man is so strange; if he has completed a structure, then it is not in order to dwell in it peacefully, but in order to begin another.”
—C. F. Gauss
My sleep routine consists of unwinding by reading just before hitting the pillow. Mostly hard sci-fi. I like feeding my subconscious a soup of ideas. It’s funny how the softest parts of our day can incubate the most radical thoughts. In that regard, Greg Egan is easily one of my favorite authors. Permutation City left its mark on me. I read it twice, most recently in 2021. This was my review of it:
The sheer thought of learning a truth that pervades the confinement of your experience, percolating beyond the vastness of your scientific enterprise, was a subtle and cogent exposition of the nomological potency of underlying mathematical structures. Peering through the pages of Greg right into the litany of Plato felt almost vestigial.
It probably wasn’t Greg’s intention for the book to be a treatise on the nature of metaphysics. Nonetheless, I was struck by the vivid storytelling. For the first time I viscerally understood what metaphysics meant. It’s quite subtle, and a handful of philosophers otherwise overlooked by my own apathy toward the subject started to gain traction in my mind. Given that experience, I figured it wouldn’t be a bad idea to pick up something else Greg wrote. And so, I’ve recently started reading Diaspora.
Less than a quarter into Diaspora, Greg managed to uncoil a thought I didn’t even know was waiting. Yatima, the orphan supercomputer and protagonist of the story, started contemplating mathematical truths in the Truth Mines. Among them, there was a reference to the Gauss-Bonnet theorem, and it brought back memories from my college days. I had to study for an analysis exam and, in typical Leo fashion, I started going down orthogonal rabbit holes. I was quite fascinated by Riemann at the time and was reading about Riemannian geometry far more than it benefited my already poor grades.
I won’t go into too much detail about what the Gauss-Bonnet theorem has to say. For my purposes, it’s sufficient to say the following.
Think of a sphere. While it’s a 3D object, mathematicians often focus on its surface, which is a 2D shape called a 2-sphere (denoted \(S^2\)). This surface has curvature, and the Gauss-Bonnet theorem elegantly links its total curvature to a topological property called the Euler characteristic, which distinguishes shapes like spheres from donuts or pretzels.
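For the curious, here is the statement for a compact surface \(M\) without boundary:

\[
\int_M K \, dA = 2\pi \, \chi(M),
\]

where \(K\) is the Gaussian curvature, \(dA\) is the area element, and \(\chi(M)\) is the Euler characteristic. Since \(\chi(S^2) = 2\), the total curvature of any sphere-like surface is always \(4\pi\), no matter how much you dent or stretch it; for a donut, \(\chi = 0\), so the positive and negative curvature cancel out exactly.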
At every point on the sphere’s surface, there’s an abstract 2D plane called a tangent plane. If you are standing on the sphere, this plane represents all the directions you could walk in from that point.
To study shapes like this, mathematicians use manifolds, which are spaces that locally resemble flat (Euclidean) space near every point. A manifold doesn’t inherently know about distances or angles; it’s a flexible structure that captures how a shape is “glued together” topologically.
The tangent plane acts as a flat approximation of the sphere’s surface at each point. If you zoom in extremely close, the surface seems almost flat, much like how Earth feels flat beneath your feet. To measure distances or angles, you need an extra tool: a metric. For a sphere, mathematicians often borrow this metric from the 3D space it’s embedded in, which lets us calculate things like the shortest path between two points (hint: it’s a segment of a “great circle”).
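Concretely, for a sphere of radius \(R\), the metric inherited from the surrounding 3D space reads, in the usual spherical coordinates,

\[
ds^2 = R^2 \left( d\phi^2 + \sin^2\!\phi \, d\theta^2 \right),
\]

and the shortest path between two points \(p\) and \(q\) on the surface has length \(d(p, q) = R \arccos\!\left(\frac{p \cdot q}{R^2}\right)\): an arc of the great circle passing through both.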
The geometry of the sphere can be studied by examining all its tangent planes collectively, which is a structure called the tangent bundle. Understanding how these 2D planes connect and vary across the surface unravels the sphere’s geometry.
With that in mind, and hyped for coding, I thought: wouldn’t it be cool if I could find a way to actually use some of that theory in an applied setting? I didn’t have to wait too long before I imagined that maybe I could visualize text as a projection onto a sphere. I thought, “Well, OpenAI gives me logprobs, for a maximum of 20 candidate tokens per generated token.” Then I realized I could use those logprobs to compute the entropy per token and visualize it as a bumpy sphere whose bumps are given by the entropy of each token. At this point I had enough to start coding something.
The leap from abstract manifolds to LLMs might seem counterintuitive, but here’s the intuition: when the model hesitates between “perhaps” (high entropy) and “certainly” (low entropy), it’s navigating a landscape. My idea is to essentially render this terrain, choosing the sphere as my canvas. The full code of what I’m about to show is available as a gist here. Let me briefly explain how it works.
For each generated token, I compute its entropy from the top 20 logprobs (since that’s the maximum OpenAI offers). Each token in a sequence is then assigned spherical coordinates. The azimuth \(\theta\) (\(0 \leq \theta < 2\pi\)) encodes the token’s position in the sequence, stretched around the circle. The polar angle \(\phi\) (\(0 \leq \phi \leq \pi\)) encodes the token’s probability: as you trace a path from one pole to the other, you move from regions of lower probability (north pole) to regions of higher probability (south pole). Lastly, the radius \(r\) (\(r \geq 0\)) is the base sphere’s radius inflated by the token’s entropy. This gives a “bumpiness” proportional to the token’s entropy: high entropy becomes mountainous regions on the sphere; low entropy becomes smoother plains.
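The gist has the full details, but a stripped-down sketch of those two steps might look like the following. The dictionary shape of the logprobs, the base radius, and the bump scale are illustrative simplifications on my part, not the exact choices from the gist.

```python
import numpy as np

def token_entropy(top_logprobs: dict[str, float]) -> float:
    """Shannon entropy (in nats) of a token's candidate distribution.

    `top_logprobs` maps candidate tokens to log-probabilities (at most 20
    entries from the API), so this is a truncated estimate of the true entropy.
    """
    probs = np.exp(list(top_logprobs.values()))
    probs = probs / probs.sum()  # renormalize over the observed candidates
    return float(-np.sum(probs * np.log(probs)))

def token_coordinates(probs, entropies, base_radius=1.0, bump_scale=0.5):
    """Map a token sequence to spherical coordinates (theta, phi, r).

    probs     : probability of each chosen token, in [0, 1]
    entropies : entropy of each token's candidate distribution
    """
    n = len(probs)
    # Azimuth: the token's position in the sequence, stretched around the circle.
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    # Polar angle: low probability near the north pole (phi = 0),
    # high probability near the south pole (phi = pi).
    phi = np.asarray(probs) * np.pi
    # Radius: base sphere inflated by entropy -> the "bumpiness".
    r = base_radius + bump_scale * np.asarray(entropies)
    return theta, phi, r
```

Renormalizing over only the observed top-20 candidates means the entropy is a truncated estimate of the true one, which is good enough for a visualization.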
To compute the Gaussian curvature, I found an excuse to use (unoptimized) JAX for automatic differentiation, calculating the first and second derivatives of the parametric surface with respect to \(\theta\) and \(\phi\). From those I build the first and second fundamental forms and take the Gaussian curvature as the ratio of their determinants, \(K = \det(\mathrm{II}) / \det(\mathrm{I})\).
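As a rough sketch, and assuming the surface is given as a function mapping \((\theta, \phi)\) to a 3D point (the gist differs in the details), the computation in JAX looks something like this:

```python
import jax
import jax.numpy as jnp

def gaussian_curvature(surface, theta, phi):
    """Gaussian curvature of a parametric surface at (theta, phi).

    `surface(theta, phi)` must return a 3D point built from jax.numpy ops,
    so that jax.jacfwd can differentiate it.
    """
    # First derivatives of the embedding.
    X_t = jax.jacfwd(surface, argnums=0)(theta, phi)
    X_p = jax.jacfwd(surface, argnums=1)(theta, phi)
    # Second derivatives.
    X_tt = jax.jacfwd(jax.jacfwd(surface, argnums=0), argnums=0)(theta, phi)
    X_tp = jax.jacfwd(jax.jacfwd(surface, argnums=0), argnums=1)(theta, phi)
    X_pp = jax.jacfwd(jax.jacfwd(surface, argnums=1), argnums=1)(theta, phi)

    # Unit normal to the surface.
    n = jnp.cross(X_t, X_p)
    n = n / jnp.linalg.norm(n)

    # First fundamental form (E, F, G) and second fundamental form (L, M, N).
    E, F, G = X_t @ X_t, X_t @ X_p, X_p @ X_p
    L, M, N = X_tt @ n, X_tp @ n, X_pp @ n

    # Gaussian curvature as the ratio of determinants.
    return (L * N - M * M) / (E * G - F * F)

# Sanity check: the unit sphere should have K = 1 everywhere.
def unit_sphere(theta, phi):
    return jnp.array([jnp.sin(phi) * jnp.cos(theta),
                      jnp.sin(phi) * jnp.sin(theta),
                      jnp.cos(phi)])

print(gaussian_curvature(unit_sphere, 0.3, 1.1))  # ~1.0
```

A nice property of the formula is that flipping the normal’s orientation flips the sign of all three second-form coefficients, so \(K\) is unaffected.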
The visualization uses Plotly to create an interactive 3D surface where color represents curvature. I added hover functionality to display each token’s text, entropy, probability, and curvature value when you mouse over different parts of the sphere. In cases where the curvature shows extreme outliers, I implemented optional clipping to focus on the most informative range of values.
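Here is a bare-bones sketch of that plotting step. It assumes the per-token values have already been interpolated onto a regular \((\theta, \phi)\) grid, and it leaves out the hover-text formatting details:

```python
import numpy as np
import plotly.graph_objects as go

def plot_bumpy_sphere(theta, phi, r, curvature, hover_text, clip_pct=99.0):
    """Interactive bumpy sphere colored by Gaussian curvature.

    theta, phi, r, curvature : 2D arrays over the (theta, phi) grid
    hover_text               : 2D array of strings (token, entropy, ...)
    clip_pct                 : percentile used to clip curvature outliers
    """
    # Spherical to Cartesian coordinates.
    x = r * np.sin(phi) * np.cos(theta)
    y = r * np.sin(phi) * np.sin(theta)
    z = r * np.cos(phi)

    # Clip extreme curvature values so the color scale stays informative.
    lo, hi = np.percentile(curvature, [100 - clip_pct, clip_pct])
    color = np.clip(curvature, lo, hi)

    fig = go.Figure(go.Surface(
        x=x, y=y, z=z,
        surfacecolor=color,
        colorscale="Viridis",
        text=hover_text,
        hoverinfo="text",
    ))
    fig.update_layout(title="Token entropy sphere, colored by curvature")
    return fig
```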
What fascinates me about this visualization is how regions of high curvature often reveal moments where the model decisively shifts its internal state, snapping sharply toward a dominant token when it is highly confident. In these low-entropy regions, the model’s hidden state rapidly reorients, producing a pronounced bend in its trajectory. Conversely, when the model is more uncertain and distributes probability more evenly across many tokens (high entropy), the transition is smoother, resulting in relatively lower curvature even though the underlying probability surface may appear jagged due to many low-probability candidates.
For this example, I asked gpt-4o to write a philosophical essay about the scientific method. In the interactive visualization below you’ll find the temperature-0 run on the left and the temperature-1 run on the right. It’s quite interesting to notice that the curvature color gradient increases progressively along the longitude when the temperature is 0. That’s not true when the temperature is 1, where we see pockets of higher curvature amid relatively smooth regions. We can also observe that wherever the entropy is low, the curvature is high, and vice-versa.
Feel free to explore the interactive visualization below. Have fun!