Bruno Gavranović

Posted on April 20, 2026

Types and Neural Networks

[This is cross posted to the GLAIVE blog ]

Neural networks are used to generate an increasing share of code in languages that enable highly generic and provably correct programming: Idris, Lean, and Agda, for example.

However, most frontier models generating the code – Large Language Models – separate the process of training from the process of typechecking. They are trained to produce output of a fixed type: List Token. To get valid code, that output is then parsed after training, in multiple ad-hoc ways, into the types of a particular language.

What are these ad-hoc ways? Do they work, and should we expect them to? And more importantly, can we rebuild LLMs from the ground up so that they’re trained to produce typed output?

Enforcing types after training

Today, LLMs are trained to predict the next token in a given training corpus, resulting in functions of type

LLM : List Token -> D (List Token)

where D (List Token) here denotes a distribution over a list of tokens. LLMs consume an input prompt and produce this distribution one token at a time – sampling from the next-token distribution and feeding the result back in. For the purposes of this blog post, we’ll consider Token to simply be Maybe Char, i.e. either a character the model is able to generate, or a STOP token which signals the termination of the generative process.1
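To make this concrete, here is a toy sketch of the generative loop in Python. Everything in it is illustrative: the `toy` model, the encoding of Token = Maybe Char as "a character or `None`", and the factoring of the model into a per-token map List Token -> D Token are assumptions made for the sketch, not a description of any real architecture.

```python
import random

STOP = None  # Token = Maybe Char: either a character, or the STOP token (here: None)

def sample_token(dist):
    """Sample one token from a distribution given as {token: probability}."""
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return STOP  # guard against floating-point rounding

def generate(llm, prompt, max_len=100):
    """The autoregressive loop: sample from the next-token distribution
    and feed the result back in, until STOP is sampled."""
    tokens = list(prompt)
    for _ in range(max_len):
        tok = sample_token(llm(tokens))
        if tok is STOP:
            break
        tokens.append(tok)
    return tokens

# A stand-in "model": emit 'a' until the sequence has three tokens, then STOP.
toy = lambda ts: {STOP: 1.0} if len(ts) >= 3 else {"a": 1.0}
```

Running `generate(toy, [])` yields `["a", "a", "a"]`: three characters, followed by termination.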

The approaches in this section all take this trained network as fixed. They modify only the inference procedure, and fall along two axes:

  1. Granularity: how often is the typechecker consulted? When the model emits a STOP token, or after every token?
  2. Bandwidth: what does the typechecker communicate back? A structured error message, or merely a binary accept/reject signal?

These approaches sit at opposite extremes along these axes, and neither fundamentally solves typed generation, but we describe them here because they help us understand where the issue is.

1. Try; Compile; If Error Repeat

This is what most programmers do. They type raw text into the editor; the compiler either processes it into structured data, or returns an error the programmer has to internalise before resubmitting.

The same loop can be performed with LLMs.2 Given the task of producing a value of type List Int, the model generates a candidate – say, “[ 1 , 2 , 3 ] STOP” – and the controller hands it to a typechecker. If it passes, we’re done. If not, we feed the error back and ask the model to try again. Along our two axes, this approach has low granularity (the typechecker is consulted only after the STOP token is produced) and high bandwidth (errors are structured messages the model can reason about).
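A minimal sketch of that controller loop in Python. Both `model` and `typecheck` are hypothetical callables standing in for the LLM and the compiler; nothing here is a real API:

```python
def generate_with_retries(model, typecheck, task, max_attempts=5):
    """Try; Compile; If Error Repeat.
    Low granularity: the typechecker sees only complete generations.
    High bandwidth: the full error message is fed back into the prompt.
    `typecheck` returns None on success, or an error message string."""
    prompt = task
    for _ in range(max_attempts):
        candidate = model(prompt)
        error = typecheck(candidate)
        if error is None:
            return candidate  # the candidate typechecks: done
        # Otherwise, resubmit with the error appended for the model to reason about.
        prompt = task + "\nPrevious attempt: " + candidate + "\nError: " + error
    return None  # gave up after max_attempts
```

Note that nothing in this loop touches the weights: when the session ends, the next one starts from the same model.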

But there is a problem. Suppose the task is instead to produce a value of Either Char Double, and the model begins with “[ 1 , 2” – already unrecoverable, since no value of this type starts with [. Nothing notices until the full generation completes! At that point the entire sequence is rejected and the model starts over. This is incredibly wasteful.

It can be remedied by tightening the loop – generating fewer tokens before each check – and in the limit this is what experienced dependently-typed programmers do: make a small move, see the compiler’s response, make the next move. But tightening only addresses granularity. The LLM, unlike the programmer, does not carry what it learns across sessions: nothing in this approach updates the weights, meaning that the next session starts from the same state as the last.

2. Constrained decoding

A different approach is to consult the typechecker before each token is sampled, and mask out any tokens that would lead to an ill-typed result.

For instance, if the model is generating a value of type List Int and has already produced the tokens “[ 1 , 2 , 3”, we set the probability of alphabetic characters to zero and sample from the remaining tokens.3 Even though type inference is in general undecidable, this approach is well known, and has been applied to JSON schemas and restricted fragments of various type systems.4 Along our axes, it sits at the opposite corner: maximum granularity (every token) and minimum bandwidth (one bit per token). Here, the output is guaranteed to typecheck! But there is a big problem: this only prevents the model from saying certain things; it does not change what the model wants to say.
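A single masking step can be sketched as follows, in Python. The set of allowed next tokens is assumed to come from a typechecker oracle, which we simply hard-code here; per footnote 3, the masking happens at the level of logits, by setting entries to -inf before normalising:

```python
import math

def masked_softmax(logits, allowed):
    """One constrained-decoding step: mask disallowed tokens by setting
    their logits to -inf, then renormalise with softmax. Assumes at least
    one token is allowed."""
    masked = {t: (l if t in allowed else float("-inf")) for t, l in logits.items()}
    m = max(masked.values())  # subtract the max for numerical stability
    exps = {t: math.exp(l - m) for t, l in masked.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# The model prefers alphabetic tokens, but after "[ 1 , 2 , 3" the
# (hypothetical) typechecker only allows digits, ',', ' ' and ']'.
logits = {"a": 5.0, "b": 4.0, "4": 1.0, "]": 0.0}
dist = masked_softmax(logits, allowed={"4", ",", " ", "]"})
# dist assigns zero probability to "a" and "b", and renormalises the rest.
```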

To see this, consider the case where the model has learned to assign high probability to alphabetic characters in our example. Masking them out forces it to sample tokens it considers unlikely. At each step, the mask checks only whether a token can lead to a well-typed output, without accounting for how much of the model’s probability mass remains on each branch. This can trap generation into branches where only low-probability completions exist, producing output that typechecks but is increasingly nonsensical.5
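A toy numeric illustration of this trap, with made-up numbers. Suppose the only well-typed outputs are the two-token strings “ab” and “ba”:

```python
# The model's unconstrained next-token preferences, keyed by the prefix so far.
# It strongly prefers starting with 'a', but conditioned on 'a' it almost
# never wants to continue with 'b'.
model = {
    "":  {"a": 0.9,  "b": 0.1},
    "a": {"a": 0.99, "b": 0.01},
    "b": {"a": 0.98, "b": 0.02},
}

# Token-by-token masking follows 'a' first (it holds 0.9 of the mass) and is
# then forced into the low-probability completion 'b'.
p_ab = model[""]["a"] * model["a"]["b"]  # mass the model puts on "ab"
p_ba = model[""]["b"] * model["b"]["a"]  # mass the model puts on "ba"
# Yet "ba" carries over 10x more of the model's mass among well-typed strings:
# the per-token mask never looked at how much mass remains on each branch.
```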

As with the previous approach, no gradient flows through the mask, so the model cannot learn to adapt and assign higher probability to densely populated branches: the weights are never updated.

Learning types during training

The above approaches, as strange as it may seem, work. Frontier models using retry loops are making genuine progress: FrontierMath scores climbed from under 2% at release to almost 50% in less than two years, and so did scores on ARC-AGI-2 and SWE-bench. In chess, GPT-4 reaches roughly 1371 Elo (an intermediate player) as a byproduct of general training.

These are significant results. But in domains where we know how to train models that utilise structure during training, we see even more dramatic performance improvements. Chess is one of those. The system AlphaZero utilises the game structure during training and reaches superhuman Elo (>3400) with roughly 30,000x fewer parameters than GPT-4 (<60 million vs 1.8 trillion). And AlphaZero never makes an illegal move, compared to GPT-4 which makes one in 30% of games.

What about dependently-typed programming languages? These are an incredibly rich domain, and one can wonder what happens if we tightly integrate their compilers into neural network training. Unfortunately, this hasn’t happened yet: we don’t really know how to do that. Most current models generating typed code are playing the game without being told its rules. What is surprising is how good they are.

If chess is any indication of what happens to performance when you train using structure, encoding the rules of a language is worth figuring out. The only question is: how do you differentiate through type systems? How do you learn to produce a typed output, given that types are discrete, non-differentiable objects?

Differentiating through structure

To understand differentiation, let’s start with a simpler question. How do you write a function of the following type in Haskell, modelling a simple choice of output structure: a coproduct?

f :: (x, p) -> Either a b
f = ?

If you try to write it down, no matter how creative you get, it will end up having the shape

f (x, p) = if c (x, p) then Left (f_l (x, p))
                       else Right (f_r (x, p))

for some predicate c :: (x, p) -> Bool and two maps f_l :: (x, p) -> a, f_r :: (x, p) -> b. There is no other option.6 To produce a Left or a Right, you must commit to one or the other by partitioning the input via c. When it returns True you go left, otherwise you go right.

This decomposition is not an accident of Haskell – it is one of the properties of the category \(\mathbf{Set}\), which is an extensive category.7 And this property is central to research on differentiating maps into coproducts, most notably CHAD and Higher Order AD of Higher Order Functions. These works use this theorem and – seemingly, without retries or constrained decoding – show us how to build a network whose output is a structured type, at train time.

But as we’ve seen above, these approaches require us, the programmers, to partition the input space, instead of having the neural network learn it. This means that, given a network \(f : X \times P \to A + B\), if your model starts out guessing the wrong branch \(A\), it can never fix this through gradient updates, because the partition was never learnable to begin with.8

The same story applies to autodiff through other structures, such as pairs, inductive types, and function objects. All of these relinquish the model’s ability to learn the structure. This is an important perspective, but one that’s overlooked in the aforementioned literature. Of course, this is justified since their focus is autodiff, and not learning. But I have not seen this distinction acknowledged, and it deserves to be said explicitly.

Often, we don’t just want a method for propagating gradients through a known program; we want the ability to learn the program.

Differentiating with respect to structure

There is a different way to go about learning a program \(f : X \times P \to A + B\). Instead of fixing a partition, what if we learn it? This is, coincidentally, also the answer to our question of how to backpropagate gradients while making discrete choices: we never differentiate through discrete choices to begin with. Instead, we collect evidence for each discrete choice, normalise it, and then sample to decide what comes next.

That is, we use three maps:

  1. A differentiable map \(f_c : X \times P \to D(\text{Bool})\)
  2. A differentiable map \(f_A : X \times P \to A\)
  3. A differentiable map \(f_B : X \times P \to B\)

Then, to produce an output of type \(A + B\) given some \((x, p) : X \times P\), we apply \(f_c\) to \((x, p)\), and then sample from the resulting distribution. The resulting boolean tells us whether to use \(f_A\) or \(f_B\). Another way to say this: \(f\) is now a map of type \(X \times P \to D(A + B)\), producing a distribution over both the choice of output type and its value.

The question is now: how do we differentiate this, and train it using supervised learning? Perhaps surprisingly, this is straightforward. We differentiate only through the branch taken, and use the cross-entropy loss to compare the produced distribution with the target one. This means that, if we’re ready to provide these three neural networks, we can obtain an effectful neural network of type \(X \times P \to A + B\).
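Here is a minimal sketch of this recipe in Python, under toy assumptions: inputs are scalars, the parameters \((w, b)\) play the role of \(P\), \(f_c\) is a one-parameter logistic gate trained by SGD on the cross-entropy loss, and \(f_A\), \(f_B\) are identity-like stand-ins. The point is only that gradients flow through the gate’s distribution, never through the discrete sample itself:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_gate(data, lr=0.5, epochs=200):
    """Learn f_c : X x P -> D(Bool) by SGD on the cross-entropy loss.
    `data` is a list of (x, target_is_right) pairs labelling branches."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target_is_right in data:
            p_right = sigmoid(w * x + b)
            # The gradient of cross-entropy wrt the logit simplifies to (p - y).
            grad = p_right - (1.0 if target_is_right else 0.0)
            w -= lr * grad * x
            b -= lr * grad
    return w, b

def f(x, w, b):
    """Produce a value of A + B: sample the branch from f_c's distribution,
    then apply f_A or f_B (identity-like stand-ins here)."""
    go_right = random.random() < sigmoid(w * x + b)
    return ("Right", x) if go_right else ("Left", x)

# Toy supervision: negative inputs belong to the Left branch, positive to Right.
data = [(-2.0, False), (-1.0, False), (1.0, True), (2.0, True)]
w, b = train_gate(data)
```

After training, the gate puts almost all of its mass on the correct branch for each input; taking the argmax instead of sampling makes the choice deterministic.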

This approach produces well-typed output by construction. And unlike the previous approach, it allows the network to learn the right output type. It also directly subsumes the previous approach: setting the sampling temperature to t=0 recovers argmax, i.e. differentiation through structure.

Even more, this coproduct can be thought of as an instance of a dependent pair indexed by a finite set. This generalises in a number of different ways, and can be iterated to produce an array of typed structures, all in a completely principled way! As we’ll see (in future posts) – the thing that enables this is the theory of containers, and the only way to proceed with this is to build the requisite typed infrastructure!

It’s incredibly exciting.

Of course, this idea has to some extent been studied in the literature. In the crudest form, teacher forcing is an untyped variant of this. Abstract Syntax Networks, which appeared in 2017, also describe this, but without (dependent) types. Just like AlphaZero, ASNs predate the LLM era, but have not been scaled further. Perhaps that was due to the lack of prominence of dependently typed languages, where this kind of approach pays off. Who knows?

And of course, AlphaZero and AlphaProof are worth mentioning. They use constrained decoding, but at train time, updating only the chosen branch via reinforcement learning. In AlphaZero, this reached superhuman performance on Chess, Go, and Shogi. But when it comes to AlphaProof, the situation is different. Even though the reinforcement learning approach is similar, the output space of all the models is still just List Token. This means that the types shape which tactics receive reinforcement, but the gradients still flow through the coarse typing of tokens, completely decoupled from the types of the underlying programming language.

Final words

LLMs operate on a flat token vocabulary. Every structural technique we apply around them – such as retries or constrained decoding – is an attempt to impose meaning on outputs that were generated without any notion of meaning. What differentiating with respect to structure proposes is that the output space itself can carry that meaning. A token is no longer an opaque character or subword; it is tied to the meaning of the domain in which we’re generating output. This means that the more structured our programming language is, the bigger the benefit.

It might appear that this is an argument against scale, and the Bitter Lesson. That is not the case. I see this as a move that lets scale do its work on the right object. As with chess, where encoding the game rules into training produces a leap that no amount of inference-time search can today match, the move here is to encode the programming language itself into the training, and apply scale on a structure that actually reflects what we’re trying to produce.

Lastly, the distinction between differentiating through structure and with respect to structure is essentially the distinction between treating type-level choices as fixed scaffolding versus treating them as something the network can learn. Both produce well-typed output, but only the latter enables the particulars of that output to be learned. The mathematical language for describing these typed action spaces uniformly – for sums, products, inductives, and dependent types alike – ends up being the language of containers and dependent lenses.

That is where this line of work gets interesting.



Thanks to Jules Hedges and Andre Videla for providing feedback on a draft of this post.


  1. In practice, a token consists of either 1) a group of 2-4 characters chosen to approximate an entropy-optimal encoding (with more frequent character groups getting shorter codes), or 2) one of the opaque meta-tokens which carry no character content, but are instead used to send control-flow signals to the controller running the LLM. A basic example of this is the STOP token which, if sampled, causes the controller to terminate the generation process. Other examples include the BOS (beginning of sequence) token, a turn delimiter (used in user-facing models such as ChatGPT), tool-call tokens, and more. Tokens are usually even more involved, and aren’t defined at the character level, but at the byte level. This process – the choice of the type Token together with a function List Char -> List Token – is called “tokenisation”, with many different, competing approaches.↩︎

  2. This is what Claude Code does when invoked with the --json-schema command line option. (Notably, this contradicts their official documentation, though in reality it seems that both approaches are used when generating JSON.)↩︎

  3. Technically, since we’re dealing with unnormalised logits before softargmax is applied, we need to set them to -inf.↩︎

  4. Structured decoding is used in Claude Code, and studied in papers such as Type-constrained code generation, and Grammar-Aligned Decoding.↩︎

  5. Grammar Aligned Decoding (Example 3) unpacks this in more detail.↩︎

  6. Syntactically you can write this in many ways, but semantically they are all the same.↩︎

  7. Category-theoretically, extensivity studies properties of maps into a coproduct, rather than (the more usual) question of maps out of a coproduct. Extensivity tells us that given a map \(f : X \to A + B\), there exists two sets \(X_L\) and \(X_R\) and two maps \(f_L : X_L \to A\) and \(f_R : X_R \to B\) such that \(X \cong X_L + X_R\) and such that \(f\) is equal to \(X \xrightarrow{\cong} X_L + X_R \xrightarrow{f_L + f_R} A + B\).↩︎

  8. A more precise statement is: \(X \times P\) is partitioned into two disjoint components. If at training initialisation \(f\) assigns some input \((x, p)\) to the left branch, then there are no gradient updates on \(p\) that can be made to result in \(p'\) such that \((x, p')\) lands in the right branch.↩︎

Licensed under CC BY-SA 4.0. Site proudly generated by Hakyll. The theme originates in the Hakyll-CSSGarden.