Neuroevolutionary Wordle - Output Embedding

Following on from the post on designing the Wordle-playing Neural Net, part of the broader series of posts on Neuroevolutionary Wordle, we expand here on the design of the model’s output embedding.

First, a toy example. Then we look at the real design.

Toy Example

The output embedding is best thought of as a graph with some vectors on it. Let us consider a toy example to explain this concept.

Imagine we have a 3-dimensional graph with axes labelled A, B and C. We can plot a point at \( (1,1,1) \), and say that that is the vector for the word CAB, meaning taxi. We can draw a vector from the origin to \( (1,1,1) \) and call that the vector that “means” cab, if we want.

Now we add another word vector to our graph: the word AB, being short for ‘abseil’. As it has an A and a B but no C, our system might turn this into a vector from the origin \( (0,0,0) \) to \( (1,1,-1) \). The negative 1 implies that there is no C in the word AB.

Suppose we have a clever neural net which outputs a vector. Which word in our two-word dictionary does this output vector “mean”?

Let us suppose the output vector is \( (10,10,10) \). Let’s use the dot product method to figure out which word this means.

$$ \mathbf{v} = (10,10,10) $$

Candidate output-embedding vectors:

$$ \mathbf{a} = (1,1,1) $$

and:

$$ \mathbf{b} = (1,1,-1) $$

Their dot products with the policy vector are:

$$ \mathbf{v} \cdot \mathbf{a} = 10\cdot1 + 10\cdot1 + 10\cdot1 = 30 $$

and:

$$ \mathbf{v} \cdot \mathbf{b} = 10\cdot1 + 10\cdot1 + 10\cdot(-1) = 10 $$

Since \( 30 > 10 \), the vector \( (1,1,1) \) is a better match to the policy output than \( (1,1,-1) \). Therefore the output vector “means” CAB much more than it means AB. Our Neural Net has output CAB as its next guess in our word game.
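The toy calculation above can be sketched in a few lines of Python, with the word names and values taken directly from the example:

```python
# Toy output embedding: one entry per word, on axes A, B and C.
embedding = {
    "CAB": (1, 1, 1),    # contains A, B and C
    "AB":  (1, 1, -1),   # contains A and B, but no C
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

policy = (10, 10, 10)  # the net's output vector

# Score every candidate word; the biggest dot product wins.
scores = {word: dot(vec, policy) for word, vec in embedding.items()}
best = max(scores, key=scores.get)
print(best, scores)  # CAB {'CAB': 30, 'AB': 10}
```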

Real Output Embedding Structure

In reality, the output embedding is a more complex graph. The vector space is 64-dimensional. The first 26 dimensions are hard-coded, and actually mean something concrete: they refer to letters of the alphabet. Their values are ’non-trainable’, meaning they never change when the Genetic Algorithm or Reinforcement Learning take place. The word CABAL is represented by a vector in this space where the A, B, C and L ordinates are always 1. The D ordinate, like that of every other absent letter, is always -1. (Note that this is a presence-only check, not a count: the second A in CABAL does not push its A ordinate to 2.) This will never change with training.
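A minimal sketch of how such a hybrid vector might be built, assuming the fixed letter dimensions encode presence as +1/-1 and the 38 trainable dimensions start from a random initialisation (the function names here are illustrative, not the actual implementation):

```python
import random

EMBED_DIM = 64
N_LETTERS = 26  # fixed, non-trainable letter-presence dimensions

def letter_presence(word):
    # +1 if the letter appears anywhere in the word, -1 if not.
    # Presence-only: the second A in CABAL does not push its A ordinate to 2.
    present = set(word.upper())
    return [1.0 if chr(ord("A") + i) in present else -1.0
            for i in range(N_LETTERS)]

def make_embedding(word):
    # Fixed 26-value prefix plus 38 trainable values (random init here;
    # training would only ever update this trainable suffix).
    trainable = [random.uniform(-1, 1) for _ in range(EMBED_DIM - N_LETTERS)]
    return letter_presence(word) + trainable

vec = make_embedding("CABAL")
# A (index 0), B (1), C (2) and L (11) are 1; D (3) is -1.
```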

The other 38 dimensions are fully trainable. They don’t mean anything conceptually, but in time will come to encode useful distinctions for the model. It might decide, as a result of its training, that one of the dimensions represents how early a word is in the alphabet, giving RAISE and ARISE different vectors, despite being anagrams. The truth will likely be messier, and we will never have a satisfactory answer to what a trainable output embedding vector means, in the same way that weight parameters don’t really mean anything.

Overall, the output embedding is a 64-dimensional vector space, with about 5,000 vectors on it. Each vector is labelled with a 5-letter word that is an element of the action space of the model. When the output head spits out a 64-dimensional vector, the model computes dot products against each of the candidate vectors. Biggest dot product wins. A modest amount of action masking enforces the rule that you’re not allowed to play the same word twice in a Wordle.
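Putting the pieces together, the selection step might look like the following sketch; the function and table names are assumptions for illustration, and the real model would use matrix operations rather than a Python loop:

```python
import math

def choose_word(policy_vec, embedding_table, already_played):
    # Pick the word whose embedding has the biggest dot product with
    # the policy vector, masking out words already guessed this game.
    best_word, best_score = None, -math.inf
    for word, vec in embedding_table.items():
        if word in already_played:
            continue  # action masking: the same word can't be played twice
        score = sum(p * e for p, e in zip(policy_vec, vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Reusing the toy three-dimensional table for illustration:
table = {"CAB": (1, 1, 1), "AB": (1, 1, -1)}
print(choose_word((10, 10, 10), table, already_played={"CAB"}))  # AB
```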

Why Use Output Embedding?

This output embedding allows the Neural Net to compress its output into a 64-value vector. The alternative to using output embedding would have required one output value for each of the ~5k words in the action space. These values would have been logits: unnormalised scores, convertible to probabilities, indicating how strongly the model prefers each word as its next action.

A direct 4,739-way output head would be feasible in principle, but would increase parameter count and scoring cost considerably, especially in a setting involving large populations and many evaluations. The Genetic Algorithm will need a large population, many generations, and must run inference on the model a number of times to compute its fitness function. The embedding approach keeps the policy head fixed at 64 outputs and shifts most action structure into the embedding table.
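As a back-of-envelope illustration of the saving in final-layer weights (the hidden width of 256 is a purely hypothetical figure, not the actual architecture, and biases are ignored):

```python
HIDDEN = 256     # hypothetical width of the final hidden layer
N_WORDS = 4739   # size of the action space
EMBED_DIM = 64
FIXED_DIMS = 26  # letter-presence dimensions never train

direct_head = HIDDEN * N_WORDS                        # one logit per word
embed_head = HIDDEN * EMBED_DIM                       # 64-value policy head
trainable_table = N_WORDS * (EMBED_DIM - FIXED_DIMS)  # the 38 free dims

print(direct_head)                   # 1213184
print(embed_head + trainable_table)  # 196466
```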

A 5k-neuron output head is computationally expensive, far more than is desirable for a model running on domestic hardware, particularly where genetic algorithms are involved. The number of generations, the population size, and the number of Wordle games that must be played to compute the fitness function meant that output embedding was necessary.