Neuroevolutionary Wordle - Coding the Model in CUDA
At this stage of the project, the codebase is still mostly about forward inference rather than training. The genetic algorithm, reinforcement learning, and performance tuning will come later. For now, the focus is on getting the model structure into a shape that is simple, explicit, and testable.
The broad idea is straightforward enough. There is a small Wordle game layer, a set of model components that turn a game state into a policy vector, and an output embedding that turns that policy vector into an actual word. The code has gradually been split so that these concerns are easier to see in the tree.
Folder Structure
The most useful split at the moment is between src/wordle and src/model.
src/wordle/ contains the game-side data structures and rules. This is where the code lives for words, turns, tile feedback (green/yellow/grey), and the WordleGrid itself. All knowledge of the game's rules lives here.
src/model/ is where the neural net side lives. That folder is now split into a few smaller pieces:
- src/model/model_input adapts a WordleGrid into the fixed-size vector consumed by the main network
- src/model/input_encoder contains the per-turn encoding logic
- src/model/dense_trunk contains the main dense network trunk
- src/model/policy_model ties model-input construction and the dense trunk together
- src/model/output_embedding scores the policy output against the action space
There is also a small src/common folder for reusable low-level pieces like fixed-size buffers, CUDA compatibility
macros, and FP16 helpers.
The result is a structure that reads roughly from left to right:
WordleGrid
-> model input
-> dense trunk
-> policy vector
-> output embedding
-> chosen word
That is not the final shape of the whole project, but it is a decent shape for the current stage.
A Few Core Data Structures
One thing I have tried to keep simple is the representation of data. Most of the important types are fixed-size value types rather than heap-owning structures.
A Word is just five letter indices. A Turn is a Word plus five feedback values. A WordleGrid is the hidden
solution, up to six turns, and a turn count. It also has a small convenience method, isVirgin(), which is used by the
model-input code to indicate whether no guesses have yet been played.
The isVirgin() flag was a more recent design decision, made after I wrote the neural-net design. I felt it was necessary because, depending on implementation details within the network, an all-zero input might be a bad idea. I didn't want zeros propagating through the system to the output vector, which is then scored against every candidate action in the output embedding by dot product; an all-zero policy vector would give every action a score of zero. The input therefore carries a single boolean 'isVirgin' value to guarantee at least one non-zero input.
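The value types described above can be sketched in a few lines. The type and method names (Word, Turn, WordleGrid, isVirgin()) come from the post; the field names, the Feedback enum, and the exact layouts are my assumptions for illustration, not the project's real definitions.

```cpp
#include <array>
#include <cstdint>

// Tile feedback for one letter of a guess (illustrative enum).
enum class Feedback : uint8_t { Grey, Yellow, Green };

// A Word is just five letter indices (0..25).
struct Word {
    std::array<uint8_t, 5> letters{};
};

// A Turn is a Word plus five feedback values.
struct Turn {
    Word guess{};
    std::array<Feedback, 5> feedback{};
};

// A WordleGrid is the hidden solution, up to six turns, and a turn count.
struct WordleGrid {
    Word solution{};
    std::array<Turn, 6> turns{};
    int turnCount = 0;

    // True when no guesses have been played yet; used by the
    // model-input code to set the extra non-zero input flag.
    bool isVirgin() const { return turnCount == 0; }
};
```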
On the model side, the same general style continues. The code uses fixed-size buffers heavily, which makes both CPU tests and CUDA code easier to reason about. There is no framework-style tensor machinery here. The shapes are known in advance, and the code tends to say so directly.
That gives the model path a fairly explicit sequence:
- A turn is expanded into one-hot input features.
- An occupied turn is passed through a shared encoder.
- Up to five encoded turns are packed into one model-input vector.
- A single extra scalar is prepended to that vector to indicate whether the grid is virgin.
- The dense trunk maps that model input to a 64-dimensional policy vector.
- The output embedding scores candidate words by dot product against that policy vector.
At the moment the model-input vector is 321 values wide: one virgin-grid flag, followed by five 64-dimensional turn encodings.
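The packing step can be sketched as follows. The constants mirror the sizes above (1 + 5 × 64 = 321), but the function name and signature are illustrative rather than the project's real API.

```cpp
#include <array>

constexpr int kTurnEncodingDim = 64;
constexpr int kMaxEncodedTurns = 5;
constexpr int kModelInputDim = 1 + kMaxEncodedTurns * kTurnEncodingDim;  // 321

using TurnEncoding = std::array<float, kTurnEncodingDim>;
using ModelInput = std::array<float, kModelInputDim>;

// Pack encoded turns into the fixed-size model input. Unoccupied turn
// slots stay at zero; the leading flag guarantees a non-zero input
// even for a virgin grid.
ModelInput packModelInput(const TurnEncoding* encodedTurns,
                          int turnCount, bool isVirgin) {
    ModelInput input{};                    // zero-initialised
    input[0] = isVirgin ? 1.0f : 0.0f;     // slot 0: virgin-grid flag
    for (int t = 0; t < turnCount && t < kMaxEncodedTurns; ++t) {
        for (int i = 0; i < kTurnEncodingDim; ++i) {
            input[1 + t * kTurnEncodingDim + i] = encodedTurns[t][i];
        }
    }
    return input;
}
```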
The Output Embedding
The output embedding is probably the most unusual part of the model design, so it is worth mentioning briefly here.
Each action in the action space is represented by a 64-dimensional embedding. The first 26 dimensions are derived
directly from the word itself. In the current implementation they are count-based: a letter contributes +1 if it
appears once, +2 if it appears twice, and so on, while letters that do not appear contribute -1.
This negative value allows the model to clearly communicate to the output embedding that it doesn’t want a given letter to appear in the chosen word.
The remaining 38 dimensions are trainable parameters.
The model therefore does not produce one output neuron per word. Instead, it produces a 64-dimensional policy vector. Every action embedding is scored against that vector by dot product, and the highest-scoring action wins.
This keeps the whole output side fairly compact, and it also makes the fixed, intelligible part of the action representation explicit in code.
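A minimal sketch of both halves of this design, the 26 count-based dimensions and the dot-product argmax, might look like the following. The dimensions follow the post; everything else (the function names, leaving the 38 trainable slots at zero) is assumed for illustration.

```cpp
#include <array>
#include <cstdint>

constexpr int kEmbedDim = 64;
constexpr int kLetterDims = 26;

using Embedding = std::array<float, kEmbedDim>;

// First 26 dimensions of an action embedding: a letter contributes +1
// per occurrence in the word, and -1 if it does not appear at all.
// The remaining 38 slots would hold trainable parameters (zero here).
Embedding letterCountFeatures(const std::array<uint8_t, 5>& letters) {
    Embedding e{};
    for (int c = 0; c < kLetterDims; ++c) e[c] = -1.0f;  // absent letters
    for (uint8_t l : letters) {
        if (e[l] < 0.0f) e[l] = 1.0f;   // first occurrence: -1 -> +1
        else e[l] += 1.0f;              // each repeat adds another +1
    }
    return e;
}

// Score every action embedding against the policy vector by dot
// product; the highest-scoring action wins.
int bestAction(const Embedding& policy,
               const Embedding* actions, int numActions) {
    int best = 0;
    float bestScore = -1e30f;
    for (int a = 0; a < numActions; ++a) {
        float score = 0.0f;
        for (int i = 0; i < kEmbedDim; ++i) score += policy[i] * actions[a][i];
        if (score > bestScore) { bestScore = score; best = a; }
    }
    return best;
}
```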
Dense Layers Without Too Much Machinery
The dense-network code is deliberately lightweight right now. Parameters are stored in fp16 for compactness, but the
forward pass converts values to float as it goes. The code is built around explicit loops and fixed-size buffers
rather than around a larger linear algebra framework.
That approach is not especially glamorous, but it is a good fit for a project that is still at the ‘make the architecture concrete’ stage. It keeps the implementation easy to inspect, and it means the CUDA path and the CPU test path can stay close to one another.
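As a rough illustration of that style, here is one dense layer written as explicit loops over fp16 parameter bits, widened to float as they are read. The conversion helper covers only normal values and zero for brevity, and the ReLU activation is my assumption, not something the post specifies.

```cpp
#include <cstdint>
#include <cstring>

// Minimal fp16 -> float conversion: handles normal values and zero
// (subnormals, inf and NaN are omitted in this sketch).
float halfToFloat(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = sign;
    if (exp != 0) bits |= ((exp - 15 + 127) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// One dense layer as explicit loops: parameters stored compactly as
// fp16 bits, accumulated in float. Names and signature are illustrative.
void denseForward(const uint16_t* weights,  // [outDim][inDim], fp16 bits
                  const uint16_t* biases,   // [outDim], fp16 bits
                  const float* input, float* output,
                  int inDim, int outDim) {
    for (int o = 0; o < outDim; ++o) {
        float acc = halfToFloat(biases[o]);
        for (int i = 0; i < inDim; ++i)
            acc += halfToFloat(weights[o * inDim + i]) * input[i];
        output[o] = acc > 0.0f ? acc : 0.0f;  // ReLU (assumed activation)
    }
}
```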
Testing Strategy
The testing strategy follows the same broad idea as the folder structure: keep local concerns easy to test, then add a smaller number of end-to-end checks on the GPU.
There are ordinary CPU unit tests for the Wordle logic and for each main model component:
- game-state handling
- turn feature encoding
- the shared encoder
- model-input construction
- the dense trunk
- the policy-model wrapper
- the output embedding
On top of that, there are CUDA integration tests that exercise the real device path. These do not stop at tiny helper functions. They build actual game state, run the forward path on device, and check the results that come back.
That GPU coverage matters here because this is not a project where CUDA is an optional acceleration layer bolted on later. CUDA is the environment the code is being written for. If the device path is wrong, the project is wrong.
It also helps me enormously that I only work on this project on a machine with a 5070 Ti or a 5050 GPU, so the device path is always available.
CTest labels are used to distinguish CPU and GPU tests.
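A sketch of how such labelling can look in CMake, with made-up target and test names rather than the project's real ones:

```cmake
# CPU unit tests and GPU integration tests get distinct CTest labels.
add_test(NAME model_input_cpu COMMAND model_input_tests)
set_tests_properties(model_input_cpu PROPERTIES LABELS "cpu")

add_test(NAME forward_path_gpu COMMAND cuda_integration_tests)
set_tests_properties(forward_path_gpu PROPERTIES LABELS "gpu")
```

With labels in place, `ctest -L cpu` and `ctest -L gpu` run only the matching suite.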
Where This Leaves the Project
The project now has a reasonably clear forward-inference pipeline:
WordleGrid
-> shared input encoding
-> 321-value model input
-> dense trunk
-> 64D policy vector
-> output embedding
-> chosen word
That is still only part of the overall ambition for Neuroevolutionary Wordle, but it is a solid place to be. The code is now split by concern in a way that is easier to navigate, the core data structures are simple, and the testing strategy includes real GPU-backed end-to-end coverage rather than pretending that host-only tests are enough.
The next interesting work will be less about code layout and more about training, parameter initialisation, action masking, and eventually the actual genetic algorithm, which I am particularly looking forward to developing.