Neuroevolutionary Wordle - Genetic Algorithm Design Concepts

The core concept is that a genetic algorithm will make changes to the weights of a neural net. This neural net plays Wordle.

The GA will run on domestic hardware, specifically an Nvidia 5070 Ti. At time of writing, that is very good domestic hardware, but it does create some constraints regarding memory usage. There are 16GB of VRAM available.

Existing Plan for the Neural Net

The CUDA coding for the neural net is in progress. In fact, it has barely begun. The Wordle-playing neural net design isn't fixed yet, but in its current form it calls for approximately 330,000 model weights. The output embedding accounts for around half of these.

The wordlists used have been minimised to an action space of ~5k words. This is around a third of the action space of allowable guesses in a real Wordle.

Even with the action space stripped down to a third of the maximum size through wordlist curation, we are still looking at a model with ~330k parameters, half of which are in the output embedding.

Precision Reduction & Memory Usage

An early draft of the CUDA code for the input encoder of the model implied float32 weights. This has been reduced to the native CUDA __half type - essentially a float16. This gives a model size of approximately 660kB. On the back of an envelope, this puts a loose upper bound on the genetic algorithm population size of 24k. This is by no means bad for a GA, and it's much better than the ~6k we would have seen at float32 with triple the action space, but it's not massive. We should remember that there are other overheads:

  • activations;
  • fitness evaluation state;
  • in-progress recombination and mutation;
  • miscellaneous population bookkeeping.
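The back-of-envelope figures above can be reproduced with a quick calculation. The 330k parameter count, the ~165k-weight output embedding, and the 16GB budget come from the text; treating a gigabyte as 10^9 bytes is my assumption, chosen because it matches the quoted ~24k and ~6k bounds:

```python
# Back-of-envelope GA population bounds, using the figures in the text.
PARAMS = 330_000          # approximate model weight count
VRAM_BYTES = 16e9         # 16 GB of VRAM, decimal gigabytes (assumption)

half_model = PARAMS * 2   # bytes per model at float16 (2 bytes per weight)
print(half_model)         # 660,000 bytes ~= 660 kB

upper_bound_fp16 = VRAM_BYTES // half_model
print(upper_bound_fp16)   # ~24k models, ignoring activations and bookkeeping

# For comparison: float32 weights with triple the action space.
# The ~165k-weight output embedding triples; the rest stays fixed.
params_big = (PARAMS - 165_000) + 3 * 165_000   # 660,000 weights
full_model = params_big * 4                      # 4 bytes per weight
upper_bound_fp32 = VRAM_BYTES // full_model
print(upper_bound_fp32)   # ~6k models
```

The other overheads listed above eat into these bounds, so the real ceiling is somewhat lower.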

Overall, the project provides sufficient excuse to play around with some fun ways of improving performance of a GA with limited population size.

Smaller Population Sizes as a Core Constraint

In evolutionary computation, a small population size may yield poor results. Difficulty in maintaining diversity leads to premature convergence - a polite way of saying excessive inbreeding - and to early takeover by lucky lineages. This pushes the search into a local optimum with minimal exploration of the search space.

Reducing model size, simplifying the evaluation pipeline, and pushing for a bigger population may help here. However, I am particularly interested in structural mitigations that seek to make a smaller population size behave like a larger one. In particular, I am playing around with ideas involving locality.

Locality - Core Concept

Organisms in a genetic algorithm can (if desired) have co-ordinates representing a physical location within a population. Not all GAs have this, but it seems likely that mine will.

Locality in Breeding - Spatially Aware Mating in Structured Populations

The concept of spatially aware mate selection is that candidate solutions in the genetic algorithm only get spliced together to produce the next generation if they are spatially close to each other. Fitness must still confer an advantage, so we retain selection pressure, but locality plays a role. This is analogous to organisms in the physical world breeding with the best among their neighbours.

What this aims to achieve is slowing down the spread of ‘good’ genes. This may not sound like a good thing, but in smaller populations we are seeking to avoid a ‘lucky lineage’ with one good mutation taking over completely, suddenly wiping out our genetic diversity. Slowing the spread of an early winning strategy may allow it to combine with other elements from a broader gene pool.

A spatially structured population may help to reduce premature convergence by slowing the diffusion of genes. This is largely an experimental approach, and the outcome remains uncertain, but I hope it will preserve diversity at a smaller population size.
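A minimal sketch of the idea, assuming candidates live on a small 2D torus and mate with the fittest of their immediate neighbours. The grid size, radius, and all names here are illustrative, not the project's actual CUDA design:

```python
import random

GRID = 8       # population arranged on an 8x8 torus (illustrative size)
RADIUS = 1     # mates must come from the immediate neighbourhood

def neighbours(x, y, radius=RADIUS):
    """Grid coordinates within `radius` of (x, y), wrapping at the edges."""
    cells = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if (dx, dy) != (0, 0):
                cells.append(((x + dx) % GRID, (y + dy) % GRID))
    return cells

def pick_mate(fitness, x, y):
    """Choose the fittest candidate among the spatial neighbours of (x, y).

    Fitness still matters, so selection pressure remains, but a globally
    dominant genome cannot mate outside its own neighbourhood."""
    return max(neighbours(x, y), key=lambda cell: fitness[cell])

# Toy population: one random fitness value per grid cell.
random.seed(0)
fitness = {(x, y): random.random() for x in range(GRID) for y in range(GRID)}
mate = pick_mate(fitness, 3, 3)
```

Because a good genome can only reach its neighbours each generation, its spread across the torus takes many generations rather than one.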

Locality in Fitness Evaluation

In simple genetic algorithms, the fitness function and any data backing it are applied evenly to all candidate solutions in each generation. Some more complex GAs may use training data sampling to achieve quicker calculation of fitness. The fitness then affects the probability that the candidate solution will be chosen for recombination.
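As a concrete (non-spatial) baseline, fitness-proportional selection can be sketched as below; `random.choices` does the roulette-wheel weighting. The population and fitness values are illustrative placeholders:

```python
import random

def select_parents(population, fitnesses, k=2):
    """Roulette-wheel selection: a candidate's chance of being picked
    for recombination is proportional to its evaluated fitness."""
    return random.choices(population, weights=fitnesses, k=k)

random.seed(1)
population = ["model_a", "model_b", "model_c", "model_d"]
fitnesses = [0.1, 0.4, 0.4, 0.1]   # higher fitness => more likely parent
parents = select_parents(population, fitnesses)
```

The spatial variants discussed in this post modify which candidates are eligible and which data backs their fitness, not this basic fitness-to-probability mapping.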

I am considering spatially aware training data sampling. The rough idea is that shards of training data are not immediately globally applicable. Initially at least, a shard has an effective range, influencing calculated fitness only within a given spatial neighbourhood. Later, its effective range would grow, and eventually encompass the whole of the population.

It should be acknowledged that this implies a non-stationary fitness landscape by design, giving the proposed solution more of an experimental flavour.
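One way a shard's effective range might be modelled - a minimal sketch assuming a 2D grid of positions and a fixed geometric growth rate, neither of which is settled design:

```python
import math

class Shard:
    """A slice of training data anchored at a grid position.

    It only contributes to the fitness of candidates within its current
    effective radius; the radius grows each generation until the shard
    eventually covers the whole population."""
    def __init__(self, centre, initial_radius=1.0, growth=1.5):
        self.centre = centre
        self.radius = initial_radius
        self.growth = growth          # radius multiplier per generation

    def step(self):
        self.radius *= self.growth

    def applies_to(self, position):
        cx, cy = self.centre
        px, py = position
        return math.hypot(px - cx, py - cy) <= self.radius

def local_fitness(candidate_pos, shards, evaluate):
    """Average fitness over only the shards in range of this candidate.

    The landscape varies across the grid, and across generations as the
    radii grow - non-stationary by design."""
    in_range = [s for s in shards if s.applies_to(candidate_pos)]
    if not in_range:
        return 0.0
    return sum(evaluate(s) for s in in_range) / len(in_range)
```

With `growth > 1`, every shard eventually reaches every position, which is the path to eventual consistency discussed below.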

This idea seems worthy of exploration, for a couple of reasons.

Locality of Training Data - Initial Effects

Initially, training data shards act locally, so the landscape of the problem varies with the conceptual spatial position of a candidate solution. This provides an obvious way to improve diversity in a small population.

Effects of Training Data ‘Spreading’

Where training data can spread, we have a path to eventual consistency across the population. This avoids a situation in which local experts good at solving one part of the problem fail to generalise to a workable model.

What a Phased Curriculum Means in Wordle

In a Wordle game played by neural nets along the lines I am designing, a phased curriculum creates an interesting problem.

At first glance, the only issue is that gradually introducing new 5-letter words into the action space makes the models larger. This means the population size must decline as the GA progresses, to stay within the limits of the GPU being used.
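This trade-off can be made concrete with the same back-of-envelope arithmetic as before. The per-word embedding cost of ~33 weights is inferred from the text's figures (~165k embedding weights over ~5k words) and is an assumption for illustration only:

```python
# How the population ceiling shrinks as the action space grows.
# Assumed split (inferred from the text): ~165k non-embedding weights,
# plus an output embedding costing ~33 weights per word in the action space.
NON_EMBEDDING = 165_000
WEIGHTS_PER_WORD = 33
BYTES_PER_WEIGHT = 2          # float16
VRAM = 16e9                   # decimal gigabytes (assumption)

def max_population(n_words):
    """Loose upper bound on population size for a given action space."""
    params = NON_EMBEDDING + WEIGHTS_PER_WORD * n_words
    return int(VRAM // (params * BYTES_PER_WEIGHT))

for n in (5_000, 10_000, 15_000):
    print(n, max_population(n))
```

Under these assumptions, growing the action space from ~5k to ~15k words roughly halves the population ceiling, which is why the curriculum starts with a larger population of smaller models.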

The more interesting matter is how to insert a new action into the action space in the output embedding of our model. A Wordle-specific solution to the problem of late action space injection will be the topic of another post.

Summary

The core concept of my Neuroevolutionary Wordle project is to run a genetic algorithm to determine the weights of a neural net. (Reinforcement learning and fine-tuning are topics to be addressed once the GA has run.)

A GA in which the population members are ML models presents certain challenges. The desire to have thousands of models competing to breed inside a single GPU places limits on their size, and initial steps have been taken to reduce model size. First, the action space was capped at around 5k words, casting aside most of the ~15k words you are allowed to play in a real Wordle. Second, the model weight precision was dropped early in the design phase to float16. It could be increased again in the reinforcement learning stage, should that seem appropriate, but the reduction immediately halves the memory footprint of each model.

More interesting are the evolutionary computation tricks being used to get the most out of a small-ish population. Giving models conceptual physical positions allows locality in selection and recombination. Locality in fitness evaluation - the concept that training data shards only take effect in part of the ‘map’ - preserves diversity. Candidate selection around the edges of a data shard's expanding front combines solutions effective at solving it with other genetic material, allowing better exploration of the search space around a local optimum. The hope is that this avoids premature convergence around sub-optimal regions of the search space.

The concept of phased curriculum - rolling out the training data and action space slowly - allows us to start with a larger population of smaller models. We can then reduce the population size as the models’ output embeddings grow in size.