Neuroevolutionary Wordle - Recombination

Recombination means taking two genotypes and combining them into a child. So far in my project, I have tried scalar crossover, and it has been a bit of a disaster.

Scalar crossover means every individual parameter has a 50% chance of coming from one parent, and a 50% chance of coming from the other. A neuron’s incoming weights, bias, and outgoing weights end up split between the two parents, which breaks up parameters that only work together. The fitness evaluations flatline as a result, and there is basically no evolution.
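To make the problem concrete, here is a minimal sketch of scalar crossover. It assumes each genotype is a flat list of floats of equal length; the function name and the toy neuron layout are made up for illustration and are not taken from the project code.

```python
import random

def scalar_crossover(parent_a, parent_b, rng=random):
    """Per-parameter crossover: each value is a coin flip between the parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of parameters")
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

# Hypothetical toy neuron: two incoming weights, one bias, one outgoing weight.
neuron_a = [0.9, -0.4, 0.1, 1.5]
neuron_b = [-0.7, 0.3, -0.2, -1.1]

child = scalar_crossover(neuron_a, neuron_b, random.Random(0))
print(child)  # usually a mix of both parents' values within a single neuron
```

With only four parameters, the child is an exact copy of one parent just 2 times in 16. In a real network with thousands of parameters, essentially every neuron ends up as a blend of two unrelated neurons.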

Model Design

My models have an input encoder. This may be used up to 5 times, because it is possible for there to be up to 5 previous turns in a Wordle game. The inputs represent both a 5-letter word, and the colours of all the tiles (green, yellow or grey).

The models also have a dense trunk of 3 hidden layers.

Finally, there is an output embedding vector space, where each output embedding represents a 5-letter word in the model’s vocabulary. Because of the way the training works (phased curriculum), these get added as we go.

Proposal for Recombination Plan

Here is my proposed plan for recombination, with regard to actual genetic crossover…

The first parent to have been chosen for breeding becomes the ‘main’ gene donor.

Level 1

Crossover temperature level 1 is a float in [0,1]. If it is 0.02, which is the default, then…

  • There is a 2% probability that the entire input encoder will come from the parent that did not provide the rest of the genotype;
  • There is a 2% probability that the entire dense trunk will come from the parent that did not provide the rest of the genotype;
  • There is a 2% probability that the entire set of output embeddings will come from the parent that did not provide the rest of the genotype.
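The Level 1 rule can be sketched roughly as follows. This assumes a genotype is a dictionary with one entry per component, and that each of the three swaps is decided independently; the key names and the function name are placeholders, not the project’s actual identifiers.

```python
import copy
import random

# Hypothetical component names for the three parts of the genotype.
COMPONENTS = ("input_encoder", "dense_trunk", "output_embeddings")

def level1_crossover(main, other, temperature=0.02, rng=random):
    """Copy the main donor, swapping in whole components from the other parent.

    Each component is taken from `other` independently with probability
    `temperature`; otherwise it comes from `main`.
    """
    child = {}
    for name in COMPONENTS:
        donor = other if rng.random() < temperature else main
        child[name] = copy.deepcopy(donor[name])  # child never aliases a parent
    return child

# Stand-in genotypes: strings instead of real weight tensors.
main = {name: f"main:{name}" for name in COMPONENTS}
other = {name: f"other:{name}" for name in COMPONENTS}
child = level1_crossover(main, other, rng=random.Random(42))
```

At the default of 0.02, the chance that a child is a pure Level 1 copy of the main donor is 0.98³, or about 94%.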

Level 2

Crossover temperature level 2 is a float in [0,1]. If it is 0.01, which is the default, then…

  • There is a 1% probability that any given layer of neurons in the input encoder will come from the parent that did not provide the rest of the input encoder;
  • There is a 1% probability that any given layer of neurons in the dense trunk will come from the parent that did not provide the rest of the dense trunk;
  • There is a 1% probability that a single random output embedding will be provided by the parent that did not provide the rest of the output embedding vectors.
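A sketch of the Level 2 rules, under my reading of the bullets above: layers are swapped independently, one coin flip per layer, while the output embeddings get a single coin flip that swaps one randomly chosen row. It assumes both parents have the same layer count and the same vocabulary in the same order. The function names are illustrative only.

```python
import random

def swap_layers(main_layers, other_layers, temperature=0.01, rng=random):
    """Each layer independently comes from `other_layers` with probability `temperature`."""
    return [o if rng.random() < temperature else m
            for m, o in zip(main_layers, other_layers)]

def swap_one_embedding(main_emb, other_emb, temperature=0.01, rng=random):
    """With probability `temperature`, replace one randomly chosen embedding row."""
    child = list(main_emb)
    if child and rng.random() < temperature:
        i = rng.randrange(len(child))
        child[i] = other_emb[i]  # same index is assumed to mean the same word
    return child

# Stand-in data: strings instead of real layers or embedding rows.
main_layers = ["m0", "m1", "m2"]
other_layers = ["o0", "o1", "o2"]
trunk = swap_layers(main_layers, other_layers, rng=random.Random(7))
```

The same `swap_layers` helper would serve for both the input encoder and the dense trunk, since the rule is identical for the two.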

Level 3

Crossover temperature level 3 is a float in [0,1]. If it is 0.005, which is the default, then…

  • There is a 0.5% probability that a single neuron in the input encoder will be provided by the parent that did not provide the rest of the relevant layer;
  • There is a 0.5% probability that a single neuron in the dense trunk will be provided by the parent that did not provide the rest of the relevant layer;
  • There is a 0.5% probability that one of the output embeddings will get spliced. When this happens, it is constructed from both parents’ equivalent output embeddings, with a one-point, intra-row crossover.

For these purposes:

  • A neuron is a set of input weights and a bias. It is theoretically possible for two neurons in adjacent layers to be ‘swapped’ with their opposite numbers from the other parent. The order of execution (earlier layers first) will mean that the later layer, towards the dense trunk’s policy head, will ‘win’ with regard to the shared input/output weights.
  • Random numbers don’t need to be deterministic, nor do they need to be cryptographically secure. Any reasonable on-device RNG will do.
  • Splicing an output embedding refers to taking some of its 38 trainable weights from one parent, and some from the other. The single crossover point is chosen randomly within that 38-value space. The 26 hard-coded letter counts should already be the same for both parents, and are not trainable here.
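Here is a rough sketch of the two Level 3 operations using the definitions above. Several details are my assumptions rather than settled design: each neuron gets its own independent coin flip, an embedding row stores its 38 trainable weights first and the 26 letter counts after them, and the main donor supplies the head of a spliced row. All names are illustrative.

```python
import random

TRAINABLE = 38      # trainable weights per output embedding
LETTER_COUNTS = 26  # hard-coded letter counts, identical in both parents

def swap_neurons(main_layer, other_layer, temperature=0.005, rng=random):
    """A layer is a list of neurons; a neuron is an (input_weights, bias) pair.

    Each neuron independently comes from `other_layer` with probability
    `temperature`, keeping its weights and bias together.
    """
    return [o if rng.random() < temperature else m
            for m, o in zip(main_layer, other_layer)]

def splice_embedding(main_row, other_row, rng=random):
    """One-point crossover inside the trainable part of one embedding row."""
    # Cut strictly inside the trainable span so both parents contribute.
    point = rng.randrange(1, TRAINABLE)
    return main_row[:point] + other_row[point:TRAINABLE] + main_row[TRAINABLE:]

# Stand-in rows: 0.0 marks the main donor's weights, 9.0 the other parent's.
main_row = [0.0] * TRAINABLE + [1.0] * LETTER_COUNTS
other_row = [9.0] * TRAINABLE + [1.0] * LETTER_COUNTS
spliced = splice_embedding(main_row, other_row, random.Random(1))
```

The caller would decide whether to splice at all, using the 0.5% probability, before calling `splice_embedding`. The letter counts are copied from the main donor untouched, which is harmless if they really are the same in both parents.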

Summary

This plan provides a cautious ‘genes that work together, stay together’ approach to splicing genotypes. It avoids the problems already seen with scalar crossover. It also avoids being a weird genetic algorithm that uses single-parent asexual breeding. (A GA like that is basically just a hill-climbing algorithm in a tuxedo.)

The design will be used in an experiment, to see if it overcomes the fitness evaluation flatlining problem. Early, dumber versions suggest this is directionally correct.