Motivation
Linear Genetic Programming (LGP) evolves sequences of register-machine instructions to solve tasks — but assigning registers to actions in reinforcement learning environments has traditionally required manual, domain-specific mappings. This thesis asks whether Q-Learning can automate that process, letting the system learn which register outputs correspond to which actions.
Approach
RLGP layers a Q-Learning agent on top of LGP. The evolutionary process handles program structure — selecting, crossing over, and mutating instruction sequences — while the Q-Learning layer learns to map register states to environment actions during evaluation. The two mechanisms operate at different timescales: evolution across generations, reinforcement learning within each episode.
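The interplay between the two layers can be sketched in a few lines. Everything below is an illustrative assumption made for this example — the instruction format, register layout, hyperparameter names, and discretization scheme are not taken from the thesis's implementation:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # assumed Q-Learning hyperparameters
N_ACTIONS = 2                            # e.g. CartPole: push left / push right

# Assumed LGP instruction set: each instruction is (op, dst, src1, src2).
OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def run_program(program, observation, n_registers=4):
    """Execute an evolved instruction sequence (the evolutionary layer).

    Registers are seeded with the environment observation, padded with zeros.
    """
    regs = list(observation) + [0.0] * (n_registers - len(observation))
    for op, dst, src1, src2 in program:
        regs[dst] = OPS[op](regs[src1], regs[src2])
    return regs

def discretize(registers, bins=4):
    """Bucket continuous register outputs into a hashable Q-table key."""
    return tuple(max(0, min(int(r), bins - 1)) for r in registers)

q_table = defaultdict(lambda: [0.0] * N_ACTIONS)

def select_action(state):
    """Epsilon-greedy choice over the learned register-state values."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    values = q_table[state]
    return values.index(max(values))

def q_update(state, action, reward, next_state):
    """Tabular Q-Learning update, applied at each step within an episode."""
    best_next = max(q_table[next_state])
    q_table[state][action] += ALPHA * (reward + GAMMA * best_next
                                       - q_table[state][action])
```

During evaluation of one individual, the evolved `run_program` produces register outputs at every step, and the Q-Learning layer maps their discretized form to an action — the mapping is learned within the episode while the program itself only changes between generations.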
The system was evaluated on two OpenAI Gym benchmarks:
- CartPole-v1 — balance a pole on a moving cart. LGP alone achieved a mean reward of 454. RLGP learned the task but its mean reward plateaued at 213, suggesting the exploration-exploitation balance needs tuning.
- MountainCar-v0 — drive an underpowered car up a hill. Both approaches struggled with the sparse reward signal, a known challenge for this environment.
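One concrete knob behind the exploration-exploitation balance mentioned for CartPole is the epsilon schedule. A common exponential decay looks like the following — an illustrative sketch, not the schedule used in the thesis:

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponential epsilon decay: explore heavily in early episodes,
    then settle to a small floor of residual exploration."""
    return max(end, start * decay ** episode)
```

Too fast a decay can lock the agent onto an early, mediocre register-action mapping; too slow a decay wastes episodes on random actions — either failure mode would be consistent with an early performance plateau.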
Results
The hybrid approach demonstrates that automated register-action mapping is feasible. LGP’s evolved programs can serve as feature extractors for the Q-Learning layer, and the two learning mechanisms don’t destructively interfere. However, the Q-Learning component introduces additional hyperparameters (learning rate, discount factor, epsilon decay) that interact with LGP’s evolutionary parameters, making the combined search space harder to navigate.
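The combined parameter space discussed above spans both learning mechanisms. A configuration object makes the coupling visible — the specific fields and default values here are illustrative assumptions, not the settings used in the thesis:

```python
from dataclasses import dataclass

@dataclass
class RLGPConfig:
    # Evolutionary (LGP) parameters
    population_size: int = 100
    mutation_rate: float = 0.05
    crossover_rate: float = 0.9
    # Reinforcement-learning (Q-Learning) parameters
    learning_rate: float = 0.1       # alpha
    discount_factor: float = 0.99    # gamma
    epsilon_decay: float = 0.995
```

Because a change to, say, `mutation_rate` alters how quickly the register-producing programs drift, it implicitly changes the effective non-stationarity the Q-Learning parameters must cope with — the two halves cannot be tuned independently.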
Key Takeaway
The early plateau in RLGP’s CartPole performance points to a fundamental tension: the Q-Learning agent needs stable state representations to learn effectively, but evolution continuously changes the programs producing those representations. Freezing the evolutionary process periodically to let the RL layer converge, or using more robust RL algorithms that handle non-stationary environments, are promising directions.



