Past attempts to get computers to ride bicycles have required an inordinate amount of learning time (1,700 practice rides for a reinforcement learning approach [1], while still failing to ride in a straight line), or have required an algebraic analysis of the exact equations of motion for the specific bicycle to be controlled [2, 3]. Mysteriously, humans do not need to do either of these when learning to ride a bicycle.
Here we present a 2-neuron network that can ride a virtual bicycle in a desired direction (for example, towards a desired goal or along a desired path), which may be chosen or changed at runtime.
Just as when a person rides a bicycle, the network is very accurate for long-range goals, but in the short run stability issues dominate the behavior. This happens not by explicit design, but arises as a natural consequence of how the network controls the bicycle.
Figure 2: Instability of an unsteered bicycle. This shows 800 runs of a bicycle being pushed to the right. For each run, the path of the front wheel on the ground is shown until the bicycle has fallen over. The unstable oscillatory nature is due to the subcritical speed of the bicycle, which loses further speed with each oscillation.
(Actually, the title of this paper is unproven. We have not ruled out the possibility that a single neuron could ride a bicycle.)
…In the language of reinforcement learning, such a controller is exactly what you would get after one step of policy iteration, starting from the null policy of never touching the handlebars, with 3 actions available at each step (push left, push right, or no push). If the controller learns the value function for this policy (which in practice would require lots of experience with not touching the handlebars, but which we simulate by giving the controller access to the simulator), it can then act greedily with respect to that value function. This amounts to one step of policy iteration, and at least for the goal of not falling over, an optimal policy is indeed obtained after a single iteration (i.e. it successfully doesn't fall down). However, it does not do this in a conventional way, say by riding in a straight line, but rather manages to maintain stability at near-zero speed by doing stunts with the front wheel, for example by spinning the handlebars in circles (the handlebars and front wheel do not bump into the frame for our bicycle, and there are no cables to get twisted, so why not?). A movie of this bizarre behavior is available.
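To make the one-step-of-policy-iteration idea concrete, here is a minimal sketch. It is not our controller: an inverted pendulum stands in for the bicycle's balance problem, and the actions, horizon, and constants are all illustrative. The value of the null policy is estimated by rolling the simulator forward with zero input, and the controller then acts greedily with respect to that value:

```python
import math

def step(theta, omega, torque, dt=0.02, g=9.8, l=1.0):
    # toy stand-in for the bicycle: an inverted pendulum,
    # unstable at theta = 0 just as an unsteered bicycle is unstable upright
    omega += ((g / l) * math.sin(theta) + torque) * dt
    theta += omega * dt
    return theta, omega

def v_null(theta, omega, horizon=200):
    # value of the null policy ("never touch the handlebars"):
    # steps survived with zero input, found by rolling the simulator forward
    for t in range(horizon):
        if abs(theta) > math.pi / 2:
            return t
        theta, omega = step(theta, omega, 0.0)
    return horizon

def greedy(theta, omega, actions=(-5.0, 0.0, 5.0)):
    # one step of policy iteration: act greedily (push left / no push /
    # push right) with respect to the null policy's value function
    return max(actions, key=lambda a: v_null(*step(theta, omega, a)))

def run(policy, steps=500):
    theta, omega, survived = 0.1, 0.0, 0
    for _ in range(steps):
        if abs(theta) > math.pi / 2:
            break
        theta, omega = step(theta, omega, policy(theta, omega))
        survived += 1
    return survived

null_steps = run(lambda th, om: 0.0)   # the null policy falls quickly
greedy_steps = run(greedy)             # the greedy policy survives longer
```

The greedy controller stabilizes the toy with no reward for forward progress, which mirrors how our bicycle controller preferred stunts over riding in a straight line.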
Despite many attempts at formulating a sensible value function, we found it difficult to get sensible behavior out of the bicycle. By rewarding uprightness, the bicycle would stop riding normally and start doing stunts as described above. If we tried to discourage this by rewarding speed, the bicycle would swoop from side to side, where each swoop results in a temporary increase in speed. If we tried to discourage this by rewarding going in a straight line, the bicycle would do this very nicely, but of course it would fall over right away, as avoiding the fall would have required deviating from the straight line. Of course, one could try weighted combinations of these or other ideas, but then the question starts to be not how long it will take the controller to learn to ride the bicycle, but how long it will take us to learn how to program the controller to get it to ride normally. As has been pointed out by people who have worked with reinforcement learning, it can be a very tricky business trying to pick a good value function.
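As an illustration of this tuning problem, the weighted-combination approach amounts to hand-balancing terms like the ones below. The reward terms and weights are hypothetical, not the ones we tried:

```python
import math

def shaped_reward(tilt, speed, heading_error,
                  w_up=1.0, w_speed=0.2, w_straight=0.5):
    # hypothetical weighted reward combining the three ideas from the text;
    # each term discourages one failure mode (stunts, swooping, falling
    # while going straight) but, badly weighted, re-enables another
    return (w_up * math.cos(tilt)               # reward uprightness
            + w_speed * speed                   # reward speed
            - w_straight * abs(heading_error))  # reward going straight
```

The difficulty is that no single setting of the weights obviously yields normal riding, so the programmer ends up doing the learning.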
In this work I created an environment in which I assumed learning of higher-order concepts would be necessary: a general-purpose physics-based hinged-rigid-body simulator, which I used to simulate a bicycle for a learning agent to learn to ride. But as others have found in other contexts, most environments do not require higher-order concepts, and in this case it turned out that a simple 2-neuron circuit was already sufficient for controlling the bicycle; indeed, it rode better than a human using the keyboard.
The physics simulator was a substantial project in itself: I studied rigid body mechanics (which is more complex than most of us realize) and designed the simulator explicitly so as to simultaneously exactly preserve both angular momentum and kinetic energy, as previous simulations of mine (in quantum mechanics!) had shown me that preserving conserved quantities can be crucial for getting accurate results. The simulator works nicely, and I later read in Sam Buss’s 2001 paper, “Accurate and Efficient Simulation of Rigid Body Rotations”, that my precautions were well warranted.
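The hazard these precautions guard against is easy to reproduce. In the toy below (not our simulator), explicit Euler integration of the torque-free Euler equations lets kinetic energy drift upward even though it is an exact invariant of the true dynamics:

```python
def euler_rates(w, inertia):
    # Euler's equations for a torque-free rigid body in the body frame
    w1, w2, w3 = w
    i1, i2, i3 = inertia
    return ((i2 - i3) * w2 * w3 / i1,
            (i3 - i1) * w3 * w1 / i2,
            (i1 - i2) * w1 * w2 / i3)

def kinetic_energy(w, inertia):
    return 0.5 * sum(i * wi * wi for i, wi in zip(inertia, w))

def integrate_explicit_euler(w, inertia, dt, steps):
    for _ in range(steps):
        rates = euler_rates(w, inertia)
        w = tuple(wi + dt * ri for wi, ri in zip(w, rates))
    return w

inertia = (1.0, 2.0, 3.0)   # an asymmetric body
w0 = (1.0, 0.1, 0.5)        # tumbling initial rotation
e0 = kinetic_energy(w0, inertia)
w_final = integrate_explicit_euler(w0, inertia, dt=0.01, steps=1000)
drift = abs(kinetic_energy(w_final, inertia) - e0) / e0
# drift is strictly positive: for these equations the first-order energy
# error cancels exactly, leaving a positive O(dt^2) gain at every step
```

A simulator that accumulates energy like this will eventually report rotations that no real rigid body could perform, which is why preserving the conserved quantities exactly was worth the extra design effort.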
Later I was able to reduce the controller to just one neuron, and then to an even simpler plain linear feedback system, confirming the finding that many real problems are best solved by a hack. The simulator has been useful in other projects since then.
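As a sketch of what a plain linear feedback controller looks like on a toy stand-in (an inverted pendulum for the lean dynamics; the model and gains are illustrative, not ours), the control signal is nothing more than a weighted sum of lean angle and lean rate:

```python
import math

def rides_upright(k1, k2, theta=0.2, omega=0.0,
                  dt=0.01, steps=2000, g=9.8, l=1.0):
    # plain linear feedback on an inverted-pendulum stand-in for the
    # lean dynamics: the control is a weighted sum of angle and rate
    for _ in range(steps):
        u = -k1 * theta - k2 * omega
        omega += ((g / l) * math.sin(theta) + u) * dt
        theta += omega * dt
        if abs(theta) > math.pi / 2:
            return False        # fell over
    return abs(theta) < 0.05    # settled upright
```

With gains like k1 = 30, k2 = 10 the linearized closed loop is a damped oscillator and the toy stays upright; with zero gains it falls, as in Figure 2.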