Dopamine and temporal difference learning – an enlightened relationship between AI and neuroscience
Motivation and learning are guided by external and internal rewards. Many of our everyday behaviours are driven by predicting whether a particular action will lead to a good outcome. The study of how organisms learn from experience to correctly anticipate rewards has been a productive field of research for well over a century, ever since Ivan Pavlov's seminal work on conditioning. In his now-famous experiment, dogs were trained to associate food (a reward) with the sound of a buzzer. The dogs began to salivate as soon as they heard the buzzer, before any food arrived, showing that they had learned to predict the upcoming reward. In the original studies, Pavlov estimated the animals' anticipation by measuring the volume of saliva they produced. In recent decades, however, researchers have begun to decipher how the brain itself learns these expectations. In close parallel with this research on reward learning in animals, computer scientists have developed reinforcement learning algorithms for artificial systems. These algorithms allow AI systems to learn complex strategies without external instruction, guided instead by reward predictions.
A recent paper published in Nature shows that one of the latest advances in computer science, which yields substantial performance improvements on reinforcement learning problems, may also provide a deep, parsimonious explanation for several previously unexplained features of reward learning in the brain. It opens new avenues for studying the brain's dopamine system, with potential implications for disorders of learning and motivation.
Reinforcement learning is one of the oldest and most influential ideas connecting neuroscience and artificial intelligence. In the late 1980s, computer scientists were trying to develop algorithms that could learn to perform complex behaviours on their own, using only rewards and punishments as guidance. Rewards serve to reinforce whatever behaviours led to them. Solving such a problem requires understanding how present actions influence future rewards: for instance, a student might learn through reinforcement that studying for an examination leads to better scores on tests and assessments. Predicting the total future reward that will result from an action often requires reasoning many steps into the future.
A breakthrough in solving the reward prediction problem was the temporal difference (TD) learning algorithm. TD uses a mathematical trick to replace complex reasoning about the future with a very simple learning procedure that produces the same results. The trick is this: instead of trying to compute the total future reward, TD simply tries to predict the sum of the immediate reward and its own reward prediction at the next moment in time. Then, when the next moment arrives with new information, the new prediction is compared against what it was expected to be. If they differ, the algorithm computes how different they are and uses this temporal difference to nudge the old prediction toward the new one. By constantly bringing these numbers closer together at each moment in time – matching expectations to reality – the entire chain of predictions gradually becomes more accurate.
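To make the idea concrete, here is a minimal sketch of tabular TD learning in Python. The tiny two-state chain, the reward of 1 at the end, and the learning-rate and discount values are invented purely for illustration; they are not taken from the paper.

```python
# Minimal sketch of tabular TD(0) learning on a tiny, made-up Markov chain.
# States "A" -> "B" -> "end"; a reward of 1 arrives only on the final step.
ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.9   # discount factor (illustrative value)

values = {"A": 0.0, "B": 0.0, "end": 0.0}

def td_update(state, reward, next_state):
    """Move V(state) toward reward + GAMMA * V(next_state)."""
    target = reward + GAMMA * values[next_state]
    td_error = target - values[state]          # the "temporal difference"
    values[state] += ALPHA * td_error
    return td_error

for episode in range(500):
    # One pass through the chain: A -> B (no reward), then B -> end (reward 1).
    td_update("A", 0.0, "B")
    td_update("B", 1.0, "end")

print(values)  # V(B) approaches 1.0 and V(A) approaches GAMMA * V(B)
```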
Around the same time, in the late 1980s and early 1990s, neuroscientists were struggling to understand the behaviour of dopamine neurons. Dopamine neurons are clustered in the midbrain, but send projections to many brain regions, potentially broadcasting some globally relevant message. It was clear that the firing of these neurons had something to do with reward, but their responses also depended on sensory input and changed as the animals became more experienced at a given task.
Fortunately, some scientists were versed in the latest developments in both neuroscience and AI. In the mid-1990s, they noticed that the responses of some dopamine neurons represented reward prediction errors: their firing signalled when an animal received more reward, or less reward, than it had been trained to expect. These researchers proposed that the brain uses a TD learning algorithm: a reward prediction error is computed, broadcast to the brain via the dopamine signal, and used to drive learning. Since then, the reward prediction error theory of dopamine has been tested and validated in thousands of experiments, and has become one of the most successful quantitative theories in neuroscience.
Computer scientists have continued to improve algorithms for learning from rewards and punishments. Since 2013, the focus has been on deep reinforcement learning: using deep neural networks to learn powerful representations for reinforcement learning. This has enabled reinforcement learning algorithms to solve far more sophisticated and useful problems.
One of the algorithmic developments that has made reinforcement learning work better with neural networks is distributional reinforcement learning. In many situations, especially in the real world, the amount of future reward that will result from a particular action is not a perfectly known quantity; it involves some degree of randomness.
Imagine a computer-controlled avatar that has been trained to traverse an obstacle course and is jumping across a gap. The agent cannot be sure whether it will fall or reach the other side. The distribution of predicted rewards therefore has two bumps: one representing the possibility of falling, and one representing the possibility of successfully reaching the other side.
In such situations, a standard TD algorithm learns to predict the future reward that will be received on average, which in this case fails to capture the two-peaked distribution of possible returns. A distributional reinforcement learning algorithm, by contrast, learns to predict the full spectrum of future rewards.
One of the simplest distributional reinforcement learning algorithms is closely related to standard TD and is known as distributional TD. Whereas standard TD learns a single prediction, the average expected reward, a distributional TD network learns a set of distinct predictions. Each prediction is learned in the same way as in standard TD, by computing a reward prediction error that describes the difference between consecutive predictions. The crucial ingredient is that each predictor applies a different transformation to its reward prediction errors. Some predictors selectively "amplify" or "overweight" their reward prediction errors (RPEs) when the error is positive. This causes the predictor to learn a more optimistic reward prediction, corresponding to a higher part of the reward distribution. Other predictors amplify their negative reward prediction errors, and therefore learn more pessimistic predictions. Together, a set of predictors spanning a range of optimistic and pessimistic weightings maps out the full reward distribution.
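As a rough illustration of this mechanism (not the exact model used in the paper), the sketch below trains a small bank of predictors on a single state with a bimodal reward, like the gap-jumping example above. The reward values, the asymmetry settings and the step size are all assumptions made for the example.

```python
import random

# Sketch: a bank of value predictors for one state, each scaling positive and
# negative reward prediction errors differently (asymmetry tau in (0, 1)).
STEP = 0.02
TAUS = [0.1, 0.3, 0.5, 0.7, 0.9]          # pessimistic ... optimistic
predictions = [0.0 for _ in TAUS]

def sample_reward():
    # Bimodal outcome, like the gap-jumping example: fall (0) or succeed (10).
    return 0.0 if random.random() < 0.5 else 10.0

for _ in range(20000):
    r = sample_reward()
    for i, tau in enumerate(TAUS):
        rpe = r - predictions[i]
        # Overweight positive errors for large tau, negative errors for small tau.
        weight = tau if rpe > 0 else (1.0 - tau)
        predictions[i] += STEP * weight * rpe

# Pessimistic predictors settle toward the low outcome, optimistic ones toward
# the high outcome; together they trace out the reward distribution.
print([round(v, 2) for v in predictions])
```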
Apart from its simplicity, another advantage of distributional reinforcement learning is that it works very well in combination with deep neural networks. Over the past five years there has been a great deal of progress in algorithms built on the original deep reinforcement learning agent, DQN, and these are routinely evaluated on the Atari57 benchmark of Atari 2600 games.
Why are distributional reinforcement learning algorithms so effective? Although this is still an active area of research, one key ingredient is that learning about the distribution of rewards gives the neural network a richer signal for shaping its representation, in a way that is robust to changes in the environment or the policy.
Because distributional TD is so powerful in artificial neural networks, a natural question arises: is distributional TD used in the brain? This was the central question behind the Nature paper mentioned above.
That paper was a collaboration between DeepMind and an experimental laboratory at Harvard University, which analysed its recordings of dopamine cells in mice. The recordings were made while the mice performed a well-learned task in which they received rewards of unpredictable magnitude, and the activity of the dopamine neurons was assessed to see whether it was more consistent with standard TD or with distributional TD.
As described above, distributional TD relies on a set of distinct reward predictions. So the first question is whether such genuinely diverse reward predictions can be found in the neural data.
From previous work, we know that dopamine cells change their firing rate to signal a prediction error, that is, when an animal receives more or less reward than it expected. When the reward received is exactly the size a cell predicted, there should be no prediction error and hence no change in firing rate. For each dopamine cell, the researchers determined the reward size for which the cell did not change its baseline firing rate; this is called the cell's reversal point. They then asked whether these reversal points differed across cells. They found substantial differences, with some cells predicting very large amounts of reward and others predicting very little, beyond the degree of variation expected from the random variability inherent in the recordings.
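One simple way such a reversal point could be estimated is to find the reward size at which a cell's response crosses zero, for example by linear interpolation. The sketch below does this on made-up response values; the fitting procedure actually used in the paper may differ.

```python
import numpy as np

# Made-up example: average change in firing rate (relative to baseline) of one
# dopamine cell for each delivered reward size. Negative responses mean the
# reward was smaller than the cell predicted, positive means larger.
reward_sizes = np.array([0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0])
mean_responses = np.array([-4.1, -3.2, -1.8, -0.4, 1.1, 2.9, 4.6])

def reversal_point(sizes, responses):
    """Reward size at which the response crosses zero (linear interpolation)."""
    i = np.where(np.diff(np.sign(responses)) > 0)[0][0]
    x0, x1 = sizes[i], sizes[i + 1]
    y0, y1 = responses[i], responses[i + 1]
    return x0 - y0 * (x1 - x0) / (y1 - y0)

print(reversal_point(reward_sizes, mean_responses))  # about 3.2 units of reward
```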
In distributional TD, these differences in reward prediction arise from selective amplification of positive or negative reward prediction errors. Amplifying positive reward prediction errors leads to more optimistic reward predictions; amplifying negative reward prediction errors leads to pessimistic ones. The researchers therefore quantified the degree to which different dopamine cells showed different relative amplification of positive versus negative errors. Across cells they found reliable diversity which, again, could not be explained by noise. Crucially, the same cells that amplified their positive reward prediction errors also had higher reversal points, meaning they were apparently tuned to expect larger amounts of reward.
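A cell's relative amplification of positive versus negative errors can be summarised by comparing the slope of its response above its reversal point with the slope below it. The sketch below computes such an asymmetry index for the same made-up cell as in the previous sketch; this is a simplification of the analysis in the paper.

```python
import numpy as np

# Same made-up cell as in the previous sketch.
reward_sizes = np.array([0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0])
mean_responses = np.array([-4.1, -3.2, -1.8, -0.4, 1.1, 2.9, 4.6])
reversal = 3.2  # reversal point estimated in the previous sketch

def asymmetry(sizes, responses, reversal):
    """Ratio alpha+ / (alpha+ + alpha-), where alpha+ and alpha- are the
    response slopes for rewards above and below the cell's reversal point."""
    above = sizes > reversal
    alpha_pos = np.polyfit(sizes[above], responses[above], 1)[0]
    alpha_neg = np.polyfit(sizes[~above], responses[~above], 1)[0]
    return alpha_pos / (alpha_pos + alpha_neg)

# Values above 0.5 would indicate an optimistic cell (positive errors
# amplified); values below 0.5 a pessimistic one.
print(asymmetry(reward_sizes, mean_responses, reversal))
```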
Finally, distributional TD theory predicts that, across cells, these diverse reversal points and diverse asymmetries should together encode the learned reward distribution. This led to the final question: could the reward distribution be decoded from the firing rates of dopamine cells? It turned out that it could. Using only the firing rates of dopamine cells, the researchers reconstructed a reward distribution that closely matched the actual distribution of rewards in the task the mice were performing. The reconstruction relied on interpreting the firing rates of dopamine cells as the reward prediction errors of a distributional TD model, and performing inference to determine what distribution the model had learned.
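To give a feel for how such decoding could work, the sketch below treats each cell's reversal point as the point in the reward distribution implied by its asymmetry, and adjusts a set of candidate reward samples until all of those conditions are approximately satisfied. The population summary numbers are made up, and the procedure is a simplified stand-in for the inference used in the paper.

```python
import numpy as np

# Made-up population summary: each dopamine cell contributes an asymmetry
# tau_i and a reversal point e_i, interpreted here as the tau_i-expectile of
# the reward distribution the animal has learned.
taus = np.array([0.1, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9])
reversal_points = np.array([1.2, 2.0, 3.1, 4.0, 5.2, 6.8, 8.9])

def violation(samples, tau, e):
    """Zero when e is exactly the tau-expectile of the samples."""
    return np.mean(tau * np.maximum(samples - e, 0.0)
                   - (1.0 - tau) * np.maximum(e - samples, 0.0))

# Decode by nudging a set of candidate reward samples until every cell's
# condition is (approximately) satisfied, via simple gradient descent.
z = np.linspace(0.0, 10.0, 30)          # initial guess for the samples
for _ in range(5000):
    grad = np.zeros_like(z)
    for tau, e in zip(taus, reversal_points):
        g = violation(z, tau, e)
        weight = np.where(z > e, tau, 1.0 - tau) / z.size
        grad += 2.0 * g * weight        # gradient of g**2 with respect to each sample
    z -= 0.5 * grad

# A histogram of `z` now approximates the decoded reward distribution.
print(np.percentile(z, [10, 50, 90]))
```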
Conclusion
In summary, the study found that dopamine neurons in the brain are each tuned to different levels of optimism or pessimism. If they were a choir, they would not all be singing the same note, but harmonising, each with a consistent vocal register, like bass and soprano singers. In artificial reinforcement learning systems, this diverse tuning creates a richer training signal that greatly speeds up learning in neural networks, and the researchers speculate that the brain might use it for the same reason.
The existence of distributional reinforcement learning in the brain has fascinating implications for both artificial intelligence and neuroscience. First, this discovery validates distributional reinforcement learning: it gives us greater confidence that AI research is on the right track, since this algorithm is already in use in the most intelligent system we know of, the brain.
Second, it raises new questions for neuroscience, and new insights for understanding mental health and motivation. What happens if a person's brain listens selectively to optimistic versus pessimistic dopamine neurons? Could this contribute to impulsivity, or to depression? One of the brain's strengths is its powerful representations; how are these shaped by distributional learning? Once an animal has learned about the distribution of rewards, how is that representation used downstream? How does the variability in optimism across dopamine cells relate to other known forms of diversity in the brain?
Finally, the hope is that asking and answering these questions will stimulate advances in neuroscience that feed back to benefit AI research, completing the virtuous circle.