Going beyond average for reinforcement learning
Consider the commuter who travels back and forth by train every day. On most mornings her train runs on time and she arrives at her first meeting relaxed, poised, and ready. But she knows that the unexpected does happen: a mechanical problem, a signal failure, or even just an especially rainy day. Invariably these hiccups disrupt her routine, leaving her late, flustered, and confused.
Randomness is something we encounter every day, and it has a profound effect on how we experience the world. The same is true in reinforcement learning (RL) applications: systems that learn by trial and error and are driven by rewards. Typically, an RL algorithm predicts the average reward it will receive from multiple attempts at a task, and uses this prediction to decide how to act. But random perturbations in the environment can alter its behaviour by changing the exact amount of reward the system receives.
In a new paper, we show that it is possible to model not only the average but also the full variation of this reward, what we call the value distribution. The result is RL systems that are more accurate and faster to train than previous models, and, more importantly, the possibility of rethinking the whole of reinforcement learning.
Returning to our frustrated commuter, consider a journey made up of three segments of five minutes each, except that once a week the train breaks down, adding another fifteen minutes to the trip. A simple calculation shows that the average commute time is 3 × 5 + 15 / 5 = 18 minutes.
In reinforcement learning, we use Bellman's equation to predict this average commute time. Specifically, Bellman's equation relates our current average prediction to the average prediction we will make in the immediate future. From the first station, we predict an 18-minute journey (the average total duration); from the second, a 13-minute journey (the average duration minus the first segment's length). Finally, assuming the train hasn't yet broken down, from the third station we predict 8 minutes (13 − 5) remain in our commute, until at last we arrive at our destination. Bellman's equation makes each of these predictions in turn, and updates them on the basis of new information.
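To make this chain of predictions concrete, here is a minimal Python sketch of the Bellman updates for the commute example. It is purely illustrative: placing the breakdown's averaged cost on the final segment is an assumption chosen to match the figures above and the distributional example below.

```python
# Bellman's equation chains the average predictions together:
# V(station) = time of the next segment + V(next station).
SEGMENT = 5              # minutes per segment
N_SEGMENTS = 3
EXPECTED_DELAY = 15 / 5  # 15-minute breakdown, one trip in five

def average_remaining(station: int) -> float:
    """Average minutes left from `station` (0 = first station, 3 = destination)."""
    if station == N_SEGMENTS:
        return 0.0
    # Illustrative assumption: the breakdown, if it happens, occurs on the
    # final segment, so its averaged cost enters the last prediction.
    delay = EXPECTED_DELAY if station == N_SEGMENTS - 1 else 0.0
    return SEGMENT + delay + average_remaining(station + 1)

print([average_remaining(s) for s in range(N_SEGMENTS)])  # [18.0, 13.0, 8.0]
```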
What may seem counterintuitive about Bellman's equation is that we never actually observe these predicted averages: either the train takes 15 minutes (four days out of five) or it takes 30 minutes, never 18! From a purely mathematical standpoint this isn't a problem, because decision theory tells us we only need the averages to make the best choice. As a result, the issue has been mostly ignored in practice. Yet there is now plenty of empirical evidence that predicting averages is a tricky business.
In our new paper, we show that there is in fact a variant of Bellman's equation which predicts all of the possible outcomes, without averaging them. In our example, we maintain two predictions (a distribution) at each station: if the journey goes well, the remaining times are 15, 10, and 5 minutes, respectively; but if the train breaks down, they are 30, 25, and 20 minutes.
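Here is a sketch, under the same toy assumptions as before, of what the distributional variant propagates instead of a single average. The dictionary-of-probabilities representation is purely illustrative.

```python
# Distributional variant of the same example: at each station we keep every
# possible remaining time and its probability, instead of the average alone.
SEGMENT = 5
N_SEGMENTS = 3
DELAY = 15           # extra minutes if the train breaks down
P_BREAKDOWN = 1 / 5  # one breakdown per five trips

def remaining_distribution(station: int) -> dict[float, float]:
    """Map possible remaining minutes -> probability, at a given station."""
    if station == N_SEGMENTS:
        return {0.0: 1.0}
    nxt = remaining_distribution(station + 1)
    dist: dict[float, float] = {}
    for time_left, prob in nxt.items():
        if station == N_SEGMENTS - 1:
            # The breakdown (assumed to strike on the final segment) splits
            # each outcome into a normal branch and a delayed branch.
            dist[SEGMENT + time_left] = dist.get(SEGMENT + time_left, 0.0) + prob * (1 - P_BREAKDOWN)
            key = SEGMENT + DELAY + time_left
            dist[key] = dist.get(key, 0.0) + prob * P_BREAKDOWN
        else:
            dist[SEGMENT + time_left] = dist.get(SEGMENT + time_left, 0.0) + prob
    return dist

for s in range(N_SEGMENTS):
    print(s, remaining_distribution(s))
# 0 {15.0: 0.8, 30.0: 0.2}
# 1 {10.0: 0.8, 25.0: 0.2}
# 2 {5.0: 0.8, 20.0: 0.2}
```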
All of reinforcement learning can be recast within this new perspective, and its application is already yielding surprising new theoretical results. Predicting the distribution over outcomes also opens up all kinds of algorithmic possibilities, such as:
- Disentangling the causes of randomness: once we observe that commute times are bimodal, i.e. they take on two possible values, we can act on this information, for example by checking for train updates before leaving home.
- Distinguishing safe from risky choices: when two options have the same average outcome (for example, walking or taking the train), we may prefer the one with the least variance (walking), as sketched after this list.
- Natural auxiliary predictions: predicting a multitude of outcomes, such as the distribution of commute times, has been shown to help train deep networks faster.
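As a quick illustration of the second point, the sketch below compares the train with a hypothetical steady 18-minute walk; the walking time is an invented figure chosen to match the train's average.

```python
import statistics

# Two options with the same average commute time but very different spread.
train = [15, 15, 15, 15, 30]  # one 30-minute breakdown trip per five
walk = [18] * 5               # hypothetical steady 18-minute walk

for name, outcomes in (("train", train), ("walk", walk)):
    print(name, statistics.mean(outcomes), statistics.pvariance(outcomes))
# Both average 18 minutes, but the train's variance is 36 versus 0 for walking;
# a risk-averse commuter who knows the distributions can prefer to walk.
```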
We put these ideas to work within the Deep Q-Network agent, replacing its single average reward output with a distribution over 51 possible values. The only other change was a new learning rule, reflecting the shift from Bellman's (average) equation to its distributional counterpart. Remarkably, it turns out that going from averages to distributions is all it takes to surpass the performance of all other comparable approaches, and by a wide margin. The graph below shows how we attain 75% of a trained Deep Q-Network's performance in 25% of the time, and achieve substantially better-than-human performance.
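As a rough sketch of what that architectural change amounts to (this is not the paper's code), the value head outputs a 51-atom categorical distribution per action, and greedy action selection uses each distribution's mean. The support range below is an assumption for illustration.

```python
import numpy as np

N_ATOMS = 51                # number of support points, as in the agent described above
V_MIN, V_MAX = -10.0, 10.0  # assumed range of returns covered by the support
support = np.linspace(V_MIN, V_MAX, N_ATOMS)

def action_values(logits: np.ndarray) -> np.ndarray:
    """Convert per-action logits of shape (n_actions, N_ATOMS) to expected values.

    Each row is softmaxed into a probability distribution over the 51 support
    points; the action's value is the mean of that distribution.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ support  # shape: (n_actions,)

# Greedy action selection works as in DQN, but on the distribution means.
# (The new learning rule, which projects the target distribution onto the
# support and minimises a cross-entropy loss, is not shown here.)
rng = np.random.default_rng(0)
example_logits = rng.normal(size=(4, N_ATOMS))  # e.g. 4 actions
print(int(np.argmax(action_values(example_logits))))
```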
One surprising result is that we observe some randomness in Atari 2600 games, even though Stella, the underlying game emulator, is itself fully deterministic. This randomness arises in part from what is called partial observability: because of the emulator's internal programming, our agents playing Pong cannot predict exactly when their score will increase. Visualising the agent's prediction over successive frames, we see two separate outcomes (low and high), reflecting the two possible timings. Although this intrinsic randomness doesn't directly affect performance, the results illustrate the limits of our agents' understanding.
Randomness also arises because the agent's own behaviour is uncertain. In Space Invaders, our agent learns to predict the probability that it will make a mistake in the future and lose the game (zero reward).
Just as in the train journey example, it makes sense to keep separate predictions for these two very different outcomes, rather than aggregating them into an unrealisable average. In fact, we believe our improved results are in large part due to the agent's ability to model its own randomness.
It is already clear from our empirical results that the distributional perspective leads to better, more stable reinforcement learning. With the possibility that every reinforcement learning concept could now have a distributional counterpart, this may be only the beginning for the approach.