Agent57 – beating the Atari benchmark
The Atari57 suite of games is a long-standing benchmark for evaluating agent performance across a broad range of tasks. Agent57 is the first deep reinforcement learning agent to surpass the human benchmark on all 57 Atari 2600 games, the classic console of yesteryear. Agent57 combines an algorithm for efficient exploration with a meta-controller that adapts the exploration behaviour and the short- versus long-term trade-offs of the agent.
How do we measure Artificial General Intelligence?
Research on artificial general intelligence has long focused on developing agents that perform well across a broad range of tasks. An agent that performs adequately on a sufficiently broad set of tasks is considered intelligent. Games are an excellent testing ground for building adaptive algorithms: they provide a rich suite of tasks which players must develop sophisticated behavioural strategies to master, and they also provide an easy progress metric – game score – to optimise against. The ultimate goal is not to build systems that excel at games, but to use games as a stepping stone for developing systems that learn to excel at a broad set of challenges. Human performance is usually taken as the baseline for what adequate performance on a task means: the score an agent achieves on each task can be measured relative to human performance, giving a human-normalised score. A score of 0% means the agent performs at the level of random play, whereas 100% or above means it performs at or above human level.
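As a concrete illustration, the human-normalised score can be computed from three raw scores per game – the agent's, a random player's, and a human's. The numbers below are invented purely for illustration:

```python
def human_normalised_score(agent, rand, human):
    """0% = random-level play, 100% = human-level play."""
    return 100.0 * (agent - rand) / (human - rand)

# Illustrative (invented) raw scores for a single game:
print(human_normalised_score(agent=8000.0, rand=200.0, human=4200.0))  # 195.0
```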
In 2012, the Arcade Learning Environment – a suite of 57 Atari 2600 games, referred to as Atari57 – was proposed as a benchmark set of tasks: these canonical Atari games pose a broad range of challenges for an agent to master. The research community commonly uses this benchmark to measure progress in building successively more intelligent agents. It is often convenient to summarise an agent's performance across a broad range of tasks as a single number, so the mean (or sometimes the median) score on the Atari57 benchmark is typically used to summarise an agent's abilities. Average scores have steadily increased over time. Unfortunately, the average cannot capture how many tasks an agent performs well on, and so is a poor statistic for judging how general an agent is: it shows that an agent is doing sufficiently well, but not that it is doing well on a sufficiently broad set of tasks. So even though average scores have improved, until now the number of above-human games has not.
As an illustrative example, consider a benchmark comprising twenty tasks. Suppose Agent A scores 500% on eight tasks, 200% on four tasks, and 0% on eight tasks (mean = 240%, median = 200%), while Agent B scores 150% on all tasks (mean = median = 150%). On average, Agent A performs better than Agent B. However, Agent B has a more general ability: it attains above-human performance on more tasks than Agent A.
This problem is exacerbated when some tasks are much easier than others. By excelling at the very easy tasks, Agent A can appear to outperform Agent B, which performs well on both easy and hard tasks.
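The two hypothetical agents above can be checked in a few lines; the scores are exactly those of the example:

```python
from statistics import mean, median

agent_a = [500] * 8 + [200] * 4 + [0] * 8  # excels on some tasks, fails on others
agent_b = [150] * 20                       # above human (>= 100%) on every task

def summary(scores):
    above_human = sum(s >= 100 for s in scores)
    return mean(scores), median(scores), above_human

# Agent A: mean 240%, median 200%, but above human on only 12 of 20 tasks.
# Agent B: mean = median = 150%, above human on all 20 tasks.
print(summary(agent_a))
print(summary(agent_b))
```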
Researchers have concentrated on maximising agents' average performance on the Atari57 benchmark ever since it was introduced, and average performance has improved considerably over the past eight years. However, as in the example above, not all games are created equal, with some titles much easier than others. Rather than examining average performance alone, if we look at agents' performance on the bottom 5% of games, we see that little has changed since 2012; in fact, agents released in 2019 were still struggling with the same games that agents released in 2012 struggled with. Agent57 changes this, and is a more general agent on Atari57 than any agent since the benchmark's inception. Agent57 finally obtains above-human performance on the very hardest games in the benchmark, as well as the easiest ones.
In 2013, DeepMind introduced the Deep Q-network agent (DQN) to tackle the Atari57 suite. Since then, the research community has developed many extensions and alternatives to DQN. Despite this progress, however, all deep reinforcement learning agents had consistently failed to score in four games: Montezuma's Revenge, Pitfall, Solaris, and Skiing.
The former two games require extensive exploration to achieve good performance. A core dilemma in learning is the exploration-exploitation problem: should an agent keep repeating actions it knows work well in the game – exploit – or should it try something new – explore – to discover strategies that might be even better? For example, should you keep ordering your favourite drink at a bar, or try something new that might supersede your old favourite? Exploration involves taking many suboptimal actions to gather the data needed to discover an ultimately stronger behaviour.
The latter two games are long-term credit-assignment problems: in these games, it is hard to match the consequences of an agent's actions to the rewards it receives. Agents must collect information over long timescales to get the feedback needed to learn.
For Agent57 to tackle these four challenging games, as well as the rest of the Atari57 benchmark, several changes to DQN were needed.
Early improvements to DQN enhanced its learning efficiency and stability, including double DQN, prioritised experience replay, and the dueling network architecture. These changes allowed agents to make more efficient and effective use of their experience.
Next, researchers introduced distributed variants of DQN, Gorila DQN and Ape-X, which could be run on many computers simultaneously. This allowed agents to acquire and learn from experience more quickly, letting researchers iterate on ideas rapidly. Agent57 is also a distributed reinforcement learning agent that decouples data collection from the learning process.
Many actors interact with independent copies of the environment, feeding data to a central memory bank in the form of a prioritised replay buffer. A learner then samples training data from this replay buffer, much as a person might recall memories in order to learn from them. The learner uses these replayed experiences to construct loss functions, with which it estimates the cost of actions or events. It then updates its neural network parameters by minimising those losses. Finally, each actor shares the same network architecture as the learner, but with its own copy of the weights. The learner's weights are sent to the actors frequently, allowing them to update their own weights in a manner determined by their individual priorities.
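A toy sketch of this actor/learner split around a prioritised replay buffer follows; the class, the priority scheme, and the numbers are illustrative assumptions, not the actual Agent57 implementation:

```python
import random

class PrioritizedReplay:
    """Toy central replay buffer: actors add transitions with priorities,
    the learner samples in proportion to priority and writes new ones back."""
    def __init__(self):
        self.items, self.priorities = [], []

    def add(self, transition, priority):
        self.items.append(transition)
        self.priorities.append(priority)

    def sample(self, k):
        idx = random.choices(range(len(self.items)), weights=self.priorities, k=k)
        return idx, [self.items[i] for i in idx]

    def update_priorities(self, idx, new_priorities):
        for i, p in zip(idx, new_priorities):
            self.priorities[i] = p

buffer = PrioritizedReplay()
for t in range(100):                 # actors feeding the shared buffer
    buffer.add(("observation", t), priority=1.0)
idx, batch = buffer.sample(32)       # the learner draws a training batch
buffer.update_priorities(idx, [2.0] * len(idx))  # e.g. new TD-error magnitudes
```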
Agents need memory in order to take past observations into account in their decision making. This allows the agent to base its decisions not only on the current observation – which is usually partial, meaning the agent sees only part of its world – but also on past observations, which can reveal more information about the environment as a whole. Imagine, for example, a task where an agent moves from room to room in order to count the number of chairs in a building. Without memory, the agent can rely only on its observation of a single room. With memory, the agent can remember the number of chairs in previous rooms and simply add the chairs it sees in the current room to solve the task. The role of memory is therefore to aggregate information from past observations to improve decision making. In deep reinforcement learning, recurrent neural networks such as Long Short-Term Memory (LSTM) networks serve as short-term memory.
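The chair-counting example can be sketched directly, with a running total standing in for the recurrent state an LSTM would carry:

```python
def act_with_memory(observations):
    """Memory (here just a running total) aggregates past observations, so each
    decision uses more than the current, partial view of a single room."""
    memory = 0
    answers = []
    for obs in observations:    # obs = chairs visible in the current room
        memory += obs           # recurrent state update
        answers.append(memory)  # answer based on memory, not just obs
    return answers

print(act_with_memory([3, 0, 2, 5]))  # [3, 3, 5, 10]
```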
Coupling memory with behaviour is critical for building systems that learn autonomously. In reinforcement learning, an agent can be an on-policy learner, which can only learn the value of the behaviour it is actually executing, or an off-policy learner, which can learn about optimal behaviours even while not executing them – for instance, it might be taking random actions, yet still learn what the best possible action would be. Off-policy learning is therefore a desirable property for agents, allowing them to learn the best course of action while thoroughly exploring their environment. Combining off-policy learning with memory is challenging, because you need to know what you might remember when executing a different behaviour. For example, what you might choose to remember when looking for an apple – where the apple is – differs from what you might choose to remember when looking for an orange. Yet if you were looking for an orange, you could still learn how to find the apple if you happened upon it by chance, in case you need to look for it in the future. The first deep reinforcement learning agent to combine memory and off-policy learning was the Deep Recurrent Q-Network (DRQN).
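As a minimal illustration of off-policy learning, a tabular Q-learning update bootstraps from the best next action, regardless of which action the behaviour policy actually took. States, actions, and values below are invented:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target bootstraps from the *best* next action
    (the max), not from the action the behaviour policy actually took."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Tiny invented example: 2 states, 2 actions, tabular values.
Q = [[0.0, 0.0], [0.0, 1.0]]
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
# Even if action a was chosen at random, the target uses max(Q[1]) = 1.0.
```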
More recently, a significant branching in Agent57's lineage came with Recurrent Replay Distributed DQN (R2D2), which combined a neural-network model of short-term memory with off-policy learning and distributed training, achieving a very strong mean performance on Atari57. R2D2 modifies the replay mechanism for learning from past experiences so that it works with short-term memory. Taken together, this allowed R2D2 to efficiently learn profitable behaviours and exploit them for reward.
Never Give Up (NGU) was built to augment R2D2 with another form of memory: episodic memory. This lets NGU detect when new parts of a game are encountered, so the agent can explore these newer parts in case they yield rewards. This makes the agent's exploratory behaviour deviate considerably from the policy it is trying to learn, namely achieving a high game score; off-policy learning therefore once again plays a vital role here. NGU was the first agent to obtain positive rewards, without domain knowledge, on Pitfall – a game on which no agent had scored any points since the inception of the Atari57 benchmark – and on other challenging games. Unfortunately, NGU sacrifices performance on what have historically been the easier games, and so, on average, underperforms R2D2.
To discover the most effective strategies, agents must explore their environment – but some exploration strategies are more efficient than others. With DQN, researchers tried to address the exploration problem with an undirected exploration strategy known as epsilon-greedy: with a fixed probability (epsilon), take a random action; otherwise pick the current best action. However, this family of strategies does not scale well to hard exploration problems: in the absence of rewards, they require a prohibitive amount of time to explore large state-action spaces, as they rely on undirected random action choices to discover unseen states. To overcome this limitation, many directed exploration strategies have been proposed. Among these, one strand has focused on designing intrinsic motivation rewards that encourage agents to explore and visit as many states as possible by providing denser “internal” rewards for novelty-seeking behaviours. Within that strand, we distinguish two types of reward: first, long-term novelty rewards encourage visiting many states throughout training, across many episodes. Second, short-term novelty rewards encourage visiting many states over a short span of time – for example, within a single episode of a game.
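Epsilon-greedy itself is only a few lines; here is a sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action; otherwise exploit the
    action with the highest current value estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```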
Long-term novelty rewards signal when a previously unseen state is encountered during the agent's lifetime, and are a function of the density of states seen so far in training: that is, the reward is adjusted by how often the agent has seen states similar to the current one, relative to all states seen so far. When the density is high (indicating a familiar state), the long-term novelty reward is low, and vice versa. When all states are familiar, the agent falls back on an undirected exploration strategy. However, learning density models of high-dimensional spaces is fraught with problems owing to the curse of dimensionality. In practice, when agents use deep learning models to learn a density model, they suffer from catastrophic forgetting (forgetting information seen earlier as they encounter new experiences), as well as an inability to produce precise outputs for all inputs. For example, in Montezuma's Revenge, unlike undirected exploration strategies, long-term novelty rewards allow the agent to surpass the human baseline. But even the best-performing methods on Montezuma's Revenge need to carefully train a density model at the right speed: when the density model indicates that the states in the first room are familiar, the agent should be able to consistently reach unfamiliar territory.
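A count-based stand-in for a learned density model captures the idea (real agents use a parametric density model or Random Network Distillation over pixel observations, not table look-ups):

```python
import math
from collections import defaultdict

class LongTermNovelty:
    """Count-based stand-in for a learned density model: the reward shrinks
    as a state becomes familiar over the agent's whole lifetime."""
    def __init__(self):
        self.counts = defaultdict(int)

    def reward(self, state):
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])

novelty = LongTermNovelty()
print(novelty.reward("first room"))  # 1.0 on the first visit; decays thereafter
```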
Short-term novelty rewards can be used to encourage an agent to explore states it has not encountered in its recent history. Recently, neural networks that mimic some properties of episodic memory have been used to speed up learning in reinforcement learning agents. Since episodic memories are also thought to be important for recognising novel experiences, these models were adapted to give Never Give Up a notion of short-term novelty. Episodic memory models are efficient and reliable candidates for computing short-term novelty rewards, as they quickly learn a non-parametric density model that can be adapted on the fly (without needing to learn or adjust model parameters). In this case, the magnitude of the reward is determined by measuring the distance between the current state and previous states stored in episodic memory.
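A toy version of this episodic novelty reward follows, using scalar “embeddings” and nearest-neighbour distances; the embedding, the value of k, and the reward formula are illustrative simplifications of the real agent, which embeds observations with a learned network and clears the memory every episode:

```python
import math

class EpisodicNovelty:
    """Non-parametric episodic memory: the reward grows with the distance from
    the current state to its nearest neighbours in this episode's memory."""
    def __init__(self, k=3):
        self.memory, self.k = [], k

    def reward(self, embedding):
        if self.memory:
            # Mean distance to the k nearest states seen this episode.
            dists = sorted(abs(embedding - m) for m in self.memory)[: self.k]
            r = math.sqrt(sum(dists) / len(dists))
        else:
            r = 1.0                    # the first state of an episode is novel
        self.memory.append(embedding)  # memory is cleared at episode end
        return r
```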
However, not all notions of distance encourage meaningful forms of exploration. For example, consider the task of navigating a busy city with many pedestrians and vehicles. If an agent is programmed to use a notion of distance in which every tiny visual variation counts, it would visit a large number of different states merely by passively observing the environment, even while standing still – a fruitless form of exploration. To avoid this, the agent should instead learn features that matter for exploration, such as controllability, and compute distances with respect to those features only. Such models had previously been used for exploration, and combining them with episodic memory is one of the main advances of the Never Give Up exploration strategy, which produced above-human performance in Pitfall.
Never Give Up (NGU) used this short-term novelty reward, based on controllable states, combined with a long-term novelty reward computed with Random Network Distillation. The two were combined by multiplying the rewards, with the long-term novelty bounded. In this way the short-term novelty reward's effect is preserved, but it can be down-modulated as the agent becomes more familiar with the game over its lifetime. The other key idea in NGU is that it learns a family of policies ranging from purely exploitative to highly exploratory. This is achieved through its distributed setup: building on top of R2D2, actors produce experience with different policies based on different importance weightings of the total novelty reward. This experience is produced uniformly with respect to each weighting in the family.
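The multiplicative combination can be written down directly; the clip range [1, L], the value of L, and beta (the per-policy exploration weight) are the knobs described above, with the values here invented for illustration:

```python
def intrinsic_reward(r_episodic, alpha_longterm, L=5.0):
    """NGU-style mix: short-term (episodic) novelty modulated by a long-term
    novelty multiplier clipped to [1, L]."""
    return r_episodic * min(max(alpha_longterm, 1.0), L)

def total_reward(r_extrinsic, r_episodic, alpha_longterm, beta):
    """beta controls how exploratory a given policy in the family is."""
    return r_extrinsic + beta * intrinsic_reward(r_episodic, alpha_longterm)

# Familiar over the lifetime: the multiplier floors at 1, so the
# short-term reward passes through unchanged.
print(intrinsic_reward(0.5, 0.2))   # 0.5
# Very novel over the lifetime: the multiplier is capped at L = 5.
print(intrinsic_reward(0.5, 10.0))  # 2.5
```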
Agent57 is built on the following observation: what if an agent could learn when it is better to exploit and when it is better to explore? The notion of a meta-controller was introduced that adapts the exploration-exploitation trade-off, as well as a time horizon that can be adjusted for games requiring longer temporal credit assignment. With this change, Agent57 is able to get the best of both worlds: above-human performance on both easy games and hard ones.
In particular, intrinsic motivation strategies have two pitfalls (pun intended):
- Exploration: Many games are amenable to policies that are purely exploitative, especially once a game has been fully explored. This implies that much of the experience produced by exploratory policies in Never Give Up will eventually become wasteful once the agent has discovered all the relevant states.
- Time horizon: Some tasks require long time horizons (for example, Skiing and Solaris), where valuing rewards that arrive far in the future may be essential for eventually learning a good exploitative policy, or even for learning a good policy at all. At the same time, other tasks can be slow and unstable to learn if future rewards are weighted too heavily. This trade-off is commonly handled by the discount factor in reinforcement learning: a higher discount factor enables learning over longer time horizons.
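The link between discount factor and time horizon can be made concrete: a reward k steps ahead is weighted by gamma to the power k, giving an effective horizon of roughly 1/(1 - gamma):

```python
def effective_horizon(gamma):
    """A reward k steps ahead is weighted by gamma**k, so rewards further than
    about 1/(1 - gamma) steps contribute little to the value estimate."""
    return 1.0 / (1.0 - gamma)

print(effective_horizon(0.99))    # roughly 100 steps
print(effective_horizon(0.9997))  # thousands of steps, for long-horizon games
```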
This motivated the use of an online adaptation mechanism that controls the amount of experience produced with different policies, with a variable-length time horizon and varying importance assigned to novelty. Researchers have tackled this with a variety of methods, including training a population of agents with different hyperparameter values, directly learning the hyperparameter values by gradient descent, and using a centralised bandit to learn the value of the hyperparameters.
A bandit algorithm was used to select which policy the agent should use to generate experience. Specifically, a sliding-window UCB bandit was trained for each actor to choose the degree of preference for exploration and the time horizon its policy should have.
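A sketch of a sliding-window UCB bandit for this purpose follows; the window size, exploration constant, and arm encoding are illustrative assumptions, not the paper's settings:

```python
import math
from collections import deque

class SlidingWindowUCB:
    """Sliding-window UCB sketch: each arm is one exploration/horizon setting,
    and only the last `window` outcomes count, so the controller can track
    which setting currently pays off."""
    def __init__(self, n_arms, window=90, c=1.0):
        self.history = deque(maxlen=window)   # (arm, reward) pairs
        self.n_arms, self.c = n_arms, c

    def select(self):
        t = max(len(self.history), 1)
        best_arm, best_score = 0, float("-inf")
        for arm in range(self.n_arms):
            rewards = [r for a, r in self.history if a == arm]
            if not rewards:
                return arm                    # try every arm at least once
            score = sum(rewards) / len(rewards) + self.c * math.sqrt(
                math.log(t) / len(rewards))
            if score > best_score:
                best_arm, best_score = arm, score
        return best_arm

    def update(self, arm, reward):
        self.history.append((arm, reward))
```

Each actor would call `select()` to pick its exploration/horizon setting for an episode, then report the episode return via `update()`.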
To build Agent57, the previous exploration agent, Never Give Up, was combined with a meta-controller. The agent computes a mixture of long- and short-term intrinsic motivation to explore and learn a family of policies, where the choice of policy is made by the meta-controller. The meta-controller allows each actor of the agent to choose a different trade-off between near- and long-term performance, and between exploring new states and exploiting what is already known. Reinforcement learning is a feedback loop: the actions chosen determine the training data. The meta-controller therefore also determines what data the agent learns from.
With Agent57, there has been success in building a more generally intelligent agent that achieves above-human performance on all tasks in the Atari57 benchmark. It builds on the previous agent, Never Give Up, and instantiates an adaptive meta-controller that helps the agent know when to explore and when to exploit, as well as what time horizon it would be useful to learn with. A broad range of tasks naturally requires different choices of both of these trade-offs, so the meta-controller provides a way to adapt such choices dynamically.
Agent57 was able to scale with increasing amounts of computation: the longer it trained, the higher its score climbed. While this enabled Agent57 to achieve strong general performance, it takes a great deal of computation and time, and data efficiency can certainly be improved. In addition, this agent shows better 5th-percentile performance on the set of Atari57 games. This is by no means the end of Atari research, not only in terms of data efficiency but also in terms of general performance. There are two views on this: first, analysing performance across percentiles gives us new insights into how general an algorithm is. While Agent57 achieves strong results on the first percentiles of the 57 games and holds better mean and median performance than R2D2 or NGU, it could still achieve higher average performance, as demonstrated by MuZero. Second, all current algorithms are far from achieving optimal performance on some games. To that end, key improvements to pursue might be enhancements in the representations that Agent57 uses for exploration, planning, and credit assignment.