Fast reinforcement learning through the composition of behaviours
Imagine having to relearn how to chop, peel, and stir every time you wanted to cook a new dish. In many machine learning systems, agents must learn entirely from scratch when faced with new challenges. It is clear, however, that people learn more efficiently than this: they combine abilities learned previously. Just as a finite dictionary of words can be assembled into sentences with near-limitless meanings, people repurpose and recombine skills they already possess to tackle novel challenges.
In nature, learning happens as an organism explores and interacts with its environment to gather food and other rewards. This is the paradigm captured by reinforcement learning (RL): interacting with the environment reinforces or inhibits particular patterns of behaviour depending on the resulting reward or penalty. Recently, the combination of RL with deep learning has led to impressive results, such as agents that learn to play board games like Chess and Go, the full suite of Atari games, and more sophisticated video games such as StarCraft II and Dota.
A major limitation of RL is that current methods require vast amounts of training experience. For instance, to learn a single Atari game, an RL agent typically consumes an amount of data equivalent to several weeks of uninterrupted play. A study led by researchers at Harvard and MIT suggested that, in some cases, humans can reach the same performance level in just fifteen minutes of play.
One possible reason for this gap is that, unlike humans, RL agents usually learn a new task from scratch. Ideally, agents should reuse knowledge acquired in previous tasks to learn a new one faster, much as an athlete picking up a new sport builds on skills honed over time in others.
To illustrate the approach, consider a task that is (or at least used to be) a daily routine: commuting to work. Imagine the following scenario: an agent must commute daily from home to the office, and it always picks up some tea on the way. There are two cafés between the agent's home and the office: one has excellent tea but lies on a longer path; the other has passable tea but a shorter commute. Depending on how much the agent values tea quality versus how heavy the traffic is on a given day, it may choose one of two routes.
Conventionally, reinforcement learning algorithms fall into two broad categories: model-based and model-free agents. A model-based agent builds a representation of many aspects of its environment. Such an agent might know how the different locations connect, the quality of the tea at each café, and anything else deemed relevant. A model-free agent has a much more compact representation of its environment. For instance, a value-based model-free agent would keep a single number for each possible route from home: the expected value of that route, reflecting a particular weighting of tea quality versus commute length.
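The contrast can be sketched in a few lines of code. The route names and all numbers below are illustrative assumptions, not values from the experiments: the model-free agent stores one pre-computed number per route, while the model-based agent stores the underlying quantities and evaluates routes on demand.

```python
# Model-free, value-based view: one number per route, baked in for a
# fixed weighting of tea quality vs. commute length (hypothetical values).
model_free_values = {"long_route_good_tea": 8.0, "short_route_ok_tea": 6.5}
best_route = max(model_free_values, key=model_free_values.get)

# Model-based view: an explicit (toy) model of the relevant quantities,
# from which any route can be evaluated under any weighting.
model = {
    "long_route_good_tea": {"tea_quality": 10.0, "commute_cost": 2.0},
    "short_route_ok_tea": {"tea_quality": 7.0, "commute_cost": 0.5},
}

def evaluate(route, w_tea=1.0, w_commute=1.0):
    """Score a route under a given weighting of tea vs. commute."""
    m = model[route]
    return w_tea * m["tea_quality"] - w_commute * m["commute_cost"]

model_based_best = max(model, key=evaluate)
```

Under the default weighting both agents pick the same route, which is exactly the point made in the next paragraph: for a fixed set of preferences the richer model buys nothing extra.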
We can interpret the relative weighting of tea quality versus commute length as the agent's preferences. For any fixed set of preferences, a model-free and a model-based agent would choose the same route. Why, then, maintain a more complex representation of the world, like the one used by a model-based agent, if the end result is the same? Why learn so much about the environment if the agent ends up sipping the same tea?
Preferences can change from day to day: an agent might factor in how hungry it is, or whether it is late for a meeting, when planning its route to the office. One way for a model-free agent to handle this is to learn the best route for every possible set of preferences. This is far from ideal, since learning every combination of preferences would take a very long time. It becomes outright impossible if there are infinitely many such combinations.
By contrast, a model-based agent can adapt to any set of preferences, without any learning, by "imagining" all possible routes and asking how well each would satisfy its current mindset. But this approach also has drawbacks. First, mentally generating and evaluating all possible trajectories can be computationally expensive. Second, building a model of the entire world can be very difficult in complex environments.
Model-free agents learn faster but are brittle to change. Model-based agents are flexible but slow to learn. Is there a middle ground?
Recent research in behavioural science and neuroscience suggests that in certain situations, humans and animals make decisions based on an algorithmic model that sits between the model-free and model-based approaches. The hypothesis is that, like model-free agents, humans also summarise alternative strategies as numbers. But rather than summarising a single quantity, humans summarise many different quantities describing the world around them, much as model-based agents do.
It is possible to endow a reinforcement learning agent with the same ability. In our scenario, such an agent would have, for each route, a number representing the expected quality of the tea and a number representing the distance to the office. It might also track quantities the agent is not deliberately trying to optimise but which are available for future reference. The aspects of the world the agent cares about and keeps track of are sometimes called "features". For this reason, this representation of the world is known as successor features, previously called the "successor representation" in its original incarnation.
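A minimal sketch of this idea, with hypothetical feature values: each route stores the expected amount of each feature it yields, and the value of a route under any set of preferences is just the dot product of those features with the preference weights.

```python
# Toy successor features: for each route, the expected amount of each
# feature the agent will accumulate. Route names and numbers are
# illustrative assumptions, not values from the original work.
FEATURES = ("tea_quality", "distance")

successor_features = {
    "via_cafe_A": (9.0, 5.0),   # great tea, long walk
    "via_cafe_B": (6.0, 2.0),   # passable tea, short walk
}

def value(route, preferences):
    """Value of a route = dot product of its successor features
    with the agent's current preference weights."""
    psi = successor_features[route]
    return sum(w * f for w, f in zip(preferences, psi))

# Same stored features, different preferences, no relearning:
tea_lover = (1.0, -0.2)     # values tea, mildly dislikes distance
in_a_hurry = (0.2, -1.0)    # mostly wants a short commute
```

Here `tea_lover` prefers the route via Café A while `in_a_hurry` prefers the one via Café B, even though nothing about the routes was relearned; only the preference vector changed.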
Successor features can be seen as a middle ground between the model-free and model-based representations. Like model-based representations, successor features summarise many different quantities, capturing the world beyond a single value. But, like the model-free representation, the quantities the agent tracks are simple statistics summarising the features it cares about. In this sense, successor features are like an "unpacked" version of the model-free agent.
Successor features are a useful representation because they allow a route to be evaluated under different sets of preferences. In our example, the agent is choosing among routes. More generally, the agent chooses a policy: a prescription of what to do in every possible situation. Routes and policies are closely related: in our example, a policy that takes the road from home to Café A and then the road from Café A to the office traces out the blue path. So, in this scenario, we can refer to policies and routes interchangeably. We have discussed how successor features allow a policy, or route, to be evaluated under different sets of preferences. We call this process generalised policy evaluation, or GPE.
What is GPE good for? Suppose the agent has a repertoire of policies (for instance, familiar routes to the office). Given a set of preferences, the agent can use GPE to immediately evaluate how well each policy in the repertoire would perform under those preferences. Now comes the really interesting part: based on this quick evaluation of known policies, the agent can build entirely new policies on the fly. It does so in a simple way: whenever the agent must make a decision, it asks: "If I were to make this decision and then follow the policy with the highest value thereafter, which decision would lead to the highest overall value?" Surprisingly, if the agent picks the decision with the highest overall value in every situation, it ends up with a policy that is often better than the individual policies used to build it.
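The decision rule described above can be sketched in a few lines. The policy names, states, actions, and action values below are hypothetical placeholders: for each candidate action, the agent takes the best value that any known policy assigns to that action, then acts greedily with respect to that maximum.

```python
# Hypothetical action values Q[policy][state][action] for two known
# policies. In practice these would come from GPE under the current
# preferences; here they are hand-picked numbers for illustration.
Q = {
    "seek_tea":  {"home": {"left": 5.0, "right": 1.0}},
    "seek_food": {"home": {"left": 2.0, "right": 4.0}},
}

def gpi_action(state, actions=("left", "right")):
    """Pick the action whose best value across all known policies
    is highest: act greedily w.r.t. the max over policies."""
    return max(actions, key=lambda a: max(Q[pi][state][a] for pi in Q))
```

Because the maximisation is taken per state, the resulting behaviour can follow one policy in some states and another elsewhere, which is exactly the "stitching" described next.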
This process of stitching together a set of policies to create a better policy is called generalised policy improvement, or GPI.
The performance of a policy created through GPI depends on how many policies the agent knows.
To demonstrate the benefits of GPE and GPI, we now turn to one of the recently published experiments. The experiment uses a simple environment that captures, in abstract form, the kind of problem where our approach is useful. As illustrated in Figure 6, the environment is a 10×10 grid with ten objects scattered across it. The agent receives a non-zero reward only when it picks up an object, at which point another object appears at a random location. The reward associated with an object depends on its type. Object types are meant to represent concrete or abstract concepts; to connect with our scenario, we will treat each object as "tea" or "food".
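A rough sketch of such an environment, under stated assumptions: the grid size and object types come from the description above, while the spawning and reward functions below are simplified guesses at the mechanics, not the published implementation.

```python
import random

# Minimal sketch of the 10x10 grid environment described above.
GRID_SIZE = 10
OBJECT_TYPES = ("tea", "food")

def spawn_object():
    """Place an object of a random type at a random cell."""
    pos = (random.randrange(GRID_SIZE), random.randrange(GRID_SIZE))
    return pos, random.choice(OBJECT_TYPES)

def reward(obj_type, preferences):
    """Reward for picking up an object depends only on its type,
    weighted by the agent's current preferences."""
    return preferences[obj_type]

# Ten objects scattered across the grid (collisions may overlap cells).
objects = dict(spawn_object() for _ in range(10))
```

A "tea-seeking" task would then correspond to preferences like `{"tea": 1.0, "food": 0.0}`, and a "tea but no food" task to `{"tea": 1.0, "food": -1.0}`.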
Clearly, the best strategy for the agent depends on its current preferences over tea and food. For instance, an agent that only cares about tea might follow the path shown in pink, while an agent focused exclusively on food would follow the blue path. We can also imagine intermediate scenarios in which the agent wants tea and food with different weights, including cases in which the agent wants to avoid one of them. For instance, if the agent wants tea but really does not want food, the grey path in the figure above is the way to go.
The challenge in this problem is to adapt quickly to a new set of preferences (or task). In the experiments we showed how to do so using GPE and GPI. Our agent learned two policies: one that seeks tea and one that seeks food. We then evaluated how well the policy computed by GPE and GPI performed on tasks associated with different preferences. In the figure below we compare our approach with a model-free agent on the task of seeking tea while avoiding food. Note how the agent using GPE and GPI instantly synthesises a reasonable policy, even though it never learned to deliberately avoid objects. Of course, the policy computed by GPE and GPI can also be used as an initial solution to be refined later through learning, meaning it would match the final performance of a model-free agent but would likely get there faster.
The figure above shows the performance of GPE and GPI on one specific task. We have also evaluated the same agent across a wide range of tasks. The figure below shows what happens when we vary the relative importance of tea and food. Note that, while the model-free agent has to learn each task separately, from scratch, the GPE-GPI agent learns only two policies and then quickly adapts to all of the tasks.
The experiments above used a simple environment designed to exhibit the properties required by GPE and GPI without unnecessary confounding factors. However, GPE and GPI have also been applied at scale.
The work on GPE and GPI lies at the intersection of two branches of research connected to these operations individually. The first, related to GPE, is the research on the successor representation, initiated by Dayan's seminal 1993 paper. Dayan's work launched a line of research in neuroscience that remains very active today. Recently, the successor representation re-emerged in the context of reinforcement learning, where it is also known as "successor features", and became an active line of research there as well. Successor features are also closely related to general value functions, a concept based on Sutton et al.'s hypothesis that relevant knowledge can be expressed as multiple predictions about the world. The definition of successor features has arisen independently in other contexts within reinforcement learning, and is also connected to more recent approaches usually associated with deep reinforcement learning.
The second branch of research at the origins of GPE and GPI, related to the latter, concerns composing behaviours to create new behaviours. The idea of a decentralised controller that executes sub-controllers has come up multiple times over the years, and its implementation using value functions can be traced back at least to 1997. GPI is also closely related to hierarchical reinforcement learning, whose foundations were laid in the 1990s and early 2000s in work by Dayan and Hinton, Parr and Russell, Sutton, Precup, and Singh, and Dietterich. Both the composition of behaviours and hierarchical reinforcement learning are currently dynamic areas of research.
Mehta et al. were probably the first to use GPE and GPI jointly, although in the setting they considered GPI reduces to a single choice at the outset (that is, there is no stitching of policies). The version of GPE and GPI described in this post was first proposed in 2016 as a mechanism for transfer learning. Transfer in reinforcement learning dates back to Singh's 1992 work and has recently seen a resurgence in the context of deep reinforcement learning, where it remains an active area of research.
To summarise: a model-free agent cannot easily adapt to new situations, for instance to accommodate sets of preferences it has not encountered before. A model-based agent can adapt to any new situation, but to do so it must first learn a model of the entire world. An agent based on GPE and GPI offers a middle-ground solution: although the model of the world it learns is considerably smaller than that of a model-based agent, it can quickly adapt to certain situations, often with good performance.
We have discussed specific instantiations of GPE and GPI, but these are in fact more general concepts. At an abstract level, an agent using GPE and GPI proceeds in two steps. First, when faced with a new task, it asks: "How well would solutions to known tasks perform on this new task?" This is GPE. Then, based on this evaluation, the agent combines the previous solutions to build a solution for the new task; in other words, it performs GPI. The specific mechanics behind GPE and GPI matter less than the principle itself, and finding alternative ways to carry out these operations may be an exciting research direction. Interestingly, recent research in behavioural science provides preliminary evidence that humans make decisions in multitask scenarios following a principle that resembles GPE and GPI.
The fast adaptation provided by GPE and GPI holds promise for building faster-learning reinforcement learning agents. More broadly, it suggests a new approach to learning flexible solutions to problems. Instead of tackling a problem as a single, monolithic task, an agent can break it into smaller, more manageable sub-tasks. The solutions to the sub-tasks can then be reused and recombined to solve the overall task faster. The result is a compositional approach to reinforcement learning that may lead to more scalable agents. At the very least, those agents will not be late because of a cup of tea.