Decision-making by simulating history
Reinforcement learning (RL) has been successfully applied to tasks with a clearly defined reward function – AlphaZero for Go, OpenAI Five for Dota 2, and AlphaStar for StarCraft come to mind. But in the majority of real-world scenarios, there is no clearly defined reward function. Even a seemingly clear-cut activity like tidying a room has many subtleties: should a postcard with nothing written on it be thrown away, or does it carry sentimental value for the person living in the room? Should dirty laundry go in the washer, or be stashed away in a closet? Where are items such as pens and paper supposed to be stored? Even when these subtleties are resolved, converting them into a reward is non-trivial: if we reward the agent every time it cleans the room, it might deliberately create situations in which the room has to be cleaned again, just to collect the reward.
One alternative is to learn a reward function from human feedback about the agent's behavior. For instance, Deep RL from Human Preferences learns a reward function from human comparisons of video clips of the agent's behavior. The drawback is that this approach can be quite expensive: training a MuJoCo Cheetah to run forward requires a human to provide approximately 750 comparisons. That's a lot.
Instead, we can devise an algorithm that learns policies without any human supervision or reward function, by leveraging information that is implicit in the state of the world. For instance, we can train a policy that balances the MuJoCo cheetah on its front legs from a single state in which it is balancing.
Learning preferences by observing
Visualize an elaborate structure made out of cards – a house of cards, if you will. Your immediate instinct is to be careful and deliberate, to make sure you don't disturb the intricate structure. But what, precisely, tells you that you shouldn't disturb this house of cards? Assuming you've never encountered such a scenario before, it cannot be previous experience. Nor can it be evolution: our ancestors did not regularly encounter elaborate structures built out of cards. It just didn't happen. What compels you to be vigilant is respect for the effort someone put into constructing this house of cards – it didn't appear by magic, all by itself. Someone wouldn't have gone through all this trouble unless they really cared about the project.
Reward Learning by Simulating the Past (RLSP) is an algorithm that formalizes this kind of reasoning about preferences implicit in the state of the physical world, enabling an agent to infer human preferences without explicit feedback.
Now take the example of a robot in a room with an ornate vase at its center. To the right of the vase is a black door; beyond the vase is a purple door. The human who deploys the robot asks it to navigate to the purple door. If we program a reward function that solely rewards the robot for reaching the purple door, it will take the quickest path there, knocking over the vase in the process – the human simply hasn't programmed the robot to avoid disturbing the vase. The robot may even know that its trajectory will break the vase; it just doesn't care, because nothing tells it that it shouldn't. How can we expect a robot to appreciate the value of an ornate vase unless we explicitly inform it of that value?
RLSP lets the robot infer that the vase in the center of the room should not be disturbed. At a high level, it considers all the ways the past could have unfolded, checks which of them are consistent with the present state, and learns a reward function from that comparison. If the human did not care about the vase, it would probably have been broken at some point in the past. The vase's intact presence is therefore evidence that the human values it: the most plausible histories are ones in which the human deliberately avoided breaking it, which implies the vase shouldn't be disturbed. Compare this with the human's preferences about the tiles in the room: we would observe the exact same state regardless of those preferences, so RLSP makes no inference about them.
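In a small enough world, this reasoning can be carried out literally, by enumerating every possible history and weighing how consistent each one is with the observed state. The toy model below (all probabilities and names are illustrative assumptions, not taken from the paper) compares how likely an intact vase is under a human who values it versus one who is indifferent:

```python
from itertools import product

# Toy model: at each of T past steps the human acts "carefully" (the vase
# always survives) or "carelessly" (the vase breaks with probability 0.8).
# We observe the vase intact and ask which preference best explains that.

T = 5
BREAK_PROB = {"careful": 0.0, "careless": 0.8}

def p_intact(policy_probs):
    """Probability the vase survives all T steps, summed over every
    possible history of actions the human could have taken."""
    total = 0.0
    for history in product(BREAK_PROB, repeat=T):
        p_history = 1.0  # probability the human took this action sequence
        p_survive = 1.0  # probability the vase survived that sequence
        for action in history:
            p_history *= policy_probs[action]
            p_survive *= 1.0 - BREAK_PROB[action]
        total += p_history * p_survive
    return total

indifferent = p_intact({"careful": 0.5, "careless": 0.5})   # doesn't care
cares = p_intact({"careful": 0.95, "careless": 0.05})       # values the vase

# The intact vase is far more likely if the human values it, so this style
# of reasoning infers a preference for keeping the vase intact.
```

The catch, of course, is that the loop over histories grows exponentially with the length of the past, which is exactly the scaling problem discussed next.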
The drawback is that this approach requires reasoning about all possible versions of history, which is intractable in anything beyond small settings. Previous research has only evaluated the idea in simple gridworld scenarios. What does it take to scale it to larger, continuous environments where we do not have complete knowledge of the environment dynamics? Intuitively, it should still be possible to make these inferences.
Visualize a cat balancing on its hind legs. By the same reasoning as with the vase, few behaviors end with the cat balancing this way, so we can infer that the cat prefers to balance on its hind legs, for whatever reason.
Simulations of history
The primary challenge in scaling RLSP to larger environments is reasoning about "what could have happened in the past". To tackle this, we sample plausible past trajectories rather than enumerating all possible ones. For the cat balancing on its hind legs, we can reason that it must have gradually lifted its front legs while shifting its weight to its core, until it eventually stood up on its hind legs.
Model-based RL algorithms typically simulate the future by using a policy π(a_t ∣ s_t) to choose actions and an environment dynamics model T(s_{t+1} ∣ s_t, a_t) to predict future states. Likewise, we can simulate the past by using an inverse policy π⁻¹(a_t ∣ s_{t+1}) that predicts which action a_t the agent took to end up in state s_{t+1}, and an inverse environment dynamics model T⁻¹(s_t ∣ s_{t+1}, a_t) that predicts the state s_t from which that action would have led to s_{t+1}. By alternating between predicting past actions and predicting the states from which they were taken, we can simulate trajectories arbitrarily far into the past.
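This alternation amounts to a simple loop. The sketch below assumes the learned inverse policy and inverse dynamics models are available as callables; the function and parameter names are illustrative, not taken from the paper's code:

```python
def simulate_backward(s_T, inv_policy, inv_dynamics, horizon):
    """Sample one plausible past trajectory ending in state s_T.

    inv_policy(s_next)        -> action a_t thought to have led to s_next
    inv_dynamics(s_next, a_t) -> previous state s_t from which a_t was taken

    In Deep RLSP both would be learned stochastic models; here they are
    just callables so the control flow is clear.
    """
    states, actions = [s_T], []
    s_next = s_T
    for _ in range(horizon):
        a_t = inv_policy(s_next)         # which action led here?
        s_t = inv_dynamics(s_next, a_t)  # from which state was it taken?
        actions.append(a_t)
        states.append(s_t)
        s_next = s_t
    # Reverse so the trajectory reads forward in time, ending at s_T.
    return states[::-1], actions[::-1]
```

For example, with toy linear models where each action adds 1.0 to a scalar state, `simulate_backward(10.0, lambda s: 1.0, lambda s, a: s - a, 3)` reconstructs the states 7.0, 8.0, 9.0 leading up to the observed 10.0.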
Before diving into the details of how these models are trained, let's first see how we will use them to infer preferences from an observed state s_0.
The Deep RLSP Gradient Estimator
The RLSP algorithm uses gradient ascent to iteratively update a linear reward function so that it explains an observed state. To scale this up, we make two critical modifications: first, we learn a feature representation of each state and define the reward function to be linear in these features; second, we estimate the RLSP gradient by sampling plausible past trajectories rather than enumerating all of them.
The result is the Deep RLSP gradient estimator, which aims to maximize the probability of the observed state under a reward function described by a parameter vector.
At a high level, the gradient computation has three stages: first, simulate backward to identify what could have happened before; next, simulate forward to see what the current policy does; finally, compute the difference between the features of the backward and forward trajectories. This gradient updates the reward so that it rewards the features seen in the backward trajectories and penalizes the features seen in the forward trajectories. As a result, when the reward is reoptimized, the new policy tends to produce trajectories that look less like the forward trajectories and more like the backward ones.
Put another way, the gradient pushes the reward function toward making the backward trajectories (what must have happened in the past) and the forward trajectories (what an agent would do under the current reward) consistent with one another. Once the trajectories are consistent, the gradient becomes zero, and we have learned a reward function that is likely to produce the observed state.
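Concretely, for a reward that is linear in the learned features, this gradient estimate reduces to a difference of average feature vectors between the sampled backward and forward trajectories. A minimal sketch, with illustrative names and trajectories given as plain lists of states:

```python
def deep_rlsp_gradient(backward_trajs, forward_trajs, phi):
    """Estimate a gradient for a linear reward r(s) = theta . phi(s).

    The gradient points toward features seen in the backward trajectories
    (what plausibly happened before the observed state) and away from
    features seen in forward rollouts of the current policy. phi maps a
    state to a feature vector (a list of floats). This is a sketch of the
    idea, not the paper's exact estimator.
    """
    def mean_features(trajs):
        feats = [phi(s) for traj in trajs for s in traj]
        n, dim = len(feats), len(feats[0])
        return [sum(f[i] for f in feats) / n for i in range(dim)]

    back = mean_features(backward_trajs)
    fwd = mean_features(forward_trajs)
    # Ascending this gradient makes the reward prefer backward-like states.
    return [b - f for b, f in zip(back, fwd)]
```

When the two sets of trajectories visit the same features on average, the returned vector is all zeros and the reward stops changing, matching the consistency condition described above.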
The core of the algorithm is to perform gradient ascent using this gradient. However, computing it requires ϕ, π⁻¹, T⁻¹, π, and T. We train these models from an initial dataset D of environment interactions, generated by rolling out a random policy – no human input is required. We can then learn the relevant models as follows:
- The feature function ϕ can be trained with any self-supervised representation learning technique; we use a variational autoencoder (VAE).
- The forward policy π is trained with deep RL; we use Soft Actor-Critic (SAC).
- The forward environment dynamics T do not need to be learned, since we have access to a simulator of the environment.
- The inverse policy π⁻¹ is trained with supervised learning on (s, a, s′) transitions collected while executing π.
- The inverse environment dynamics model T⁻¹ is trained with supervised learning on the transitions in D.
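The two inverse models are ordinary supervised learning problems over the transition dataset. The sketch below substitutes least squares for the neural networks used in practice, just to show the input/output layout; the function names are ours, not the paper's:

```python
import numpy as np

def fit_inverse_models(transitions):
    """Fit linear inverse models from a dataset of (s, a, s') transitions.

    The inverse policy maps s' -> a, and the inverse dynamics model maps
    (s', a) -> s. Deep RLSP uses neural networks for both; plain least
    squares stands in here to illustrate the supervised-learning setup.
    """
    S = np.array([t[0] for t in transitions])        # previous states
    A = np.array([t[1] for t in transitions])        # actions taken
    S_next = np.array([t[2] for t in transitions])   # resulting states

    # Inverse policy: predict the action from the resulting state.
    W_pi, *_ = np.linalg.lstsq(S_next, A, rcond=None)
    # Inverse dynamics: predict the previous state from (s', a).
    X = np.hstack([S_next, A])
    W_T, *_ = np.linalg.lstsq(X, S, rcond=None)

    inv_policy = lambda s_next: s_next @ W_pi
    inv_dynamics = lambda s_next, a: np.hstack([s_next, a]) @ W_T
    return inv_policy, inv_dynamics
```

On a toy linear system with s′ = s + a, these fitted models recover the action and previous state exactly, and they plug directly into the backward simulation loop described earlier.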
Deep RLSP in MuJoCo
To evaluate the algorithm, we apply it to tasks in the MuJoCo simulator. These environments are commonly used to benchmark RL algorithms, and a typical task is to make a simulated robot walk.
To test Deep RLSP, we use RL to train policies that walk, run, or otherwise move forward, and then sample a single state from each policy. Deep RLSP must then use just that state to infer that it is supposed to make the simulated robot move forward. Note that this task is a bit easier than it sounds: state information in MuJoCo contains not just joint positions but also velocities, so a single state already conveys some information about how the robot is moving.
The experiments show that this works reasonably well. We evaluated it mainly on a cheetah robot and a hopper robot, and in both cases it did learn to move forward. As one would expect, the learned policies perform worse than policies explicitly trained on the true reward function.
The whole point of Deep RLSP is to learn in settings where there is no explicit reward. So, as a more compelling test, we use Deep RLSP to imitate behaviors from a single state that would be hard to define explicitly in a reward function. We generated a set of "skills" using an unsupervised skill discovery algorithm called DADS – including the balancing skill from the cat example earlier – then sampled a single state (or a small number of states) from each and evaluated whether Deep RLSP could learn to imitate that skill.
Since we have no true reward function for balancing, there is no direct way to quantitatively evaluate Deep RLSP's performance. Instead, we watched videos of the learned policies and evaluated them qualitatively.
While our preliminary evaluation of Deep RLSP is promising, much work remains before we can reliably learn preferences from the state of the world.
The main requirement for Deep RLSP to work well is learning good inverse environment dynamics models, inverse policies, and feature functions. In the MuJoCo environments we relied on simple representation learning for the feature function and supervised learning for the models; this strategy is unlikely to work in much larger environments or in real-world robotics applications. We are optimistic, however, that progress in model-based RL can be applied directly to this problem.
Another open question is how to learn preferences from the state of the world in an environment with multiple agents. Typically the state will have been optimized by one or more humans, and we want a separate robot to learn their preferences. Deep RLSP currently infers the human's reward and policy, but ultimately we want to use that inference to inform the robot's behavior. In our experiments the human and the robot were the same agent, so we could use the inferred policy directly as the robot's policy, but this will not carry over to practical applications.
Lastly, while we focused on imitation learning, Deep RLSP could also learn safety constraints such as "don't disturb the vase". Learning preferences from the physical state of the environment could thus be useful for applying RL in safety-critical settings.