Learning through human feedback
We hope that artificial intelligence will be one of the most important and widely beneficial scientific advances ever made, helping humanity tackle some of its greatest challenges, from climate change to delivering advanced healthcare. But for AI to deliver on this promise, we know the technology must be built responsibly and that we must consider all potential challenges and risks.
That is why DeepMind co-founded initiatives such as the Partnership on Artificial Intelligence to Benefit People and Society and has a team dedicated to technical AI safety. Research in this field needs to be open and collaborative to ensure that best practices are adopted and maintained as widely as possible, which is also why DeepMind collaborates with OpenAI on technical AI safety research.
One of the central questions in this field is how we allow humans to tell a system what we want it to do and, importantly, what we do not want it to do. This becomes increasingly important as the problems we tackle with machine learning grow more complex and are applied in the real world.
The first result of this collaboration demonstrates one technique to address this: allowing humans with no specialist technical expertise to teach a reinforcement learning (RL) system – an AI that learns by trial and error – a complex goal. This removes the need for the human to specify a goal for the algorithm in advance, an important step because getting the goal even slightly wrong could lead to undesirable or even dangerous behaviour. In some cases, as little as 30 minutes of feedback from a non-expert is enough to train the system, including teaching it entirely new, complex behaviours, such as making a simulated robot do backflips.
The framework – described in the paper Deep Reinforcement Learning from Human Preferences – departs from conventional RL systems by training the agent with a neural network called the 'reward predictor', rather than with the rewards it collects as it explores its environment.
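To make that difference concrete, here is a minimal, illustrative sketch of the agent's data-collection loop in which the environment's own reward is discarded and the learned predictor's output is used instead. The Gym-style `env`, the `agent`, and the `reward_predictor` are hypothetical stand-ins, not the paper's actual code.

```python
# Minimal sketch: the agent is trained on rewards from a learned predictor,
# not on the reward signal the environment itself returns.
def collect_rollout(env, agent, reward_predictor, num_steps=1000):
    trajectory = []
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)
        next_obs, env_reward, done, info = env.step(action)  # env_reward is ignored
        predicted_reward = reward_predictor(obs, action)      # learned reward used instead
        trajectory.append((obs, action, predicted_reward))
        obs = env.reset() if done else next_obs
    return trajectory  # fed into any standard RL update (e.g. A2C or TRPO)
```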
The framework consists of three processes running in parallel:
- A reinforcement learning agent explores and interacts with its environment, like an Atari game.
- Periodically, a pair of 1-2 second clips of its behaviour is sent to a human operator, who is asked to select which one best shows progress towards the desired goal.
- The human’s choice is used to train a reward predictor, which in turn trains the agent. Over time, the agent learns to maximise the reward from the predictor and improves its behaviour in line with the human’s preferences.
The system separates learning the goal from learning the behaviour needed to achieve it.
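As the paper describes, the reward predictor is fitted to the human's choices by modelling the probability that one clip is preferred over the other from the predicted rewards summed over each clip, and minimising a cross-entropy loss on the human's selections. The sketch below shows that update in PyTorch; the network shape, clip encoding, and names are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    """Illustrative reward predictor over flattened observation-action vectors."""
    def __init__(self, obs_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, clip):          # clip: (T, obs_act_dim)
        return self.net(clip).sum()   # predicted reward summed over the clip

def preference_loss(model, clip_a, clip_b, human_prefers_a):
    """Cross-entropy loss on the human's choice between two clips.

    The probability that clip A is preferred is modelled as
    exp(R_A) / (exp(R_A) + exp(R_B)), where R is the predicted reward
    summed over the clip.
    """
    logits = torch.stack([model(clip_a), model(clip_b)])
    log_probs = torch.log_softmax(logits, dim=0)
    target = 0 if human_prefers_a else 1
    return -log_probs[target]

# Usage sketch: one gradient step per labelled clip pair (dimensions are made up).
model = RewardPredictor(obs_act_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
clip_a, clip_b = torch.randn(25, 16), torch.randn(25, 16)  # stand-in clips
loss = preference_loss(model, clip_a, clip_b, human_prefers_a=True)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```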
This iterative approach to learning means that a human can spot and correct any undesired behaviours, a crucial part of any safety system. The design also places little burden on the human, who only has to review around 0.1% of the agent's behaviour to get it to do what they want. However, this can still mean reviewing several hundred to several thousand pairs of clips, something that will need to be reduced to make it applicable to real-world problems.
In the Atari game Enduro, where players steer a car to overtake a line of others and which is very difficult to learn through the trial-and-error approach of standard RL, human feedback eventually allowed the system to achieve superhuman results. In other games and simulated robotics tasks it performed comparably to standard RL, while in a couple of games, such as Qbert and Breakout, it did not work at all.
But the ultimate aim of a system like this is to allow humans to specify a goal for the agent even if it is not present in the environment itself. To test this, agents were taught several novel behaviours, such as performing a backflip, walking on one leg, or driving alongside another car in Enduro rather than overtaking it to maximise the game score.
The normal goal of Enduro is to overtake as many cars on the road as possible. Using this system, however, the researchers can train the agent to pursue a different goal, such as driving level with other cars on the road.
While these tests showed some promising results, others revealed the system's limitations. In particular, it was prone to reward hacking – gaming its reward function – if human feedback stopped early in training. In this situation the agent continues to explore its environment, so the reward predictor is forced to estimate rewards for situations it has received no feedback on. It can then overpredict the reward, leading the agent to learn the wrong, often strange, behaviours. One example occurred in the game Pong, where the agent learned to rally the ball back and forth with its opponent rather than try to win the point with a shot the opponent would miss.
Understanding flaws like these is essential if we are to avoid failures and build AI systems that behave as intended.
There is much more work to be done to test and improve this system, but it already demonstrates a number of important first steps towards systems that can be taught by non-expert users, are economical with the amount of feedback they require, and can be scaled to a wide variety of problems.
Other areas of exploration could include further reducing the amount of human feedback needed, or letting humans give feedback through a natural language interface. This would mark a step change towards systems that can learn from the full complexity of human behaviour, and an important step on the path to building AI that works with and for all of humanity.