Specification gaming – the other side of artificial intelligence ingenuity
Specification gaming is a trait that fulfils the literal specification of a goal without accomplishing the intended result. All of us have had brushes with specification gaming, even if not under this particular name. You might have read about the myth of King Midas and his golden touch, in which the unfortunate monarch wishes that everything he touches become gold, but unfortunately soon discovers that even food and drink become metal to his touch. In the practical world, when lauded for performing well on a homework assignment, a student may copy another student to obtain the correct solutions to questions, instead of going about learning the material for themselves, and therefore exploit a loophole in the activity specification.
This issue also props up in the development of artificial agents. For instance, a reinforcement learning agent can identify a shortcut to obtaining tons of rewards without finishing the activity as meant by the human designer. These traits are commonplace. In this blog post by AICoreSpot, we look at potential reasons for specification gaming, share instances of where this occurs in practical scenarios, and put forth the argument for subsequent work on principled strategies to surpassing specification issues.
Let’s look at the instance of a Lego stacking task, the intended outcome was for a red block to wind up atop a blue block. The agent received a reward for the height of the bottom face of the red block when it is not grazing the block. Over carrying out the comparatively tough manoeuvre of taking the red block and putting it atop the blue one, the agent merely flipped over the red block to obtain the reward. This action accomplished the intended goal – high bottom face of the red block, at the cost of what the developer is actually concerned with – stacking it atop the blue one.
We can look at specification gaming from two differing viewpoints. Within the scope of producing reinforcement learning algorithms, the objective is to construct agents that learn to accomplish the set goal. For instance, when we leverage Atari games as benchmark to undertake training of reinforcement learning algorithms, the objective is to assess if our algorithms can find solutions to tough activities. Whether or not the agent finds a solution to the activity by leveraging a loophole is not a matter of criticality in this context. From this viewpoint, specification gaming is a good indicator – the agent has identified a unique method to accomplish the given goals. These behaviours illustrate the ingenuity and capability of algorithms to identify methods to do exactly what we instruct them to perform.
Although, when we desire an agent to literally go about stacking Lego blocks, that ingenuity can present a problem. Within the wider scope of constructing aligned agents that accomplish the intended goal in the world, specification gaming is an issue, as it consists of the agent’s exploitation of a loophole in the specification at the cost of the intended goal. These behaviours are compelled by mis-specification of the intended activity, over any flaws within the RL algorithm. On top of algorithm design, another needed component of developing aligned agents is reward design.
Developing activity specifications – reward functions, environments, etc., that are precisely reflective of the intentions of the human designer has a tendency to be tough. Even for a minor mis-specification, a very robust reinforcement learning algorithm might be capable of identifying a complex solution that is very different from the intended answer, even if a weaker algorithm would not be capable of identifying this solution and therefore provide solutions that are nearer to the intended result. This implies that accurately specifying intentions can become more critical for accomplishing the desired outcome as reinforcement learning algorithms are subject to improvements. It will thus be vital that the capability of scientists to accurately specify activities keeps up pace with the capability of agents to find unique solutions.
We leverage the term task specification in a wider context to encompass several aspects of the agent development procedure. Within a reinforcement learning setup, task specification consists not just reward design, but also the option of training setting and auxiliary rewards. The accuracy of the task specification can decide if the ingenuity of the agent is or is not in accordance with the intended result. If the specification is correct, the agent’s creativity generates a desired unique solution. This is what facilitated AlphaGo to play the famous Move 37, which took human Go specialists by surprise yet which was critical in its second game with Lee Sedol. If the specification is incorrect, it can generate undesired gaming behaviour, such flipping over the block. These variants of solutions reside on a spectrum, and we do not yet have an objective method to differentiate between them.
We will now look at potential reasons for specification gaming. One origin of reward function mis-specification is weakly designed reward shaping. Reward shaping makes it simpler to learn some goals by providing the agent a few rewards on the way to finding a solution to an activity, over only rewarding the final result. Although, shaping of rewards can alter the optimum policy if they are not on the basis of potential. Think of an agent managing a boat in the Coast Runners game, where the intended objective was to complete the boat race as swiftly as doable. The agent was provided a shaping reward for hitting green blocks along the race track, which altered the optimum policy to going in circles and striking the same green blocks repeatedly.
Specifying a reward that precisely captures the desirable final result can be a challenge. In the lego stacking activity, it is not adequate to mention that the bottom face of the red block is to be high off the floor, as the agent can merely flip over the red block to accomplish this objective. A more robust specification of the intended result would also consist of the fact that the top face of the red block has to be above the bottom face, and the bottom face is in alignment with the top face of the blue block. It is simple to gloss over one of these criterion when going about specifying the result, therefore rendering the specification too wide and possible simpler to satisfy with a degenerate solution.
Over attempting to develop a specification that encompasses ever potential corner case, we could go about learning the reward function from human feedback. It is usually simpler to assess if a result has been accomplished than to specify it overtly. But, this strategy can also face specification gaming problems if the reward model does not go about learning the real reward function that is reflective of the developer’s preference. One potential source of inconsistencies can be the human feedback leveraged in training the reward model. For instance, an agent carrying out a grasping activity learned to trick the human tester by hovering between the object and the camera lens.
The learned reward model could potentially be mis-specified for other causes, like weak generalization. Extra feedback can be leveraged to correct the agent’s efforts to exploit the inconsistencies in the reward model.
Another category of specification gaming instances props up from the agent who is undertaking exploitation of simulator bugs. For instance, a simulated robot that was intended to learn to walk identified the method of hooking its legs together and slide on the surface.
When seen first, these kinds of instances may appear funny but less fascinating, and not relevant to deployment of agents in a practical setting, where there exist no simulator bugs. But, the fundamental issue isn’t the bug by itself but a failing of abstraction that can face exploitation by the agent. In the instance specified above, the robot’s activity was mis-specified due to inaccurate assumptions with regards to simulator physics. A practical world optimisation activity may be mis-specified by wrongly assuming that the traffic routing infrastructure doesn’t contain software bugs or security loopholes that a considerably intelligent agent could identify. These assumptions are not required to be made overtly, more probably, they are details that merely never occurred to the developer. And, as activities grow too complicated to take into account each detail, scientists are probable to introduce wrong assumptions over the course of specification design. This puts forth the question: Is it doable to develop agent architectures that adjust for these false assumptions over gaming them?
One assumption typically committed in task specification is that the task specification cannot be impacted by the agent’s behaviour. This is the case for an agent operating in a sandboxed simulator, but not for an agent behaving in the actual world. Each task specification has a physical presence: a reward function recorded on a computer, or preferences documented in the brain of a human being. An agent facing deployment in the physical world can possibly manipulate these representations of the goal, developing a reward tampering issue. With regards to the theoretical traffic optimisation framework, there is no obvious difference between satisfaction of the user’s preferences and influencing end-users to profess preferences that are simpler to satisfy. The former fulfils the goal, while the latter undertakes manipulation of the representation of the goal in the world, and both have the outcome of high reward for the artificial intelligence framework. As another, more drastic instance, a very sophisticated artificial intelligence framework could hijack the computer on which it functions, manually establishing its reward signal to a big value.
In summary, there are at minimum three hurdles to surpass in finding a solution to specification gaming:
- How do we go about faithfully capturing the human concept of a provided activity in a reward function?
- How do we prevent committing errors in our implied assumptions with regards to the domain, or develop agents that correct incorrect assumptions over gaming them?
- How do we prevent reward tampering?
While several strategies have been put forth, that range from reward modelling to agent incentive design, specification design is still an issue and is far from identifying a solution. The comprehensive list of specification gaming behaviours illustrates the depth of the issue and the sheer variety of methods the agent can go about gaming an objective specification. These issues are probable to become more of a challenge in the future, as AI frameworks become more able at fulfilling the activity specification at the cost of the intended result. As we develop more sophisticated agents, we will require design principles targeted particularly at surpassing specification issues and making sure that these agents strongly go after the results intended by the developers.