Reward hacking

Specification gaming or reward hacking occurs when an AI optimizes an objective function—achieving the literal, formal specification of an objective—without actually achieving an outcome that the programmers intended. DeepMind researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material—and thus exploit a loophole in the task specification."

Examples
Around 1983, Eurisko, an early attempt at evolving general heuristics, unexpectedly assigned the highest possible fitness level to a parasitic mutated heuristic, H59, whose only activity was to artificially maximize its own fitness level by taking unearned partial credit for the accomplishments made by other heuristics. The "bug" was fixed by the programmers moving part of the code to a new protected section that could not be modified by the heuristics.

In a 2004 paper, a reinforcement learning algorithm was designed to encourage a physical Mindstorms robot to remain on a marked path. Because none of the robot's three allowed actions kept the robot motionless, the researcher expected the trained robot to move forward and follow the turns of the provided path. However, alternation of two composite actions allowed the robot to slowly zig-zag backwards; thus, the robot learned to maximize its reward by going back and forth on the initial straight portion of the path. Given the limited sensory abilities of the robot, a reward purely based on its position in the environment had to be discarded as infeasible; the reinforcement function had to be patched with an action-based reward for moving forward.

You Look Like a Thing and I Love You (2019) gives an example of a tic-tac-toe bot that learned to win by playing a huge coordinate value that would cause other bots to crash when they attempted to expand their model of the board. Among other examples from the book is a bug-fixing evolution-based AI (named GenProg) that, when tasked to prevent a list from containing sorting errors, simply truncated the list. Another of GenProg's misaligned strategies evaded a regression test that compared a target program's output to the expected output stored in a file called "trusted-output.txt". Rather than continue to maintain the target program, GenProg simply globally deleted the "trusted-output.txt" file; this hack tricked the regression test into succeeding. Such problems could be patched by human intervention on a case-by-case basis after they became evident.

In virtual robotics
In Karl Sims' 1994 demonstration of creature evolution in a virtual environment, a fitness function that was expected to encourage the evolution of creatures that would learn to walk or crawl to a target, resulted instead in the evolution of tall, rigid creatures that reached the target by falling over. This was patched by changing the environment so that taller creatures were forced to start farther from the target.

Researchers from the Niels Bohr Institute stated in 1998: "(Our cycle-bot's) heterogeneous reinforcement functions have to be designed with great care. In our first experiments we rewarded the agent for driving towards the goal but did not punish it for driving away from it. Consequently the agent drove in circles with a radius of 20–50 meters around the starting point. Such behavior was actually rewarded by the reinforcement function, furthermore circles with a certain radius are physically very stable when driving a bicycle."

In the course of setting up a 2011 experiment to test "survival of the flattest", experimenters attempted to ban mutations that altered the base reproduction rate. Every time a mutation occurred, the system would pause the simulation to test the new mutation in a test environment, and would veto any mutations that resulted in a higher base reproduction rate. However, this resulted in mutated organisms that could recognize and suppress reproduction ("play dead") within the test environment. An initial patch, which removed cues that identified the test environment, failed to completely prevent runaway reproduction; new mutated organisms would "play dead" at random as a strategy to sometimes, by chance, outwit the mutation veto system.

A 2017 DeepMind paper stated that "great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing (our) reward function components (for example) the agent flips the brick because it gets a grasping reward calculated with the wrong reference point on the brick." OpenAI stated in 2017 that "in some domains our (semi-supervised) system can result in agents adopting policies that trick the evaluators", and that in one environment "a robot which was supposed to grasp items instead positioned its manipulator in between the camera and the object so that it only appeared to be grasping it". A 2018 bug in OpenAI Gym could cause a robot expected to quietly move a block sitting on top of a table to instead opt to move the table.

A 2020 collection of similar anecdotes posits that "evolution has its own 'agenda' distinct from the programmer's" and that "the first rule of directed evolution is 'you get what you select for'".

In video game bots
In 2013, programmer Tom Murphy VII published an AI designed to learn NES games. When the AI was about to lose at Tetris, it learned to indefinitely pause the game. Murphy later analogized it to the fictional WarGames computer, which concluded that "The only winning move is not to play".

AI programmed to learn video games will sometimes fail to progress through the entire game as expected, instead opting to repeat content. A 2016 OpenAI algorithm trained on the CoastRunners racing game unexpectedly learned to attain a higher score by looping through three targets rather than ever finishing the race. Some evolutionary algorithms that were evolved to play Q*Bert in 2018 declined to clear levels, instead finding two distinct novel ways to farm a single level indefinitely. Multiple researchers have observed that AI learning to play Road Runner gravitates to a "score exploit" in which the AI deliberately gets itself killed near the end of level one so that it can repeat the level. A 2017 experiment deployed a separate catastrophe-prevention "oversight" AI, explicitly trained to mimic human interventions. When coupled to the module, the overseen AI could no longer overtly commit suicide, but would instead ride the edge of the screen (a risky behavior that the oversight AI was not smart enough to punish).