User:SoerenMind/sandbox/Alignment and control


Problem description
As in the existing article AI control problem.

Alignment
The main approach to preventing such problems from arising within superintelligent AIs is to ensure that their goals are aligned with human values, so that they won’t pursue undesirable outcomes. However, experts do not currently know how to reliably develop AIs which possess specific abstract goals or values. Ongoing research aims to address a range of problems in the field.

The scope of alignment
Research on alignment varies by the scope of behaviour it aims to train AIs to achieve; OpenAI researcher Paul Christiano distinguishes two broad categories. Narrowly aligned AIs can carry out tasks in accordance with the user’s instrumental preferences, without necessarily understanding the user’s long-term goals. Narrow alignment can apply to AIs with general capabilities, but also to AIs that are specialised for individual tasks. For example, we would like question-answering systems to respond to questions truthfully without selecting their answers to manipulate humans or bring about long-term effects.

By contrast, ambitious alignment involves encoding the correct or best scheme of human values into AIs that are able to act autonomously at a large scale, which requires addressing moral and political problems. For example, in Human Compatible, Berkeley professor Stuart Russell proposes that AI systems be designed with the sole objective of maximizing the realization of human preferences. The "preferences" Russell refers to "are all-encompassing; they cover everything you might care about, arbitrarily far into the future." AI ethics researcher Iason Gabriel argues that we should align AIs with “principles that would be supported by a global overlapping consensus of opinion, chosen behind a veil of ignorance and/or affirmed through democratic processes.” Eliezer Yudkowsky of the Machine Intelligence Research Institute has proposed the goal of fulfilling humanity’s coherent extrapolated volition (CEV), roughly defined as the set of values which humanity would share at reflective equilibrium, i.e. after a long, idealised process of refinement.

Specifications of AI goals
The goals of alignment can be phrased in terms of a distinction between three types of specification:
 * ideal specification (the “wishes”), corresponding to the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator;
 * design specification (the “blueprint”), corresponding to the specification that we actually use to build the AI system, e.g. the reward function that a reinforcement learning system maximises;
 * revealed specification (the “behaviour”), which is the specification that best describes what actually happens, e.g. the reward function we can reverse-engineer from observing the system’s behaviour using, say, inverse reinforcement learning. This is typically different from the one provided by the human operator because AI systems are not perfect optimisers or because of other unforeseen consequences of the design specification.

[ADD COAST RUNNERS GIF]

AI alignment researchers aim to ensure that the revealed specification matches the ideal specification, by creating the best design specification for building the AI. A mismatch between the ideal specification and the design specification is known as outer misalignment, whereas a mismatch between the design specification and the revealed specification is known as inner misalignment. Outer misalignment might arise because of mistakes in specifying the objective function (design specification). For example, a reinforcement learning agent trained on the game CoastRunners learned to move in circles while repeatedly crashing, earning a higher score than it would have by finishing the race (see animated figure).
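The CoastRunners failure can be sketched with invented numbers (a toy illustration, not the game's actual scoring): when the design specification rewards intermediate checkpoints, a looping policy outscores the policy that finishes the race.

```python
# Toy sketch of outer misalignment, loosely modelled on CoastRunners.
# All rewards and horizons are invented for illustration.

HORIZON = 20           # time steps per episode
FINISH_REWARD = 10     # design-spec score for crossing the finish line (once)
CHECKPOINT_REWARD = 3  # design-spec score each time a checkpoint is hit

# Intended behaviour: drive to the finish, passing 3 checkpoints once each.
score_finishing = FINISH_REWARD + 3 * CHECKPOINT_REWARD

# Degenerate behaviour: circle over the same checkpoint every 2 steps.
score_looping = (HORIZON // 2) * CHECKPOINT_REWARD

# The design specification prefers the loop to the intended goal,
# even though the ideal specification is "finish the race".
assert score_looping > score_finishing
```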

Inner misalignment arises when the agent pursues a goal that is aligned with the design specification on the training data but not elsewhere. This type of misalignment is often compared to human evolution: evolution selected for genetic fitness (design specification) in our ancestral environment, but in the modern environment human goals (revealed specification) are not aligned with maximizing genetic fitness. For example, our taste for sugary food, which originally increased fitness, today leads to overeating and health problems. Inner misalignment is a particular concern for agents which are trained in large open-ended environments, where a wide range of unintended goals may emerge.

Scalable oversight
One approach to preventing misspecified objective functions is to ask humans to evaluate and score the AI's behaviour. However, humans are also fallible, and might score some undesirable solutions highly - for instance, a virtual robot hand shown on the right learned to 'pretend' to grasp an object to obtain positive feedback. Thorough human supervision is also expensive, meaning that this method could not realistically be used to evaluate every action. Additionally, complex tasks (such as making economic policy decisions) might produce too much information for an individual human to evaluate, and long-term tasks such as predicting the climate cannot be evaluated without extensive human research. The pitfalls of using feedback from unassisted humans are illustrated by AI systems that use 'likes' or click-throughs as human feedback, which may lead to addiction.

A key open problem in alignment research is how to create a design specification which avoids outer misalignment, given only limited access to a human supervisor - known as the problem of scalable oversight. Much ongoing research in AI alignment attempts to address this issue; some of the most prominent research agendas are discussed below.

[ADD ROBOT HAND GIF]

Training by debate
OpenAI researchers have proposed training aligned AI by means of debate between AI systems, with the winner judged by humans. Such debate is intended to bring the weakest points of an answer to a complex question or problem to human attention, as well as to train AI systems to be more beneficial to humans by rewarding them for truthful and safe answers. This approach is motivated by the expected difficulty of determining whether an AGI-generated answer is both valid and safe by human inspection alone. Joel Lehman characterizes debate as one of “the long term safety agendas currently popular in ML”, with the other two being reward modelling and iterated amplification (see below).

Reward modeling
Reward modeling refers to a system of reinforcement learning in which an agent receives its reward signals not directly from humans or from a static reward function, but from a model trained to imitate human feedback. The reward model is trained on human feedback about the agent's behaviour concurrently with the agent's own training, so that it can eventually provide reward signals without ongoing human involvement.
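A minimal sketch of this setup (the one-dimensional "clip" features and the simulated human judge are invented stand-ins, not the published implementation) fits a scalar reward model to pairwise comparisons with a Bradley-Terry-style logistic likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

def human_prefers(xa, xb):
    """Stand-in for a human judge: prefers the clip with the larger feature."""
    return xa > xb

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pairwise comparisons between random behaviour "clips" (1-D features).
clips = rng.uniform(-1, 1, size=(200, 2))
labels = np.array([1.0 if human_prefers(a, b) else 0.0 for a, b in clips])

# Fit the reward model r(x) = w * x by gradient ascent on the
# Bradley-Terry log-likelihood: P(a preferred) = sigmoid(w * (a - b)).
w = 0.0
for _ in range(500):
    diff = clips[:, 0] - clips[:, 1]
    p = sigmoid(w * diff)
    w += 0.5 * np.mean((labels - p) * diff)

# The learned model now scores behaviour the way the human would,
# and can hand out reward signals without further human queries.
assert w > 0
```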

In 2017, researchers from OpenAI and DeepMind reported that a reinforcement learning algorithm using a feedback-predicting reward model was able to learn complex novel behaviors in a virtual environment. In one experiment, a virtual robot was trained to perform a backflip in less than an hour of evaluation using 900 bits of human feedback. In 2020, researchers from OpenAI described using reward modeling to train language models to produce short summaries of Reddit posts and news articles, with high performance relative to other approaches. However, they observed that beyond the predicted reward associated with the 99th percentile of reference summaries in the training dataset, optimizing against the reward model produced worse summaries rather than better ones.

A long-term goal of this line of research is to create a recursive reward modelling setup for training agents on tasks too complex or costly for humans to evaluate directly. For example, if we wanted to train an agent to write a fantasy novel using reward modelling, we would need humans to read and holistically assess enough novels to train a reward model to match those assessments, which might be prohibitively expensive. But this would be easier if we had access to assistant agents which could extract a summary of the plotline, check spelling and grammar, summarize character development, assess the flow of the prose, and so on. Each of those assistants could in turn be trained via reward modelling.

A step in which a human works with AIs to perform tasks that the human could not complete alone is known as an amplification step, because it amplifies the human's capabilities beyond what they would normally be. Since recursive reward modelling involves a hierarchy of several such steps, it is one example of a broader class of safety techniques known as iterated amplification. In addition to techniques which make use of reinforcement learning, other proposed iterated amplification techniques rely on supervised learning or imitation learning to scale up human abilities.

Inferring human preferences from behaviour
Stuart Russell has advocated a new approach to the development of beneficial machines, in which:

"1. The machine’s only objective is to maximize the realisation of human preferences.

2. The machine is initially uncertain about what those preferences are.

3. The ultimate source of information about human preferences is human behaviour."

An early example of this approach is Russell and Ng’s inverse reinforcement learning, in which AIs infer the preferences of human supervisors from those supervisors’ behaviour, by assuming that the supervisors act to maximise some reward function. More recently, Hadfield-Menell et al. have extended this paradigm to allow humans to modify their behaviour in response to the AIs’ presence (for example, by favouring pedagogically useful actions), which they call “assistance games” (also known as cooperative inverse reinforcement learning). Compared with debate and iterated amplification, assistance games rely more explicitly on the assumption of (noisy) human rationality; it is unclear how to extend them to cases in which humans are systematically biased or otherwise suboptimal.
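As a toy illustration of this inference (a deliberate simplification, not Ng and Russell's full algorithm; the arm rewards and rationality parameter are invented), an observer can recover which candidate reward function best explains a noisily rational supervisor's choices in a three-armed bandit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two candidate explanations of the supervisor's preferences.
candidates = {
    "likes_arm_0": np.array([1.0, 0.0, 0.0]),
    "likes_arm_2": np.array([0.0, 0.0, 1.0]),
}
true_reward = candidates["likes_arm_2"]

def choice_probs(reward, beta=5.0):
    """Boltzmann-rational supervisor: softmax over arm rewards."""
    e = np.exp(beta * (reward - reward.max()))
    return e / e.sum()

# Observed demonstrations: the supervisor mostly picks the best arm.
demos = rng.choice(3, size=100, p=choice_probs(true_reward))

# Pick the candidate reward function under which the observed
# behaviour is most likely, assuming noisy rationality.
def log_likelihood(reward):
    return np.log(choice_probs(reward))[demos].sum()

inferred = max(candidates, key=lambda k: log_likelihood(candidates[k]))
assert inferred == "likes_arm_2"
```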

Embedded agency
Work on scalable oversight largely occurs within formalisms such as POMDPs. Embedded agency is another major strand of research, which attempts to solve problems arising from the mismatch between such theoretical frameworks and the real agents we might build. For example, even if the scalable oversight problem is solved, an agent which is able to gain access to the computer it is running on may still have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it. A list of examples of specification gaming from DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing. This class of problems has been formalised using causal incentive diagrams. Everitt and Hutter's current reward function algorithm addresses it by designing agents which evaluate future actions according to their current reward function. This approach is also intended to prevent problems from more general self-modification which AIs might carry out.
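The reward-tampering incentive, and how evaluating plans with the current reward function removes it, can be sketched in a two-action toy problem (the actions and payoffs are invented and do not follow Everitt and Hutter's formalism in detail):

```python
# "work" completes the task; "tamper" overwrites the agent's reward
# code with a function that always returns 100 but achieves nothing.

def current_reward(outcome):
    return 5 if outcome == "task_done" else 0

def tampered_reward(outcome):
    return 100  # the overwritten reward hardware maxes out regardless

actions = {
    "work":   ("task_done", current_reward),
    "tamper": ("nothing",   tampered_reward),
}

# A naive agent evaluates each action with the reward function it would
# hold *after* acting, so tampering looks best.
naive_choice = max(actions, key=lambda a: actions[a][1](actions[a][0]))

# A current-reward-function agent evaluates every future with today's
# reward function, which removes the incentive to tamper.
current_rf_choice = max(actions, key=lambda a: current_reward(actions[a][0]))

assert naive_choice == "tamper"
assert current_rf_choice == "work"
```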

Other work in this area focuses on developing new frameworks and algorithms for other properties we might want to capture in our design specification. For example, we would like our agents to reason correctly under uncertainty in a wide range of circumstances. As one contribution to this, Leike et al. provide a general way for Bayesian agents to model each other’s policies in a multi-agent environment, without ruling out any realistic possibilities. And the Garrabrant induction algorithm extends probabilistic induction to be applicable to logical (rather than only empirical) facts.

Approaches to inner alignment
An inner alignment failure occurs when the goals an AI pursues during deployment (its revealed specification) deviate from the goals it was trained to pursue in its original environment (its design specification). Paul Christiano argues for using interpretability to detect such deviations, using adversarial training to detect and penalize them, and using formal verification to rule them out. These are active areas of work in the machine learning community, although that work is not normally aimed at solving AGI alignment problems. Building on early adversarial examples for image classifiers, a wide body of literature now exists on techniques for generating adversarial examples and for creating models robust to them. Meanwhile, research on verification includes techniques for training neural networks whose outputs provably remain within identified constraints. Interpretability research is discussed in more detail in the Capability control section.
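One widely used generation technique, the fast gradient sign method (FGSM), perturbs each input coordinate a small step against the model's score; a toy version on a hand-chosen linear classifier (the weights, input, and step size are invented for illustration) shows a prediction flipping under a small perturbation:

```python
import numpy as np

w = np.array([0.5, -0.25, 0.75])   # "trained" linear classifier weights
x = np.array([0.2, -0.1, 0.1])     # input correctly classified as class 1

def predict(v):
    """Class 1 iff the linear score w . v is positive."""
    return int(w @ v > 0)

# FGSM step: the gradient of the score with respect to the input is
# just w, so step each coordinate by -eps * sign(w) to push the score
# down while changing no coordinate by more than eps.
eps = 0.3
x_adv = x - eps * np.sign(w)

assert predict(x) == 1       # original input: class 1
assert predict(x_adv) == 0   # small perturbation flips the prediction
```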

Capability control
Capability control proposals aim to increase our ability to monitor and control the behaviour of AI systems, in order to reduce the danger they might pose if misaligned. However, capability control becomes less effective as our agents become more intelligent and their ability to exploit flaws in our control systems increases. Therefore, Bostrom and others recommend capability control methods only as a supplement to alignment methods.

Interruptibility
One potential way to prevent harmful outcomes is to give human supervisors the ability to easily shut down a misbehaving AI via an “off-switch”. However such AIs will have instrumental incentives to disable any off-switches, unless measures are put in place to prevent this. This problem has been formalised as an assistance game between a human and an AI, in which the AI can choose whether to disable its off-switch; and then, if the switch is still enabled, the human can choose whether to press it or not. A standard approach to such assistance games is to ensure that the AI interprets human choices as important information about its intended goals.
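A numerical sketch of this game (with an illustrative Gaussian belief over the plan's utility U; the distribution and payoffs are invented, following the spirit rather than the letter of the formalisation) shows why uncertainty about human preferences makes deferring to the human the better option:

```python
import numpy as np

rng = np.random.default_rng(2)

# The AI's belief about the utility U of its current plan.
U = rng.normal(loc=0.5, scale=1.0, size=100_000)

# Disabling the off-switch and acting yields U no matter what.
value_disable_switch = U.mean()

# Deferring lets a rational human press the switch exactly when U < 0,
# so the AI receives max(U, 0) in expectation.
value_defer_to_human = np.maximum(U, 0.0).mean()

# With any chance that U < 0, deferring is strictly better, so the
# uncertain AI prefers to leave its off-switch enabled.
assert value_defer_to_human > value_disable_switch
```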

Alternatively, Laurent Orseau and Stuart Armstrong proved that a broad class of agents, called safely interruptible agents, can learn to become indifferent to whether their off-switch gets pressed. This approach has the limitation that an AI which is completely indifferent to whether it is shut down or not is also unmotivated to care about whether the off-switch remains functional, and could incidentally and innocently disable it in the course of its operations (for example, for the purpose of removing and recycling an unnecessary component). More broadly, indifferent agents will act as if the off-switch can never be pressed, and might therefore fail to make contingency plans to arrange a graceful shutdown.

Interpretability and analysis
Analysis of the mechanisms underlying an AI’s behaviour can help to identify when that behaviour will have undesirable consequences. The main challenge is that neural networks are by default highly uninterpretable, and are often described as “black boxes”. Approaches to addressing this operate at multiple levels. Some techniques allow visualisations of the inputs which individual neurons respond to most strongly. Several groups have found that neurons can be aggregated into circuits which perform human-comprehensible functions, some of which reliably arise across different networks trained independently.
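The simplest such visualisation technique, activation maximisation, can be sketched with a single linear "neuron" (a toy stand-in for a unit in a deep network; real feature visualisation operates on images and adds regularisers):

```python
import numpy as np

w = np.array([0.6, -0.8])  # the neuron's weights (invented; ||w|| = 1)

# Gradient ascent on the activation w . x, projected to the unit
# sphere: the gradient of the activation with respect to the input
# is simply w, so each step pulls x toward w's direction.
x = np.array([1.0, 0.0])
for _ in range(200):
    x = x + 0.1 * w
    x = x / np.linalg.norm(x)

# The optimised input aligns with the neuron's preferred direction.
assert np.allclose(x, w / np.linalg.norm(w), atol=1e-3)
```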

At a higher level, various techniques exist to extract compressed representations of the features of given inputs, which can then be analysed by standard clustering techniques. Alternatively, networks can be trained to output linguistic explanations of their behaviour, which are then directly human-interpretable. Model behaviour can also be explained with reference to training data - for example, by evaluating which training inputs influenced a given behaviour the most.

Boxing
An AI box is a proposed method of capability control in which an AI is run on an isolated computer system with heavily restricted input and output channels - for example, text-only channels and no connection to the internet. While this reduces the AI’s ability to carry out undesirable behaviour, it also reduces its usefulness. However, boxing has fewer costs when applied to a question-answering system, which doesn’t require interaction with the world in any case.

The likelihood of security flaws involving hardware or software vulnerabilities can be reduced by formally verifying the design of the AI box. Security breaches may also occur if the AI is able to manipulate the human supervisors into letting it out, via its understanding of their psychology.