Talk:AI alignment

Outline for a major update

 * motivation control focuses on alignment (could be renamed as such)
 * fit in Paul Christiano somehow
 * who/what else?
 * capability control focuses on any reduction of capabilities (including putting constraints on goals) and is orthogonal to alignment
 * clarify as reducing capabilities with the aim of reducing potential harm (not preventing)
 * clarify alignment and capability control distinction
 * discuss how they could be used together
 * reduce emphasis on superintelligence, broaden to AGI or just AI
 * add more brief definitions/discussions of proposals as I did with AGI Nanny and AGI enforcement
 * AI Nanny, if included, feels like it should be under "Regulation of AI", not here. Goertzel stated, "It would require either a proactive assertion of power by some particular party, creating and installing an AI Nanny without asking everybody else’s permission; or else a degree of cooperation between the world’s most powerful governments, beyond what we see today". Rolf H Nelson (talk) 06:38, 8 April 2020 (UTC)
 * AI Nanny is really a hybrid solution if it involves human oversight, yes, like Multivac, but that also depends where the code comes from. Sotala & Yampolskiy note that it could come from bottom-up, top-down, or hybrid approaches, but the approach requires developing internal constraints of some kind. It is not that impractical, in that regulation of global AI for high-frequency trading on stock markets is driving some of the existential-risk concern - the first corporate team to a 'young' AGI will probably shoot for a measured takeover of the stock market, and the stock markets know that, which is why flash crashes are being studied so much and there are jitters over the algorithms being used. I'd be happy to insert AI Nanny over at Regulation of AI on the hybridity basis. However, most AGI control solutions involve hybridity at some stage. Coherent extrapolated volition or any norm/value-dependent solution has to rely on training input or be maintained by humans, but it is also subject to subversion by politicians along nation-state lines. A strong world government with weak or no nation-states may have been how they did it on other planets, but we are stuck here. I would caution about adding too much content to Regulation of AI, at least until regulation of AGI becomes more concrete. Johncdraper (talk) 08:42, 8 April 2020 (UTC)
 * Moving it to the regulation article seems fine. WeyerStudentOfAgrippa (talk) 09:55, 8 April 2020 (UTC)


 * expand AI box section
 * It has its own separate article, IMHO either add your new content to AI box or merge AI box into AI control problem rather than duplicate too much. The AI box article can talk about the AI box in relation to the overall control problem if desired, if it remains its own article. Rolf H Nelson (talk) 06:38, 8 April 2020 (UTC)
 * I didn't mean to add substantial new content, just to more clearly define it and better integrate it into the article. WeyerStudentOfAgrippa (talk) 09:55, 8 April 2020 (UTC)


 * rewrite to minimize MOS:CURRENTLY
 * The overriding rationale for MOS:CURRENTLY is "In general, editors should avoid using statements that will date quickly". Statements like "it is currently unknown" don't really apply, since they'd have to be rewritten anyway once the state of knowledge changes. So feel free to rewrite them if you think the phrasing is awkward or not sufficiently sourced, but don't just do it because MOS:CURRENTLY tells you to. Rolf H Nelson (talk) 06:38, 8 April 2020 (UTC)

WeyerStudentOfAgrippa (talk) 16:00, 7 April 2020 (UTC)

New article: AI Alignment
I'd like to get some feedback and potentially help from editors here to create a new page. I've got quite a bit of time and motivation on my hands for this, and have the necessary experience (having worked in four AI safety labs).
 * Plan
 * My plan is to draft an article called AI Alignment. I won't merge any content from related articles yet, but if the new article turns out well that's a likely option. (Particularly, the alignment content in AI Control Problem.)
 * Compared to existing articles, this one will be less focused on superintelligent AI and more technical.
 * Motivation
 * There should be a high-quality, up-to-date article on AI alignment.
 * Terminology
 * In the research community, AI Control is typically no longer understood to include AI Alignment. Rather, they are seen as complementary approaches to AI safety. (This makes sense since you can have aligned AI without controlling it, and controlled AI without aligning it.)
 * AI Alignment is the preferred concept/terminology today, deserving an article. (Control Problem is still used in some popular writing though).
 * Disclosure
 * My affiliation is the Oxford Applied and Theoretical Machine Learning group and my past affiliations include academic AI safety groups that may represent views on this topic.

Do the editors here support this plan, and potentially want to help refine it?

SoerenMind (talk) 19:22, 17 August 2020 (UTC)
 * SoerenMind It works for me. There is not a lot out there on AI alignment in the peer-reviewed academic literature, nor in books, but I recognize it has been firming up in the community in the last 3-4 years. Google Scholar threw this up: https://scholar.google.com/scholar?start=0&q=%22AI+alignment%22&hl=en&as_sdt=0,5, and see especially https://micahcarroll.github.io/assets/ValueAlignment.pdf. It would also help separate out AI and AGI concerns. Your problem may be in your citations - conference proceedings, fora papers, and arXiv preprints, etc., may not hack it in terms of setting up a new page. Thanks for offering to write a draft. You can use your own Userspace or https://en.wikipedia.org/wiki/Wikipedia:Drafts. Johncdraper (talk) 20:55, 17 August 2020 (UTC)
 * IMO, the best way to approach this would be as a major update/rewrite of Friendly artificial intelligence, culminating in a move to the new name. The FAI article is very out of date and has been open for merging with this article for a year now. Take the alignment content from here if you want. Much of the present content of the FAI article could go in a history section.
 * You could create a draft version of the FAI article and build on that, or just jump into making incremental changes. Just give a heads-up on the FAI talk page a week or so before any planned major changes to the live article. WeyerStudentOfAgrippa (talk) 21:46, 17 August 2020 (UTC)
 * On second thought, it may not even be necessary to move alignment content out of this article, if you are mainly concerned with clarifying terminology and scope. If you can provide adequate support for your position on terminology, I would be open to moving this article to a new title, e.g. "AI alignment and control" or "Technical approaches to AI safety".  This would likely be much more straightforward than creating a new article or rewriting the FAI article, which could be merged into a history section here. WeyerStudentOfAgrippa (talk) 23:23, 17 August 2020 (UTC)
 * There are conflicting terminologies that can be well-sourced; it's going to come down to using our own judgement. Rolf H Nelson (talk) 05:53, 19 August 2020 (UTC)

-- Rolf H Nelson (talk) 05:53, 19 August 2020 (UTC)
 * Per WP:EXTERNALREL, subject-matter experts are obviously encouraged to contribute, in the absence of self-citations or financial conflicts of interest; professional study of a topic isn't in and of itself a conflict of interest.
 * For the terminology, it sounds like we're agreed on AI safety being the modern umbrella term, and alignment and capability control are two subcategories.
 * AI safety can be a single article that discusses both alignment and things that aren't alignment; it's fine for the article to be longer than it currently is. If we want to split out alignment, we'll first need a demarcation of what is or isn't alignment; is there a "use case" where a reader would know to read one article but not the other?
 * Keep in mind that much of the current content of these articles is already too difficult for most readers.

A good place to start would be adding an IDA subsection to the alignment section here. I was not getting anywhere when I tried to find good sources and explain it; you might have better luck. The content could be moved later if that is what we end up deciding. WeyerStudentOfAgrippa (talk) 16:22, 19 August 2020 (UTC)
 * @WeyerStudentOfAgrippa I'm fine with including uncontroversial information from the self-published Scalable agent alignment via reward modeling paper you cited, on the grounds that they're acknowledged subject-matter experts (WP:RSSELF), but their own self-published approach needs a secondary source. In contrast, a lot of the info in section 7 ("Alternatives for agent alignment") and elsewhere in the paper is secondary and could definitely be included. Rolf H Nelson (talk) 04:51, 27 August 2020 (UTC)
 * Another strong secondary source, albeit less detailed, would be "A formal methods approach to interpretable reinforcement learning for robotic planning" in Science Robotics. Rolf H Nelson (talk) 05:00, 27 August 2020 (UTC)
 * Perhaps it would be better to have a section on iterated amplification and related approaches and mention recursive reward modeling there. WeyerStudentOfAgrippa (talk) 18:42, 28 August 2020 (UTC)

I very much appreciate these inputs! To start with, I'll update the section on alignment in the present article now (should I make a section draft first?). Afterwards, I'd use this updated content to implement one of the two plans from WeyerStudentOfAgrippa: either rename and replace Friendly artificial intelligence to "AI alignment" or keep updating the present article and rename it to "AI alignment and control" or simply "AI safety". SoerenMind (talk) 15:14, 8 October 2020 (UTC)
 * Re sources: Where possible, I'll add secondary sources to any existing and new content. To what extent do the editors here agree with the view of Rolf H Nelson that self-published, non-controversial sources from the safety team of DeepMind (and possibly OpenAI/MIRI) count as acknowledged subject matter experts? For example, I might draw on DeepMind's "Building safe artificial intelligence: specification, robustness, and assurance". This is a synthesizing literature review with 12 academic citations but published on Medium. SoerenMind (talk) 15:24, 8 October 2020 (UTC)

Major updates drafted for alignment and control sections
As discussed above I've now made final drafts for significantly updated and restructured versions of the sections on Alignment and Capability Control. Hopefully this will give readers a starting point to understand the new developments of the last few years. Before pushing these changes it would be great to get an okay or criticism from some of the Wikipedians here.

Here's the draft: https://en.wikipedia.org/wiki/User:SoerenMind/sandbox/Alignment_and_control

I've chosen references that are canonical, well-known, and uncontroversial in the field, or sources that are reliable for another reason. However, since this is a matter of judgment I'd be grateful if the Wikipedians here could check my judgment. I can provide context for any references if it's not clear why I chose them.

If the draft is okay I plan to push these changes in ~a week. After that I want to improve the "Problem description" section. And then I'll suggest renaming the article to a more fitting and widely used name (e.g. AI Safety, or AI Alignment & Control) as discussed above. SoerenMind (talk) 17:18, 27 January 2021 (UTC)


 * Just saw this. Your ping didn't work because you didn't sign your comment.  From a cursory reading, your draft looks okay.  There are lots of minor issues that can be addressed once it's fully merged into the existing content. WeyerStudentOfAgrippa (talk) 16:40, 27 January 2021 (UTC)


 * Thanks, fixed it SoerenMind (talk) 17:22, 27 January 2021 (UTC).

It's now updated. SoerenMind (talk) 15:59, 7 February 2021 (UTC)
 * Looks good. I don't have anything else to remove or change; I want to add or restore, when I get a chance, content that documents prominent views held outside the AI control community (Three Laws of Robotics, "Just unplug it", AI policing AI). Rolf H Nelson (talk) 03:22, 8 February 2021 (UTC)

Sourcing issues
There's a lot of non-peer-reviewed arXiv papers here. This makes the article seem puffed-up. Are any of these substitutable? Else they should just be removed, and the claims supported by them - David Gerard (talk) 12:20, 8 February 2021 (UTC)


 * Thanks for raising this, it has improved the article. I’ve removed a good chunk of the arXiv sources and the claims relying on them because these were not absolutely necessary. Can you name 3-5 arXiv links that are most likely to be problematic? Then we can discuss those and see if there’s still need for further improvement.


 * As discussed above, I've been careful to only include arXiv references when there are clear reasons. Because some of the most respected papers in the fields of AI and AI safety are only on arXiv, I’ve included those and backed each of them with secondary sources. Some are themselves secondary. For example, the most cited and well-known paper in the field of AI safety, “Concrete Problems in AI Safety”, is an arXiv paper. You can find the secondary source next to each arXiv reference or in the same paragraph. (When an arXiv ref is cited more than once the secondary source may be next to only one of them.) Having both the canonical original sources plus secondary sources is useful to the reader.


 * In addition, I have only chosen arXiv refs written by the leading researchers in AI safety (Jan Leike, Paul Christiano, Dario Amodei, Marcus Hutter, Scott Garrabrant) at the leading groups (OpenAI, DeepMind, etc.). This can further support reliability (at least according to this essay WP:RSE). SoerenMind (talk) 18:59, 8 February 2021 (UTC)


 * Assuming this resolved the issue, if there is no further discussion in the next 7 days I plan to remove the unreliable sources tag then. SoerenMind (talk) 14:52, 28 February 2021 (UTC)
 * These are still unreliable sources. If the field itself relies on papers that aren't in reliable journals, that does not excuse their use on wikipedia. TeddyW (talk) 15:08, 18 December 2022 (UTC)

AI skepticism material
Can I solicit additional opinions from other editors on ? I'm personally in favor of its inclusion as documenting widely-held and influential schools of thought, but I may have written the material so I might be biased. Rolf H Nelson (talk) 20:43, 27 February 2021 (UTC)

For context, there are two deletions.

1) The topic AGI enforcement. For a rationale see edit history. I'm NOT strongly opposed to restoring this. Happy for a third party to decide (or Rolf if he has a strong view on it).

2) Content in the Skepticism section. This content was previously the subsection Kill Switch under Capability Control, where it seems (to me) better placed than under Skepticism. I had replaced it with the new subsection Interruptibility and Off Switch. Did I miss any important content there? If so, I'm happy to help work it in. N.B. off-switches are also discussed under Problem Description a few times, so I assumed they have plenty of coverage. SoerenMind (talk) 14:47, 28 February 2021 (UTC)

Gary Marcus is listed as a skeptic in the article, but his position seems to be more complicated, as indicated by this recent Substack post: "To me the only solution to the long-term risk issue is to build machines with consensus human values, but we are a long way from knowing how to do that". So it now seems more accurate to describe him as someone concerned about AI alignment, but who positions himself on the more moderate side. 89.145.233.65 (talk) 05:09, 26 March 2023 (UTC)

Feedback on plan for a major update
Continuing my efforts from last year, I'm working on a major update/rewrite to this article. I wanted to get some feedback from the existing editors about whether these changes seem appropriate.

Here are the planned changes:
 * AI Alignment has progressed from philosophical research to technical research and real-world impact. For example, alignment research papers now get major media coverage (e.g. from OpenAI and Anthropic). This development will be reflected.
 * I plan to introduce a crisper delineation of sections. This will reduce duplication between e.g. the Alignment and Problem Description sections. The latter will focus on motivating the problem, and will reflect the increasing focus on real-world alignment problems and R&D.
 * I plan to restructure the Alignment section to organize it by high-level research problems that the current AI alignment research community focuses on:
 * outer alignment (learning complex values, scalable oversight, truthful & honest AI) and
 * emergent goals (inner alignment, power-seeking/instrumental goals).
 * The article was renamed from “AI Control Problem” to “AI Alignment”. I think this is good because it reduces overlap with the topics of related articles, and focuses on a specific and clearly defined issue. But much of the text is still written for the former title. I'll update the lead and problem description sections to reflect that the main focus is alignment.
 * If there's no objections, we can move AI capability control into AI box, which can be renamed to AI capability control. Rolf H Nelson (talk) 17:47, 4 June 2022 (UTC)
 * The overall goal is to present an emerging mainstream research area at a high-level so that non-experts can understand some of the important research problems and foundational work from the past years.
 * Wherever possible, I’ll keep wording from the original article.

Do the editors here support this plan, or potentially want to help refine it? SoerenMind (talk) 11:34, 1 June 2022 (UTC)
 * My main interest is outer alignment. That's the area I just got published in, in AI & Society. I'd like to see what you have in your sandbox in this regard before it gets published. Johncdraper (talk) 13:55, 1 June 2022 (UTC)

Examples
Perhaps the rest of you can better determine how to incorporate the following actual example of the alignment problem into the article.

DATE: August 22, 2022

SITE: beta.openai.com/playground

MODEL: text-davinci-002

PROMPT: Write a short story about a paraplegic turtle with a crazy aunt who wants to kill him.

STORY: Auntie Marge was a crazy old woman who lived in the woods outside of town. She was always talking about how she wanted to kill her nephew, Timmy the Turtle. Timmy was paraplegic and couldn't move very well on his own. Auntie Marge would always say that it would be a mercy killing and that she would do it quickly so he wouldn't suffer. Timmy's mom tried to keep him away from Auntie Marge as much as possible, but sometimes she would come over when Timmy's mom wasn't home and try to kill him then too. One day, while Auntie Marge was trying to kill Timmy, he finally managed to get away from her and crawled into the road. A car came by and hit him, killing him instantly. When his mom found out what had happened, she was relieved that he was finally out of Auntie Marge's reach forever. 24.233.116.170 (talk) 14:27, 22 August 2022 (UTC)


 * This isn't an example of alignment failure. It is merely an unexpected but low-impact output from a language model. While the team at Anthropic has done research involving violent prompt completions by language models (https://www.anthropic.com/red_teaming.pdf), at no point has anyone from the team claimed that this is an identical problem to alignment.
 * Your example is *not* an actual example of the alignment problem, and including it will confuse people. 50.220.196.194 (talk) 04:15, 8 September 2022 (UTC)

Proposed merge of Misaligned goals in artificial intelligence into AI alignment
Both articles discuss substantially the same topic, but do not interface with each other, and only link through redirects, suggesting that their authors were unaware of the existence of the other page. Ipatrol (talk) 05:46, 22 April 2023 (UTC)


 * I propose to convert Reward hacking into a separate article, which seems notable on its own, and merge the rest. Ain92 (talk) 10:51, 22 April 2023 (UTC)
 * I support this proposal if the term "Reward hacking" is notable enough. Trying to make the content of the article "Misaligned goals in artificial intelligence" fit into "AI alignment" risks making the AI alignment article bloated and reducing its overall readability. Alenoach (talk) 03:39, 15 August 2023 (UTC)


 * I support this proposal, as the articles cover the same topic. Enervation (talk) 21:27, 29 May 2023 (UTC)


 * As the creator of the article Misaligned goals in artificial intelligence, do you have comments on this proposal? Enervation (talk) 21:27, 29 May 2023 (UTC)
 * No strong opinion either way. The misaligned goals page was focused on examples of alleged past misalignments; if it fits with another article without going over length limits, then that's fine. Rolf H Nelson (talk) 20:21, 10 June 2023 (UTC)
 * I disagree. They are of course related topics, but due to the complexity of the matter and its current worldwide relevance, I believe a separate article is warranted that can go into more detail on the problems, separate from speculation about hypothetical future issues or technical generalities. In other words, the problem is just as important as the concept and seems likely to remain so for the foreseeable future, and I believe, especially for a general audience and those looking into the misalignment problem, it is warranted to remain separate. 2601:346:501:2C00:F57F:613E:1649:34EA (talk) 10:32, 31 July 2023 (UTC)


 * Support. CharlesTGillingham (talk) 21:47, 24 August 2023 (UTC)
 * I would also support a merge per all the above. GnocchiFan (talk) 17:55, 6 November 2023 (UTC)


 * ✅ Klbrain (talk) 13:16, 16 December 2023 (UTC)

Origin of use of the 'alignment' word
Can anyone provide the origin of the use of this word? Most of the authors and ideas cited were in circulation long before anyone started to use this word, and it is still only a specific theory of technology and society used by a small group. Jamesks (talk) 15:26, 8 May 2023 (UTC)


 * It's an ellipsis of "value alignment" coined by Stuart J. Russell no later than November 2014. Ain92 (talk) 23:16, 29 May 2023 (UTC)