Talk:Reinforcement learning from human feedback/GA1

GA Review

Nominator:

Reviewer: Esculenta (talk · contribs) 05:07, 23 March 2024 (UTC)

Hi, I'll take on this review. Not a SME, but highly interested in this and related topics, so we'll see how this article matches up to the GA criteria. Will have comments here within a few days. Esculenta (talk) 05:07, 23 March 2024 (UTC)


 * Hey, sounds good, thanks for reviewing! popo dameron  ⁠ talk  06:48, 23 March 2024 (UTC)

Ok, here are my thoughts after an initial read-through. I think the article is informative and generally well-written, but there are parts that would be difficult for laypeople to follow. Of course, this is largely unavoidable given the technical nature of the underlying computational science. Most of my comments are suggestions that aim to ameliorate the difficulty of these technical parts. Esculenta (talk) 17:21, 26 March 2024 (UTC)


 * Great, thanks, I'll go through those. popo dameron  ⁠ talk  22:49, 26 March 2024 (UTC)
 * Esculenta, just finished going through and incorporating your feedback. Please let me know if anything is missing or if there's anything else you'd like to see. popo dameron  ⁠ talk  02:18, 28 March 2024 (UTC)
 * Looking good! I'll reread more thoroughly in the next day or 2, but it crossed my mind that the article doesn't mention who first thought up this technique, or when it was first used practically, which seems like it would be an important encyclopaedic addition. More later, Esculenta (talk) 02:29, 28 March 2024 (UTC)
 * Thanks! Just added a bit on that on the (now renamed) background & motivation section. popo dameron  ⁠ talk  03:24, 28 March 2024 (UTC)
 * I still think there might need to be more said about the early developments of this technique. If I'm a reader who wants to know "who first thought up this cool idea", I think I'd leave with the impression that it was OpenAI ("The algorithm for RLHF as used today was introduced by OpenAI in a paper...") in 2020. But my research seems to contradict this. This 2010 paper describes the TAMER framework "for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals", and cites pubs from 2009-2010. This 2011 paper actually uses "Reinforcement learning from human feedback" in the title, so the idea's been around for at least a decade before its first practical usage in the cited 2020 paper. However, I'm not savvy enough to fully understand how or if these early pubs were necessary stepping stones of understanding along the way, or diversions from algorithms used now. Sorry for proposing more work for you, but I think this "Background" could/should be fleshed out into its own section. The Ziegler et al. (2020) paper also gives some historical RLHF background and earlier sources in its introductory section that could be used. What do you think? Esculenta (talk) 17:28, 30 March 2024 (UTC)
 * It's a bit tricky because there's a difference between reinforcement learning from human feedback in general and "RLHF." When someone refers to RLHF today, they are almost definitely referring to the specific algorithm that was indeed first described by OpenAI. Of course, this algorithm was not the first attempt to incorporate human feedback into RL, but "RLHF" usually doesn't refer to that general concept. That's why right now in the background section, I've cited a bunch of older papers that do RL+HF, but without considering them to be the first "instances of RLHF" or anything like that. So, I think I can try to make that a bit clearer in the current background, and maybe I can elaborate a bit more about some of the background methods because, while not exactly RLHF, RLHF was no doubt at least partly inspired by many of them. popo dameron  ⁠ talk  21:43, 30 March 2024 (UTC)
 * As far as I can tell, the Ziegler 2019 paper is the first to introduce RLHF in a form very close to how it is used now, except that their formulation is online instead of offline, which is much more common. Nevertheless, it definitely counts. In their introduction, they mention a lot of papers that try to use human feedback, some using RL and some not, but none of them are very close in terms of the actual method used. popo dameron  ⁠ talk  21:52, 30 March 2024 (UTC)

Lead
 * The sentence "In classical reinforcement learning, the goal of such an agent is to learn a function called a policy that maximizes the reward it receives based on how well it performs its task." could be split for clarity. For example: "In classical reinforcement learning, the goal of an agent is to learn a policy—a function that guides its actions. This policy aims to maximize the agent's reward, which depends on its task performance."
 * Done. popo dameron  ⁠ talk  23:01, 26 March 2024 (UTC)


 * The transition between discussing RLHF's definition and its application areas (in the second paragraph) could be smoother. Consider introducing the application areas with a connecting sentence like "RLHF's adaptability extends to numerous machine learning domains, such as..."
 * Tried to make the transition less stark by introducing the second paragraph with a short sentence. popo dameron  ⁠ talk  23:01, 26 March 2024 (UTC)

Motivation
 * the transition between discussing the general motivation for RLHF and the specifics of previous attempts could be enhanced by a connecting sentence that acknowledges the initial appeal of human-feedback optimization while setting the stage for discussing its complexities and shortcomings in earlier methods, such as: "Despite the clear benefits, prior efforts to integrate human feedback into model optimization have encountered significant challenges."
 * Done. popo dameron  ⁠ talk  03:52, 27 March 2024 (UTC)


 * for clarity, consider explaining or briefly defining less familiar terms for a general audience, such as "sparse or noisy reward function," "robust optimization," and "exploration in reinforcement learning." While the targeted readers might be familiar with these concepts, a succinct explanation could make the section more accessible.
 * Done. popo dameron  ⁠ talk  03:52, 27 March 2024 (UTC)


 * The motivation for RLHF is well established, but the section could briefly mention any specific domains or examples where RLHF has shown promise or has been particularly needed, to give readers a clearer context. The "write a compelling story" example is good (but unsourced, so it looks a bit like WP:OR). Are there other (preferably sourced) examples?
 * The story example was not mine, and I'd forgotten that it's unsourced. I replaced it with a sourced example of RLHF's main use case today: reducing "harmful" outputs from LLMs while remaining helpful. If this is not enough, I can also add an example about text summarization (coming from the paper that largely invented RLHF). popo dameron  ⁠ talk  03:52, 27 March 2024 (UTC)

Collecting human feedback
 * this section could be enhanced by adding more context or explanations for readers unfamiliar with some of the more technical terms and models.
 * I added some brief discussion directed at unfamiliar readers regarding the implications of the MLE convergence. popo dameron  ⁠ talk  05:19, 27 March 2024 (UTC)


 * link Pairwise comparison; regret; explain K-wise comparison; a gloss of Elo rating system might be good
 * Done. popo dameron  ⁠ talk  05:19, 27 March 2024 (UTC)


 * "Both offline data collection models, where the model is learning by interacting with a static dataset and updating its policy in batches, as well as offline data collection models," one of these is supposed to be online, no?
 * Fixed. popo dameron  ⁠ talk  05:19, 27 March 2024 (UTC)


 * "The fundamental difficulty encountered in RLHF from pairwise comparisons, sometimes referred to as dueling reinforcement learning, lies in the non-Markovian nature of its optimal policies, as the optimal policy is not necessarily memoryless." might be difficult to follow for non-experts; how about a bit more explanation, like "A key challenge encountered in RLHF when employing pairwise (or dueling) comparisons is associated with the non-Markovian character of its optimal policies. Unlike typical scenarios where the optimal strategy does not require memory of past actions, in RLHF, the best course of action often depends on previous events and decisions, making the strategy inherently memory-dependent."
 * Done. popo dameron  ⁠ talk  05:19, 27 March 2024 (UTC)


 * It would be helpful to discuss the impact of the quality and quantity of feedback on the training process and the final model performance.
 * The impact of the quality of feedback is already discussed directly in the limitations section. Let me know if this should be moved around or repeated here. I also added a paragraph discussing the impact of quantity. popo dameron  ⁠ talk  05:19, 27 March 2024 (UTC)

Applications
 * some of the technical terms (e.g., KL regularization, semi-supervised learning) could be explained in simpler terms or provided with a brief description.
 * Done. popo dameron  ⁠ talk  05:57, 27 March 2024 (UTC)


 * The part about video game bots is interesting but brief; expanding on how human feedback specifically influences the development and performance of these bots could add depth to the discussion. I'm interested to know the mechanics of how a human provides feedback for video game playing.
 * Added more on that. popo dameron  ⁠ talk  05:57, 27 March 2024 (UTC)


 * The sentence "Successful methods noted that the use of KL regularization in RLHF helped to stabilize the training process." is generally clear but could be improved for better clarity and accuracy in a few ways: methods do not "note" things; researchers, studies, or analyses do (i.e. personification); "successful methods" is vague, it would help the reader to specify what kinds of methods or approaches are being referred to. Are these computational methods, algorithmic enhancements, or specific instances of RLHF application?
 * Clarified. popo dameron  ⁠ talk  05:57, 27 March 2024 (UTC)


 * While the sentence states that KL regularization "helped to stabilize the training process," it could provide more detail on how it contributed to stabilization. For example, did it reduce variability, prevent overfitting, or make the model's learning curve smoother?
 * Clarified and added detail. popo dameron  ⁠ talk  05:57, 27 March 2024 (UTC)

Training
 * Clarity in this section could be enhanced for a broader audience. For example, terms like "cross-entropy loss function," "sigmoid function," and "KL divergence" are not explained within the context.
 * I already have a short parenthetical explanation about KL divergence as a measure of distance (let me know if it is not enough), but for the other two, I'm not really sure about explaining what cross-entropy and sigmoid are. Feels a bit too far out of the scope of the article, and any reader who would be interested in reading this section at all would most likely be familiar with these terms. Correct me if you think I'm wrong about that, though. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)


 * Definitions or brief explanations of key terms and why they are essential could significantly improve understanding for readers unfamiliar with the subject. Adding some explanatory sentences can make the section more understandable and engaging for readers, especially those not deeply familiar with machine learning terminology and processes.
 * Done per below. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)


 * "The human feedback policy is also fine-tuned over the previously trained supervised model." in the spirit of the above comments, might I suggest adding an explanatory sentence to follow, e.g. "This fine-tuning process adapts the pre-existing model (initially trained in a supervised manner) to better align with human feedback by adjusting its parameters based on the rewards derived from human judgments." Other similar suggestions:
 * Done. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
 * Original: "In RLHF, two different models are trained: a reward model and a reinforcement learning (RL) policy." Suggested Explanatory Sentence: "The reward model determines what outcomes are desirable based on human feedback, while the RL policy decides the actions the AI should take to achieve those outcomes."
 * Done. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
 * Original Sentence: "Both models are commonly initialized using a pre-trained autoregressive language model." Suggested Explanatory Sentence: "Starting with a pre-trained model, which already understands language to some extent, speeds up training and improves the model's initial performance."
 * Done. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
 * Original Sentence: "The reward model is then trained by replacing the final layer of the previous model with a randomly initialized regression head that outputs a number corresponding to the score of any given prompt and response." Suggested Explanatory Sentence: "This process adapts the model to evaluate responses based on the quality standards set by human feedback, scoring them on how well they meet these standards."
 * Made things a bit simpler. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
 * Original Sentence: "This model is trained to minimize the following cross-entropy loss function." Suggested Explanatory Sentence: "Minimizing the cross-entropy loss function helps the model to make predictions that are closer to the actual human ratings, improving its ability to judge responses."
 * Done. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
 * Original Sentence: "The environment randomly presents the policy with prompts from the dataset and expects responses to them." Suggested Explanatory Sentence: "This step simulates real-world scenarios where the AI must understand various prompts and generate appropriate responses, helping it learn from diverse situations."
 * Done. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
 * Original Sentence: "The constant β controls the strength of the second term, which is a per-token KL penalty from the initial unaligned model added to prevent over-optimization of the reward model." Suggested Explanatory Sentence: "By adjusting β, the training can balance learning from new data while retaining useful information from the initial model, avoiding the pitfall of fitting too closely to the training data, which can reduce generalization."
 * Done. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
 * Original Sentence: "A second term is commonly added to the objective function that allows the policy to incorporate the pre-training gradients." Suggested Explanatory Sentence: "Incorporating pre-training gradients helps the model to not forget its initial language understanding abilities while it learns new tasks based on human feedback."
 * Merged that into the existing explanation. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)


 * link token
 * Done. popo dameron  ⁠ talk  22:57, 27 March 2024 (UTC)
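
For readers following the Training comments above, the pairwise cross-entropy loss they refer to can be sketched in a few lines. This is only a minimal illustration, assuming the commonly used form in which the loss is the negative log of the sigmoid of the score difference between the human-preferred response and the rejected one. The function names and the example scores are hypothetical and are not taken from the article.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pairwise_reward_loss(score_pairs):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over comparison pairs.

    score_pairs: iterable of (r_preferred, r_rejected) scalar scores, such as
    a regression head on a language model might output (illustrative data).
    """
    pairs = list(score_pairs)
    total = sum(-math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs)
    return total / len(pairs)


# Hypothetical scores: a reward model that agrees with the annotators...
good_model = [(2.0, -1.0), (0.5, -0.5)]
# ...and one that ranks every pair the wrong way round.
bad_model = [(-1.0, 2.0), (-0.5, 0.5)]

print(pairwise_reward_loss(good_model))
print(pairwise_reward_loss(bad_model))
```

The loss is smaller when the model scores the preferred response above the rejected one, which is the sense in which minimizing it pulls the model's judgments toward the human ratings.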

Limitations
 * again, some terms and concepts (such as "algorithmic bias" and "overfitting") might require additional explanation for readers unfamiliar with machine learning jargon. Comparing overfitting to memorizing rather than understanding could help clarify the concept.
 * Done. popo dameron  ⁠ talk  01:46, 28 March 2024 (UTC)


 * Introducing each limitation with a brief summary or example could enhance clarity. For instance, explaining what is meant by “exploiting loopholes in the reward model” with a concrete example would make the concept more accessible.
 * Added a little bit of introductory stuff. That specific sentence was weakly sourced, so I took it out. popo dameron  ⁠ talk  01:46, 28 March 2024 (UTC)


 * The mention of "algorithmic bias" and "overfitting" provides an excellent opportunity to discuss the ethical implications of RLHF and its applications.
 * I tried looking through the literature for good discussion on ethical implications, but unfortunately I couldn't find anything really worth including for now. Will keep this in mind for the future in case I do find something like that, though. popo dameron  ⁠ talk  01:46, 28 March 2024 (UTC)


 * The final point about the model learning to manipulate the feedback process or game the system is significant and could be expanded. Providing specific examples or case studies where this has occurred could illustrate the severity and reality of this limitation. E.g., ethical concerns related to biased decision-making and the potential harm to individuals or groups who are unfairly treated by biased models. Highlight the importance of diversity and inclusivity in the pool of human annotators and the need for mechanisms to detect and mitigate bias in human feedback and the resulting AI systems. Any real-world examples where algorithmic bias in RLHF models has led to ethical issues? Ethical implications of overfitting, e.g. risk of over-relying on systems that perform well only under specific conditions but fail under others, potentially leading to misguided decisions or harmful outcomes, which is particularly critical in applications affecting people's lives, such as medical diagnosis, financial advice, or legal assessments.
 * I added some more discussion about how the model can game the system to the last paragraph. I also added some information to an earlier paragraph about how under-represented groups can be put at a disadvantage the way things work. popo dameron  ⁠ talk  01:58, 28 March 2024 (UTC)

Alternatives
 * this section could benefit from a more detailed comparison between DPO and RLHF to improve the flow and understanding. This could include discussing the advantages and disadvantages of each method and under what circumstances one might be preferred over the other.
 * Added a bit on that, but there isn't much clarity yet as to when each might actually be better. All that's clear right now is that each has its strengths, but deciding which one to use doesn't seem to be possible without just trying both. popo dameron  ⁠ talk  02:17, 28 March 2024 (UTC)


 * The DPO model is sourced to a single paper; is this just a theoretical concept, or has it ever been actually used in model training?
 * Added a second source from Nvidia research that does train a model using DPO. popo dameron  ⁠ talk  02:17, 28 March 2024 (UTC)


 * current text: "However, instead of training an intermediate reward model to then be optimized by a policy using reinforcement learning, DPO uses a change of variables to define the "preference loss" directly as a function of the policy and uses this loss to fine-tune the model." more layperson-friendly suggestion: "Unlike RLHF, which first trains a separate model to understand what good outcomes look like and then teaches the main model how to achieve those outcomes, DPO simplifies the process. DPO directly adjusts the main model according to people's preferences, helping it understand and prioritize what is important without needing a separate step. This approach directly shapes the model's decisions based on what it should or should not do according to human feedback."
 * Done. popo dameron  ⁠ talk  02:17, 28 March 2024 (UTC)
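
To make the DPO description above concrete, here is a rough sketch of a per-example preference loss written directly in terms of the policy, with no separate reward model. It assumes the usual formulation from the DPO literature, in which the loss is the negative log-sigmoid of β times the difference in log-probability ratios (policy versus a frozen reference model) between the preferred and rejected responses. The function name, argument names, default β and numbers are illustrative assumptions, not something stated in the article or this review.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO-style loss (sketch).

    logp_w, logp_l: policy log-probabilities of the preferred / rejected response.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference model.
    beta: strength of the implicit penalty for drifting from the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference has been learned yet.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Policy has shifted probability toward the preferred response.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

The only trainable quantities here are the policy's own log-probabilities, which is the "no separate step" point made in the layperson-friendly wording.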

Final comments


 * the alternative name of this technique, reinforcement learning from human preferences, is mentioned in the lead but not mentioned anywhere in the article text.
 * Removed. popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)


 * current text: "The constant β controls the strength of the second term, which is a per-token KL penalty from the initial unaligned model added to prevent over-optimization of the reward model." suggestion: "The constant β is used to adjust the intensity of the KL penalty. This penalty is applied on a per-token basis to the initial, unaligned model's outputs. Its purpose is to avoid excessively fine-tuning the reward model, ensuring that the training process does not overly specialize the model on the training data."
 * Done. popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)


 * The use of the term "supervised-trained model" in the context of defining πSFT could be clarified. It seems to refer to the initial state of the RL policy before fine-tuning, but specifying this directly could prevent confusion.
 * Done. popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)


 * I'm wondering if it might be beneficial to provide simple, non-technical summaries of the equations in the "Training" section. For example, after the cross-entropy loss function equation, "This equation is a way of measuring how well the reward model's predictions match up with the actual decisions made by humans. It looks at the differences between what the model predicts as a good or bad response to a prompt and what human annotators actually think. The goal is to make the model's guesses as close to human judgments as possible. By minimizing the difference measured by this equation, the model can be trained to understand human preferences better." re: the first objective function: "This equation represents the goal the reinforcement learning (RL) policy is trying to achieve. It calculates a score for how well the policy's responses align with human feedback. The policy generates responses to prompts, and each response is evaluated both on how well it matches human preferences (as determined by the reward model) and how similar it is to responses the model would naturally generate. The equation balances improving alignment with human preferences while ensuring the model's responses remain diverse and not too far removed from what it has learned during its initial training. This helps the model not only to provide answers that people find useful or agreeable but also to maintain a broad understanding and avoid overly narrow or repetitive responses." 3rd equation: "The third equation outlines the method for adjusting the reinforcement learning policy, blending the aim of aligning with human feedback and keeping the original language skills. It introduces a term to the objective function that helps the policy to not only learn from new human feedback but also to keep its basic language understanding from before. This ensures the policy adapts to new preferences without losing its ability to process language effectively, balancing learning new information with preserving its foundational language abilities." I may be extending too far in my efforts to accommodate laypeople, so I won't mind if you object!
 * These are all pretty good and definitely an improvement to the article. Added with some small edits. popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)
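
As a companion to the plain-language summary of the first objective function, the role of β can be shown with a small numeric sketch. It assumes the commonly used form of the objective, in which the reward-model score for a response is reduced by β times the summed per-token log-probability ratio between the policy being trained and the initial supervised model. All names and numbers below are hypothetical and chosen only for illustration.

```python
def kl_penalized_reward(reward, logp_policy, logp_reference, beta):
    """Reward-model score minus beta times the summed per-token log-ratio.

    reward: scalar score from the reward model for one response.
    logp_policy: per-token log-probabilities under the policy being trained.
    logp_reference: per-token log-probabilities under the initial model.
    beta: penalty strength; larger values keep the policy closer to the
          initial model.
    """
    if len(logp_policy) != len(logp_reference):
        raise ValueError("both models must score the same tokens")
    log_ratio = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return reward - beta * log_ratio


# Illustrative two-token response that the policy now finds much more likely
# than the initial model did.
policy_logps = [-0.1, -0.2]
reference_logps = [-1.0, -1.2]

for beta in (0.0, 0.1, 1.0):
    print(beta, kl_penalized_reward(1.0, policy_logps, reference_logps, beta))
```

With β set to zero the policy is judged on the reward-model score alone; as β grows, responses that drift far from what the initial model would have produced are penalized more heavily.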


 * I don't think the Christiano forum post is suitable as a reliable source because of the lack of peer review (despite Christiano's obvious expertise, and the relevance/usefulness of the post).
 * Not sure I agree about this one, considering that, as you mention, Paul Christiano is one of the biggest names in the field, so he almost certainly has more experience to be talking about the limitations of the method than the people that would be peer-reviewing him, at least in my opinion. Would this not work in accordance with WP:EXPERTSPS (Self-published expert sources may be considered reliable when produced by an established subject-matter expert, whose work in the relevant field has previously been published by reliable, independent publications.)? popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)


 * I'm similarly wary of the Wang PDF, which looks like class material (or maybe a student presentation?) from the "Understanding Large Language Models" course at Princeton.
 * Removed. popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)


 * in the RLHF overview image, it might not be immediately clear to all readers what "Aligned Model" refers to. If it's the result of applying RLHF, consider labeling it as the "Policy Model (after RLHF)" or something similar for clarity. While the diagram shows the flow of information, it doesn't detail how the reward model impacts the supervised model's training. An additional arrow from the Reward Model to the Aligned Model with a label like "guides training" could clarify this relationship.
 * Done. popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)


 * I copyedited the article for prose flow, and added some missing bibliographic material. Please check my edits and feel free to revert those you don't agree with.
 * Looks good to me. popo dameron  ⁠ talk  20:22, 1 April 2024 (UTC)


 * I have checked many of the citations for source-text integrity and didn't find any issues. All other GA criteria appear to be met.

Ok, I think that about wraps up what I have to say for this GA nomination. I'll put the review on hold to let you consider and respond to my final comments. Esculenta (talk) 18:44, 1 April 2024 (UTC)


 * Esculenta, just finished addressing everything. The only point I disagreed about was the Christiano source, but if you believe strongly that it is not a reliable source, I do not mind removing it. Thanks for all your work and effort reviewing! popo dameron  ⁠ talk  20:23, 1 April 2024 (UTC)
 * No, I don't mind that source really, and am fine with the WP:EXPERTSPS rationale for keeping it. Thanks for your efforts in writing! Promoting now. Esculenta (talk) 20:39, 1 April 2024 (UTC)