Talk:Reinforcement learning from human feedback

Technical
@Moorlock: What's your reasoning in tagging this article with the template? All of the terms used are already explained in the reinforcement learning article, and I would certainly assume that someone coming to this specialized article either already knows the basics of RL or, if not, will go to the RL article to learn more. Needless to say, we shouldn't be redefining every technical term from RL in this article, too. Popo Dameron (talk) 22:13, 29 March 2023 (UTC)


 * I'm less convinced that readers of this page should be expected to have read a different page first as a prerequisite. I see the acronym RLHF tossed around enough nowadays that I anticipate people may come to Wikipedia just to find out what it stands for. Terms like "robustness," "exploration," "agent," "reward model," "policy," "reward function" are not necessarily meaningful to people who are not already well-versed in the discipline. Wikipedia is best when it makes some effort to explain such jargon to the general reader. Moorlock (talk) 22:33, 29 March 2023 (UTC)
 * @Moorlock: I don't know, I would find it strange to define terms like "agent," "reward model," "policy," and "reward function" in this article when they're just core RL terms. Why would I not assume that readers will follow wikilinks to learn more? I mean, if someone who has no knowledge of RL comes here, then these terms will make no sense until they do understand RL. But that's not the topic of this article, so it shouldn't be expected to give a self-contained explanation of basic RL, no? That's what the reinforcement learning article is for.
 * As a similar example, the ChatGPT article says in its lead that the model is built on top of "families of large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques," yet terms like "language models," "transfer learning," "supervised learning," and "reinforcement learning" are never defined in the article. Instead, an interested reader can navigate through wikilinks and learn about them. Popo Dameron (talk) 23:06, 29 March 2023 (UTC)
 * I agree that additional wikilinks to pages or sections that more thoroughly explain some of these terms of art would be helpful and maybe sufficient. Moorlock (talk) 23:17, 29 March 2023 (UTC)

Early learning from human feedback
I think that this page should include a reference to one of the earliest examples of RLHF, from 2014: Scheirer, W. J., Anthony, S. E., Nakayama, K., & Cox, D. D. (2014). Perceptual annotation: Measuring human vision to improve computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1679–1686. To the best of my (admittedly limited) knowledge, this is the earliest example of systematically using human feedback to improve machine learning, and the authors deserve the credit of a citation and a concise discussion of this original work. Jj1236 (talk) 13:59, 25 October 2023 (UTC)
 * I agree that background is important, but that particular paper does not appear to be about RL at all. Even if it were, and with regard to other early papers more generally, we should be careful not to give undue weight: "RLHF" is a recently coined term that refers to a rather specific technique, not just any kind of RL based on human feedback. popo dameron (talk) 03:41, 12 March 2024 (UTC)

Attempt to simplify the first paragraph
It seems to me that the first paragraph of the introduction could be better explained, especially for non-specialists, so I tried to rewrite it. But I'm not sure whether it's better or worse, so I'm posting it here instead:

Original version:

In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent to human preferences. In classical reinforcement learning, the goal of such an agent is to learn a function that guides its behavior called a policy. This function learns to maximize the reward it receives from a separate reward function based on its task performance. However, it is difficult to define explicitly a reward function that approximates human preferences. Therefore, RLHF seeks to train a "reward model" directly from human feedback. The reward model is first trained in a supervised fashion—independently from the policy being optimized—to predict if a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model is then used as a reward function to improve an agent's policy through an optimization algorithm like proximal policy optimization.

Proposed version:

In machine learning, reinforcement learning from human feedback (RLHF) is a technique used to align an intelligent agent with human preferences.

In reinforcement learning, the agent learns how to behave in order to maximize a reward function. However, it is difficult to explicitly define a reward function that approximates human preferences. RLHF therefore first trains a preference model in a supervised fashion, based on rankings that humans give to different AI-generated answers. This model is then used as a reward function to train other models, through an optimization algorithm such as proximal policy optimization.

Feel free to give some honest feedback on what you think of this proposal. Alenoach (talk) 02:32, 21 May 2024 (UTC)
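
To make the step that both versions describe more concrete, below is a minimal sketch of the pairwise preference loss commonly reported in the RLHF literature for training the reward model, written in PyTorch. The class name, the toy embedding inputs, and the dimensions are all hypothetical placeholders chosen for illustration; an actual reward model is a fine-tuned language-model backbone, not a single linear layer.

import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    """Toy stand-in for a reward model (hypothetical, for illustration only)."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        # Maps a response embedding to a single scalar reward.
        self.score = torch.nn.Linear(embedding_dim, 1)

    def forward(self, response_embedding):
        return self.score(response_embedding).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Human annotators ranked pairs of responses to the same prompt.
# Random tensors stand in for real response embeddings here.
chosen = torch.randn(32, 128)    # embeddings of preferred responses
rejected = torch.randn(32, 128)  # embeddings of dispreferred responses

# Pairwise (Bradley-Terry) preference loss: -log sigmoid(r_chosen - r_rejected)
# pushes the reward of the preferred response above the dispreferred one.
optimizer.zero_grad()
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()

After training, the scalar that reward_model produces for a candidate response is what an algorithm such as proximal policy optimization would then maximize when fine-tuning the policy, as described in both versions of the paragraph above.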