Talk:Reinforcement learning

Question
Is R=Σtγtrt meant to be read as $$R = \sum \limits_{t^\gamma}^{t} r_t$$ or $$R = \sum \limits_{t\gamma}^{t} r_t$$ or $$R = \sum \limits_{t}^{t} \gamma r_t$$?

Answer: It is $$R = \sum \limits_{t=0}^{\infty} \gamma^{t} r_t$$, i.e. each reward r_t is weighted by the discount factor γ raised to the power t.
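
For anyone who finds the sum easier to read as code, here is a minimal sketch in Python. It truncates the infinite sum to a finite list of rewards, and the function name `discounted_return` and the example numbers are illustrative assumptions, not something taken from the article.

```python
def discounted_return(rewards, gamma):
    """Return the sum over t of gamma**t * rewards[t], with t starting at 0."""
    total = 0.0
    # Accumulate backwards so each step computes R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards: 1 + 0.5 * 1 + 0.25 * 1
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With gamma below 1, later rewards count for less, which is what keeps the infinite sum finite when rewards are bounded.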

Policies
What exactly is a policy? The Sutton-Barto book is very vague on this point, and so is this article. In both cases the word is used without much explanation.

According to both the book and the article, a policy is a mapping from states to action probabilities. Fine. But this is not elaborated upon. What does a policy look like? I infer that it must be a table (2-D array), indexed by state and action, and containing probabilities, say pij for the i-th state and j-th action, each pij being a transition probability for the MDP. If so, what is its relation to the values derived from rewards? I.e. where exactly do the probabilities pij come from? How does one generate a policy table starting from values?

Sorry if I appear stupid, but I've been studying the book and I find it very difficult to comprehend, even though the maths is very simple (almost too simple). Or maybe it's in there somewhere but I've missed it?

--84.9.83.127 09:36, 18 November 2006 (UTC)


 * A policy is indeed a mapping from states to action probabilities, usually written π. So we could write π:S×A→[0,1], saying that π gives a probability of taking a given action a in state s. It doesn't have to be a table, it is just a function. If S and A are discrete then it can be easily written as a table, but if either is continuous then another form is needed. For instance, if S is the interval [0,10], we can set a number of radial basis functions over that interval (say, 11 of them, one at 0, one at 1, one at 2, etc.). Number them r0, ... r10. Now our policy is a function π:r0×...×r10×A→[0,1], which we can no longer write as a table.
 * The relation of the policy to values depends on the particular solution being used for the RL problem. In an actor-critic architecture, the policy is the set of state-action values along with a function for selecting an action (softmax, for instance, or just choosing the action with the highest value) and the state-action values are updated according to state values and the error signal. In a Q-learning agent, the policy and the values are essentially the same. Well, more correctly the policy is a function of the values given by the action selection mechanism.
 * For the most part, when you're just learning reinforcement learning theory, the use of policies may not be particularly clear. At least, in my own case, I didn't understand the focus on policies until I read Sutton, Precup, and Singh (1999) on options, at which point policies became crystal clear.
 * Hope that answers your question. digfarenough (talk) 19:25, 4 March 2007 (UTC)
 * Thanks. But your reply raises more questions for me, which I need to try and find answers to! --84.9.75.142 22:41, 16 March 2007 (UTC) (formerly 84.9.83.127)
 * Feel free to ask further questions on my talk page. I'm certainly no expert on reinforcement learning, but I've written one paper on it and have written a large number of simulations of RL-related things, so I at least know the basics. digfarenough (talk) 01:09, 17 March 2007 (UTC)
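
To make the reply above more concrete, here is a rough Python sketch of a policy as a table derived from action values, plus a version for a continuous state using 11 radial basis functions on [0,10]. All names (`softmax_policy`, `greedy_policy`, `rbf_features`, `continuous_policy`), the Gaussian form and width of the basis functions, and every number are assumptions made for illustration; this is one possible reading, not the definition used by the book or the article.

```python
import math
import random


def softmax_policy(q_values, temperature=1.0):
    """Turn one state's action values into action probabilities."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_policy(q_values):
    """Put probability 1 on the highest-valued action (ties go to the first)."""
    best = q_values.index(max(q_values))
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def rbf_features(s, centers=range(11), width=1.0):
    """Activations of Gaussian bumps centered at 0, 1, ..., 10 (assumed form)."""
    return [math.exp(-((s - c) ** 2) / (2.0 * width ** 2)) for c in centers]


def continuous_policy(s, weights):
    """Action probabilities for a continuous state s.

    weights[a] holds one (hypothetical) weight per basis function for action a.
    """
    features = rbf_features(s)
    preferences = [sum(w * f for w, f in zip(w_a, features)) for w_a in weights]
    return softmax_policy(preferences)


# Hypothetical action values for 2 states and 3 actions.
q_table = [[1.0, 2.0, 0.5],
           [0.0, 0.0, 0.0]]

# Discrete case: pi[s][a] is the probability of taking action a in state s.
pi = [softmax_policy(row) for row in q_table]

# Sampling an action in state 0 according to the policy.
action = random.choices(range(3), weights=pi[0])[0]
```

In the discrete case `pi` really is a table indexed by state and action. In the continuous case there is no table, only a function from the basis-function activations to action probabilities.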

I hope the new version explains what a policy might mean. In fact, it has multiple meanings and is used somewhat inconsistently in the literature. Szepi (talk) 03:11, 7 September 2010 (UTC)

merge with Q learning
There is a short article on Q learning which could be merged with reinforcement learning. Kpmiyapuram 14:23, 24 April 2007 (UTC)


 * I'd offer that Q Learning be expanded instead. In Q Learning's "See Also" there's Watkins' thesis, which I faintly remember is where Q Learning was introduced; but there's no mention of Watkins or any other researcher in the article. Additionally, Sutton's RL book is listed, which would be a great source to mine for further detail on history and application. --59.167.203.115 (talk) 01:17, 11 January 2008 (UTC)


 * I'd back Q-learning being expanded instead, with a summary in RL. As Q-learning is an active area of research it will grow over time, so it would be short-sighted to merge them - especially as they are already separate. At the start of my research it would have been SO helpful to know what was applicable to RL generally, and what was Q-Learning. --217.37.215.53 (talk) 10:05, 6 March 2008 (UTC)

algorithms/concepts not mentioned

 * active (policy improvement) vs passive (policy evaluation)
 * Adaptive Dynamic Programming (ADP) —Preceding unsigned comment added by 132.177.27.1 (talk) 17:23, 1 April 2008 (UTC)

Szepi (talk) 03:20, 7 September 2010 (UTC)
 * Policy improvement and evaluation are included now. However, these methods are rarely if ever called active/passive. The problems addressed by these methods are control learning and prediction learning. These could be included.
 * ADP refers to approximate dynamic programming, as far as I know. I have added the term to the article. Thanks for the suggestions.
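
For readers of this thread, a small Python sketch of the distinction: evaluation (prediction) estimates the values of a fixed policy, and improvement (control) makes the policy greedy with respect to those values. The two-state, two-action deterministic MDP, the names `evaluate` and `improve`, and all numbers are hypothetical and chosen only to keep the example short.

```python
# Hypothetical deterministic MDP: next_state[s][a] and reward[s][a].
next_state = [[0, 1], [0, 1]]
reward = [[0.0, 1.0], [0.0, 2.0]]
gamma = 0.9


def evaluate(policy, sweeps=500):
    """Prediction: estimate the state values of a fixed deterministic policy."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        # Synchronous sweep: every new value is computed from the old estimates.
        v = [reward[s][policy[s]] + gamma * v[next_state[s][policy[s]]]
             for s in range(2)]
    return v


def improve(v):
    """Control step: act greedily with respect to the current value estimates."""
    return [max(range(2), key=lambda a: reward[s][a] + gamma * v[next_state[s][a]])
            for s in range(2)]


policy = [0, 0]
for _ in range(5):  # policy iteration: alternate the two steps
    values = evaluate(policy)
    policy = improve(values)
```

Alternating the two steps is policy iteration. In this toy problem the policy settles on action 1 in both states after the first improvement step.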

Economics?
Where's all the stuff about learning in games? It would be great if someone could incorporate this. Jeremy Tobacman 23:40, 1 August 2007 (UTC)
 * It's certainly relevant, but you may have to add it yourself if you're familiar with the subject. I've come across that aspect a few times but never really looked into it, though I have seen quite a few papers on interacting multiagent systems from the game and economic perspectives (always, I think, the agents were working against each other to try to maximize profit or win the game, etc.). So add what you know, and others may be able to clean up any incorrect claims. digfarenough (talk) 16:31, 2 August 2007 (UTC)

Psychology
This article starts with a reference to 'Reinforcement learning' in psychology. Isn't there an article about that? --Rinconsoleao 13:43, 27 September 2007 (UTC)
 * Found it... --Rinconsoleao 13:45, 27 September 2007 (UTC)

Literature
I feel the literature referenced by Csaba Szepesvári was a useful addition and perhaps should not have been removed. Even though he referenced a book written by himself, he is a well-known and respected researcher in reinforcement learning, and this book is a useful overview of the field. I do not know of many good recent alternatives, so I would favor reverting MrOllie's revision. However, rather than immediately doing so, I thought it might be better to start a discussion.


 * What literature would be indispensable? (In my opinion, in any case the books by Sutton & Barto and by Bertsekas & Tsitsiklis, although most of the other referenced work at present also looks fine.)
 * What literature might be removed? (For instance, I haven't read the latest addition by Tokic, is this a relevant enough paper to include?)
 * Is there any important work missing? (As mentioned, I would favor the return of a reference to Csaba Szepesvári's book.) —Preceding unsigned comment added by 192.16.201.233 (talk) 12:04, 20 September 2010 (UTC)

Attention needed

 * Is there any difference between the "inverse" and "apprenticeship" learning? From the descriptions, they appear to be basically the same.
 * Refs - needs inline refs
 * Check content for missing statements
 * Assess on B scale
 * Broken link: A Short Introduction To Some Reinforcement Learning Algorithms — Preceding unsigned comment added by 192.76.175.3 (talk) 01:11, 19 March 2016 (UTC)

Chaosdruid (talk) 05:03, 6 March 2011 (UTC)

small and large mdps
'The theory of small mdps is [..] mature; [..] the theory of large mdps needs more work.'

What does that even mean? Theory is theory; if you understand an MDP with 10 states, then you understand one with ten million states. Although standard algorithms may run too slowly, I can't see the conceptual difference between ten and ten million as far as theory is concerned.

Does the author mean either: a) small equals finite and large equals countably or uncountably infinite, or b) approximation methods (in themselves only useful when direct methods fail) are not as well understood?

— Preceding unsigned comment added by 157.193.140.25 (talk) 09:21, 26 August 2011 (UTC)

I only use small in the context of finite MDPs. "Theory of small, finite MDPs" means theoretical results concerning algorithms whose complexity scales at least linearly with the size of the state-action space. I think this is intuitive, but I would welcome any alternative suggestions. I realize this could be misunderstood (someone might think that small means 10s or 100s, though I did not think this would be likely to happen).

Szepi (talk) 15:23, 16 September 2011 (UTC)

category needed
Can someone make a sub-category for machine learning maybe? --77.4.90.71 (talk) 16:35, 1 November 2011 (UTC)

The whole article is a subcategory of machine learning. Perhaps you seek practical applications or tools? Or I'm just not sure what you mean. Krehel (talk) 00:11, 24 September 2018 (UTC)

The comparison of algorithms table
The table comparing algorithms is just plain wrong:
 * Monte Carlo is not an algorithm at all, but a family of algorithms for all kinds of problems (including RL). For RL many different Monte Carlo algorithms exist. The description is even more misleading: "Every visit to Monte Carlo". Every-visit is only one value of one option in Monte Carlo RL methods (the other option being first-visit). Not picking every-visit doesn't make the method less Monte Carlo; it just changes the update operator for the value function.
 * The Policy column is not actually about the type of policy, but about how the policy is optimised (on- or off-policy).
 * The Operator column does not contain operators: Q-value and Advantage are types of value functions. The operators used on these value functions are what defines the method. The book by Sutton and Barto defines these operators using backup diagrams. Note that Monte-Carlo methods typically also maintain a value function (such as Q-values), they are just updated differently from methods such as Q-learning, which use Bellman backups rather than Monte Carlo estimators of the returns.
 * A number of relevant properties is omitted
 * The table seems heavily biased towards recent neural-network based methods (only the simpler classical methods are represented, giving for example the wrong impression that no classical methods existed that could handle continuous state or action spaces).
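
To illustrate the first and third points, here is a minimal Python sketch under simplifying assumptions: state values only for the Monte Carlo part, a single episode, and a plain nested list as the Q-table. The names `mc_returns` and `q_learning_update`, the episode format, and the numbers are hypothetical. The first function shows that first-visit and every-visit differ only in which return samples are kept, and the second shows a Bellman-style backup that bootstraps from the next state's values instead of using a sampled return.

```python
def mc_returns(episode, gamma, first_visit):
    """Collect Monte Carlo return samples per state from one episode.

    episode is a list of (state, reward) pairs, where reward is the reward
    received after leaving that state.
    """
    g = 0.0
    returns = []
    # Walk backwards so g is the discounted return from each time step.
    for state, r in reversed(episode):
        g = r + gamma * g
        returns.append((state, g))
    returns.reverse()

    samples = {}
    seen = set()
    for state, g in returns:
        if first_visit and state in seen:
            continue  # first-visit: ignore later occurrences of the state
        seen.add(state)
        samples.setdefault(state, []).append(g)
    return samples


def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """One Q-learning backup: bootstrap from the best next-state value."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


# Hypothetical episode that visits state "A" twice.
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
first = mc_returns(episode, gamma=1.0, first_visit=True)
every = mc_returns(episode, gamma=1.0, first_visit=False)

# Hypothetical Q-table with 2 states and 2 actions.
q = [[0.0, 0.0], [1.0, 3.0]]
q_learning_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
```

Averaging the samples in `first` or `every` gives the corresponding Monte Carlo value estimate. Either way the method stays Monte Carlo, because the target is still a sampled return.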

I'm not quite sure how to reorganise this table without it becoming monstrous in size. In its present state it is however highly confusing and misleading. I'd say it would be better to remove it than to keep it as it currently is. LordDamorcro (talk) 18:39, 5 July 2021 (UTC)

A Commons file used on this page or its Wikidata item has been nominated for deletion
The following Wikimedia Commons file used on this page or its Wikidata item has been nominated for deletion:
 * DNC training recall task.gif
Participate in the deletion discussion. —Community Tech bot (talk) 20:25, 11 September 2021 (UTC)

General level of accessibility.
Wikipedia is not only for informing people who are already expert in the subject matter, nor is it a forum for authors to demonstrate their knowledge or show off their technical grasp to others in their field. Articles in Wikipedia are supposed to EXPLAIN things. This means breaking down jargon. It means setting out topics in a manner that makes them approachable for people not already well read in the field.

Too many Wikipedia articles, including this one, are written by people incapable of understanding this extremely obvious perspective. The purpose is not to compose some form of canonical description of the field in the most compact, concise or dense language possible. It is the opposite. Many authors here are academics, but it seems clear many would struggle to teach a class anything at all. 49.180.205.46 (talk) 10:20, 9 September 2022 (UTC)

research project
related literature about effect of academic pressure 209.35.172.23 (talk) 07:02, 20 April 2023 (UTC)

A section on Applications
It would be good to have a section on the applications of RL on this page, e.g. robotics, self-driving cars, gaming (AlphaGo), etc. I haven't done any major writing on the wiki and am not sure if I can just add one. Amitkannan (talk) 07:16, 26 September 2023 (UTC)


 * Please don't. Applications sections are spam magnets and generally fill up with advertising and self promotion in short order. MrOllie (talk) 12:24, 26 September 2023 (UTC)