Talk:Markov decision process

Notational inconsistency
To adhere to the notation that the article has established, I believe the formula
 * $$ \pi(s) := \operatorname{argmax}_a \left\{ \sum_{s'} P(s' \mid s, a) \left( R(s'\mid s,a) + \gamma V(s') \right) \right\} $$

should be rewritten as
 * $$ \pi(s) := \operatorname{argmax}_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V(s') \right) \right\} $$

Have I got that wrong? If no-one comments to the contrary, I will make the change. 85.211.24.141 (talk) 14:12, 6 June 2020 (UTC)
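For concreteness, the greedy-policy formula under discussion can be sketched numerically in the article's $$P_a(s,s')$$ / $$R_a(s,s')$$ notation. The toy MDP below (3 states, 2 actions, constant rewards, an arbitrary value estimate) is invented purely for illustration:

```python
import numpy as np

# Toy 3-state, 2-action MDP, invented for illustration.
# P[a, s, s2] = P_a(s, s'), R[a, s, s2] = R_a(s, s').
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]],  # action 1
])
R = np.ones_like(P)             # constant reward of 1 per transition
gamma = 0.9                     # discount factor
V = np.array([0.0, 1.0, 2.0])   # some current value estimate

# pi(s) := argmax_a sum_s' P_a(s, s') * (R_a(s, s') + gamma * V(s'))
Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, s]
pi = Q.argmax(axis=0)                   # greedy action for each state
print(pi)                               # prints [1 0 0]
```

Either notation computes the same quantity; the question in this thread is purely which set of symbols the article has already established.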

Major rewrite and reorganization suggested
As someone who has worked on Optimal Control applications and studied Reinforcement Learning, Optimal Control and hence Markov Decision Processes, I think this article is not well written even from an introductory viewpoint. It feels incomplete and quite random at times. I suggest a major rewrite and reorganization of the material, sticking more closely to: (1) Reinforcement Learning: An Introduction, second edition by Richard S. Sutton and Andrew G. Barto, or (2) Dynamic Programming and Optimal Control, Vol. I, 4th Edition by Dimitri Bertsekas. — Preceding unsigned comment added by Okankoc (talk • contribs) 14:36, 2 February 2020 (UTC)


 * I disagree. As a newcomer, I found it a very high-quality article. I've fully understood the motivation and definition of MDPs and the idea of using dynamic programming to find the optimal assignment of actions to states. 85.211.24.141 (talk) 14:05, 6 June 2020 (UTC)


 * Well it's much better than the reinforcement learning page. While I'm not a complete newcomer, it seems decent though I think there are some mistakes that need to be fixed.  — Preceding unsigned comment added by 76.116.14.33 (talk) 14:36, 19 July 2020 (UTC)

Untitled
It would also be nice to have a section on Semi-Markov Decision Processes. (This extension to MDP is particularly important for intrinsically motivated RL and temporal abstraction in RL.) Npdoty 01:33, 24 April 2006 (UTC)

It would be nice to hear about partially observable MDPs as well! --Michael Stone 22:59, 23 May 2005 (UTC)


 * Not to mention a link to Markov chain! I've been meaning to expand this article, but I'm trying to decide how best to do it.  Anything that can go in Markov chain should go there, and only stuff specific to decision processes should go here, but there will need to be some frequent cross-reference.  I think eventually POMDPs should get their own article, as well, which should similarly avoid duplicating material in Hidden Markov model. --Delirium 06:15, 13 November 2005 (UTC)

What is γ
The constant γ is used but never defined. What is it?
 * Well, at the first usage it's described as "discounting factor γ (usually just under 1)", which pretty much defines it - do you think it needs to be more prominent than that?
 * This is now fixed in the article
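To spell out what "discount factor" means in this thread: γ weights future rewards geometrically, and γ < 1 is what keeps infinite-horizon sums finite. A minimal sketch (the reward stream is invented for illustration):

```python
gamma = 0.9   # discount factor, "usually just under 1" as described in the article

# Discounted return of a reward stream r_0, r_1, ...: G = sum_t gamma^t * r_t.
rewards = [1.0, 1.0, 1.0, 1.0]   # toy reward stream
G = sum(gamma ** t * r for t, r in enumerate(rewards))
print(round(G, 3))               # 1 + 0.9 + 0.81 + 0.729 = 3.439

# For an infinite stream of constant reward r, the geometric series gives
# G = r / (1 - gamma); gamma < 1 is what makes the sum converge.
G_inf = 1.0 / (1 - gamma)
print(round(G_inf, 1))           # 10.0
```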

Invented by Howard?
"They were invented by Ronald A. Howard in 1960"

Is that right? Is "invent" the proper term? Also, weren't there works on MDPs (even if with other names) before 1960?


 * Stochastic games were introduced already in Shapley (1953). Since they are more general than MDPs, I would be surprised if MDPs were not used even earlier than that.

—The preceding unsigned comment was added by Svensandberg (talk • contribs) 13:31, 9 January 2007 (UTC).


 * "Invent" may not be the right word. However, Howard's book was very important.  In E. V. Denardo's book "Dynamic Programming"  he does mention Shapley (1953) but adds "a lovely book by Howard (1960) highlighted policy iteration and aroused great interest in this model".  So that book set off a lot of subsequent research.  And it is still a classic.  Feel free to replace the word "invent" with another more appropriate... Encyclops 22:58, 9 January 2007 (UTC)


 * I rewrote that a bit and added a reference to Bellman 1957 (which I found in Howard's book). OK? Svensandberg 16:31, 17 January 2007 (UTC)

What is R(s)?
It's probably the immediate reward received after transitioning from state s, but this is not explained in the article currently. There's only R_a(s,s').

To keep the equations simple, you could change the definition to use R(s) and refer to the Minor Extensions section. —Preceding unsigned comment added by 80.221.23.134 (talk) 11:47, 10 November 2008 (UTC)


 * Yeah, $$R(s)$$ is the immediate reward for visiting state $$s$$, left over from an earlier version of the article that used $$R(s)$$ and only mentioned $$R_a(s,s')$$ as a variant. Fixed. Sabik (talk) 18:46, 22 August 2010 (UTC)
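The relationship between the per-transition convention $$R_a(s,s')$$ and a reward that does not depend on the successor state is a one-line expectation over $$s'$$. A sketch, with a 2-state, 2-action example invented for illustration:

```python
import numpy as np

# Invented 2-state, 2-action example.
# P[a, s, s2] = P_a(s, s'), R[a, s, s2] = R_a(s, s').
P = np.array([
    [[0.7, 0.3], [0.4, 0.6]],   # action 0
    [[0.2, 0.8], [0.9, 0.1]],   # action 1
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],   # action 0
    [[0.5, 0.5], [1.0, 1.0]],   # action 1
])

# Expected immediate reward for taking action a in state s:
# R(s, a) = sum_s' P_a(s, s') * R_a(s, s')  -- averaging out the successor state
R_sa = (P * R).sum(axis=2)      # R_sa[a, s]
print(R_sa)
```

The state-only form $$R(s)$$ the earlier article version used is the further special case where this expectation does not depend on the action either.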

Finite state spaces?
Since P is not treated as a density, I assume A is a finite set, but this is not mentioned. Also, is S a finite set? Note that in Partially observable Markov decision process, both sets are said to be finite. Would be helpful to clarify this. I'm going to make the change, so please correct me if I'm wrong. --Rinconsoleao (talk) 15:12, 26 May 2009 (UTC)

Infinite state spaces
The technique for the discounted Markov decision process is valid for an infinite (denumerable) state space and other more general spaces. Shuroo (talk) 12:56, 5 June 2009 (UTC)


 * OK, I have added a clarifying note to that effect, but I have also added a 'citation needed' tag. Can you add a good reference book on this case? --Rinconsoleao (talk) 11:08, 6 June 2009 (UTC)

Clarifying question (especially for mathematicians)
A lot of economists apply dynamic programming to stochastic models, without using the terminology 'Markov decision process'. Which of the following, if any, is true?


 * (1) 'Markov decision process' is a synonym for 'stochastic dynamic programming'


 * (2) 'Markov decision processes' is a subfield of 'stochastic dynamic programming'


 * (3) 'Stochastic dynamic programming' is a subfield of 'Markov decision processes'


 * (4) None of the above is precisely true

Answers, comments, debate, references appreciated. --Rinconsoleao (talk) 11:14, 6 June 2009 (UTC)

I guess MDP is used for a discrete (finite or denumerable) state space. The dynamic programming technique can also be used for a continuous state space (e.g. Euclidean space) if the Markov property holds. However, I am not aware of the broadest sufficient condition for its validity. In any case, it seems to me that (2) would be correct if you write 'infinite horizon dynamic programming'. Shuroo (talk) 18:38, 6 June 2009 (UTC)


 * I vote for (2). MDP is SDP restricted to discrete time and a discrete state space. Encyclops (talk) 21:48, 6 June 2009 (UTC)


 * In my own reading, MDPs are not limited to discrete time or a discrete state space. Indeed, in Sutton & Barto's book (Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998), it is clearly said that the discrete description is chosen only for convenience, since it avoids the use of probability densities and integrals. So IMHO (3) is the right choice. G.Dupont (talk) 21:00, 10 August 2010 (UTC)
 * Wouldn't that mean (1) is the correct answer? Rinconsoleao (talk) 09:29, 11 August 2010 (UTC)

My answer is (4). They are obviously not the same thing, and I don't think the notion of "subfield" makes any sense. (How can you ever say "A" is a subfield of "B"? Once you identify A as interesting, even if you came from B, you already create an emphasis and aspect which distinguishes it from B.) 85.211.24.141 (talk) 14:16, 6 June 2020 (UTC)

Hamilton–Jacobi–Bellman does not belong to MDP
These equations lead to optimal control for a DETERMINISTIC state transition, not a STOCHASTIC one. So they belong to Deterministic Dynamic Programming, not here. Shuroo (talk) 08:27, 1 July 2012 (UTC)

Competitive Markov Decision Processes?
I was wondering if anyone would mind adding a section or page about competitive Markov Decision Processes? I do not know much about it, but I believe it is just like a MDP with multiple decision makers. — Preceding unsigned comment added by 150.135.222.234 (talk) 00:53, 22 March 2013 (UTC)

Partial observability
Is the claim that Burnetas and Katehakis' paper was a major advance in this area supportable? It may be a nice paper, but the rest of the paragraph does not make the impact clear and uses some unexplained jargon. Is this more of an advance than the Witness algorithm, for example (which isn't mentioned)? Jbrusey (talk) 15:20, 28 January 2016 (UTC)

Stationary
In one section the word stationary is used to describe a type of policy, but the word is never defined (and, indeed, stationary policies are probably out of scope).

On the other hand, the stationarity property of MDPs should probably be discussed somewhere. — Preceding unsigned comment added by Cokemonkey11 (talk • contribs) 12:19, 25 April 2016 (UTC)

Dr. Johansen's comment on this article
Dr. Johansen has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

"I have read the note and it sounds very convincing. I am not an expert in the decision theory, si I cannot recommend another reeviewer"

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Johansen has expertise on the topic of this article, since he has published relevant scholarly research:


 * Reference: Søren Johansen & Bent Nielsen, 2014. "Outlier detection algorithms for least squares time series regression," Economics Papers 2014-W04, Economics Group, Nuffield College, University of Oxford.

ExpertIdeasBot (talk) 16:24, 11 July 2016 (UTC)


 * Sounds like Dr Johansen could do with a spelling checker (I'd also advise him to concentrate on comments whose subject matter he knows something about). 85.211.24.141 (talk) 14:08, 6 June 2020 (UTC)

What is $$x$$?
What is $$x$$ in the central formula of Section "Problem"? — Preceding unsigned comment added by 176.207.10.78 (talk) 17:12, 16 November 2019 (UTC)

Policy iteration steps mislabeled
The value and policy steps/equations are not labeled, so it looks like the value step is step 1 and the policy step is step 2. I believe this is the normal order of things: first the algorithm updates the value function, then it computes the new policy. The problem is that the descriptions of the algorithms have them backwards. For example, the policy iteration algorithm talks about how repeating step 2 can be replaced by solving the linear equations, but the linear equations are the value equations, which come from step 1. It looks like this is a consistent error. — Preceding unsigned comment added by Mesterharm (talk • contribs) 14:58, 19 July 2020 (UTC)
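The ordering described in the comment above — step 1 evaluates the current policy (a linear solve), step 2 improves it greedily — can be sketched as follows. The randomized toy MDP is invented for illustration; this is a sketch of standard policy iteration, not the article's exact pseudocode:

```python
import numpy as np

# Toy MDP, invented for illustration: n states, m actions.
rng = np.random.default_rng(0)
n, m, gamma = 4, 2, 0.9
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)   # normalize rows into probabilities
R = rng.random((m, n, n))

pi = np.zeros(n, dtype=int)         # initial policy
for _ in range(100):                # policy iteration terminates in finitely many steps
    # Step 1: policy evaluation -- solve the linear system
    # V = r_pi + gamma * P_pi V for the current policy pi.
    P_pi = P[pi, np.arange(n)]                       # (n, n) transitions under pi
    r_pi = (P_pi * R[pi, np.arange(n)]).sum(axis=1)  # expected reward under pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

    # Step 2: policy improvement -- act greedily with respect to V.
    Q = (P * (R + gamma * V)).sum(axis=2)            # Q[a, s]
    new_pi = Q.argmax(axis=0)
    if np.array_equal(new_pi, pi):                   # policy stable: done
        break
    pi = new_pi
print(pi, V)
```

Note how the linear solve lives in step 1 (evaluation), which is exactly the commenter's point: text that says "repeating step 2 can be replaced by solving the linear equations" has the two steps swapped.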