Talk:Partially observable Markov decision process

Untitled
This page lacks lots of information. A better description of algorithms, some nicer examples, and links to introductory texts would be really good.

It is also sort of biased: only one application is mentioned, when there is actually a lot of great work done with POMDPs. —Preceding unsigned comment added by 201.37.187.41 (talk) 09:47, 24 October 2007 (UTC)

Actually, you can solve POMDPs with billions of states. The problem is not how large the state space is, but how dense your transition function is (or, similarly, how large the A_s sets are). :-) I actually think people should be more careful when they claim to have solved "large" POMDPs. They usually mention the number of states, but the complexity of solving POMDPs is NOT exponential in the number of states. The number of actions and observations is much more of a problem: value iteration is $$O(|S| \cdot |A|^{|Z|^h})$$ for horizon $$h$$ if you consider a single set of actions. It will be similar if you define action sets per state.
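
To illustrate the point about actions and observations, here is a minimal sketch that counts the distinct h-step policy trees exact value iteration would have to consider in the worst case. It assumes the standard policy-tree view: one action per node and one branch per observation. The function names are made up for this example. The count does not depend on the number of states at all.

```python
def policy_tree_nodes(num_observations: int, horizon: int) -> int:
    """Nodes in an h-step policy tree: one root, then one child per observation per level."""
    return sum(num_observations ** level for level in range(horizon))


def num_policy_trees(num_actions: int, num_observations: int, horizon: int) -> int:
    """Each node is labeled with one action, so there are |A|^(number of nodes) trees."""
    return num_actions ** policy_tree_nodes(num_observations, horizon)


if __name__ == "__main__":
    # Hypothetical small problem: 3 actions, 2 observations.
    for h in range(1, 5):
        print(h, num_policy_trees(num_actions=3, num_observations=2, horizon=h))
```

The node count is (|Z|^h - 1)/(|Z| - 1), which is at most |Z|^h, so this agrees with the bound quoted above up to that simplification.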

rework
I am working on this page whenever I can. I believe it should include some information on the derivation of the belief-space MDP, and distinguish between exact and approximate solution techniques. I will add new sections and subsections when they're ready. I'll link to some new developments, such as continuous-state POMDPs, Bayes-adaptive POMDPs and POMDPs with imprecise parameters. Please let me know if you plan on working on this and let's do it together. Beniz (talk) 21:57, 12 April 2008 (UTC)

software
There are several POMDP-solving programs out there... I guess there should be either a more complete list of them in the article, or none.

Reward based on current or next state?
The end of the formal definition states that a reward is given equal to R(s, a), but this seems confusing to me -- should it be that the reward is based on the state being transitioned to, i.e. R(s', a)?


 * It can go either way. While I agree with you that it is more intuitive and often appropriate to define rewards as dependent on the next state, different formulations of POMDPs and MDPs use different semantics for the reward function. The most complete form is a reward function that depends on the prior state, the action taken, and the resulting next state: $$R(s, a, s')$$. However, expressing the problem with a reward function that depends only on the prior state and action ($$R(s, a)$$) is quite common, and in some sense it simplifies the value function equations you solve (it pulls the reward term out of the marginalization over next states). If your reward function depends on the next state, you can always re-express it in the $$R(s, a)$$ form by simply marginalizing over the possible outcome states: $$R(s,a) = \sum_{s'} T(s' | s, a) R(s, a, s')$$. Since you can always do this conversion, it's okay to use the $$R(s, a)$$ version in theoretical discussions. Jmacglashan (talk) 16:51, 13 January 2015 (UTC)
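
The conversion described above can be sketched in a few lines of Python. This is only an illustration: it assumes a small tabular model stored as nested lists, and the names `expected_reward`, `T` and `R3` are made up for the example rather than taken from the article or any library.

```python
def expected_reward(T, R3):
    """Collapse R(s, a, s') into R(s, a) by marginalizing over next states.

    T[s][a][s2]  holds T(s2 | s, a), the transition probability.
    R3[s][a][s2] holds R(s, a, s2), the next-state-dependent reward.
    Returns R2 with R2[s][a] = sum over s2 of T(s2 | s, a) * R(s, a, s2).
    """
    return [
        [
            sum(p * r for p, r in zip(T[s][a], R3[s][a]))
            for a in range(len(T[s]))
        ]
        for s in range(len(T))
    ]


# Hypothetical model with 2 states and 2 actions (numbers are illustrative only).
T = [
    [[0.5, 0.5], [0.25, 0.75]],  # transitions from state 0 under actions 0 and 1
    [[1.0, 0.0], [0.0, 1.0]],    # transitions from state 1 under actions 0 and 1
]
R3 = [
    [[0, 10], [4, 8]],
    [[-1, 100], [3, 2]],
]

R2 = expected_reward(T, R3)
print(R2)
```

The value function only ever uses the reward in expectation over the next state, which is why the two forms give the same optimal policy.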