
A Partially Observable Markov Decision Process (POMDP) is a generalization of a Markov Decision Process. A POMDP models an agent's decision process in which the system dynamics are assumed to be determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations, the observation probabilities, and the underlying MDP.

A solution to a POMDP must specify an action for each possible belief over the unobservable states. A solution is optimal if it maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. Such a rule for selecting optimal actions is known as an optimal policy of the agent interacting with its environment.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the Operations Research community. In addition, much work on POMDPs has been done in the Artificial Intelligence and Automated Planning communities.

Formal Definition
A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a tuple $$(S,A,O,T,\Omega,R)$$, where
 * $$S$$ is a set of (unobservable) states,
 * $$A$$ is a set of actions,
 * $$O$$ is a set of observations,
 * $$T$$ is a set of conditional probabilities of transitioning between states,
 * $$\Omega$$ is a set of conditional probabilities of observation,
 * $$R: A \times S \times S \to \mathbb{R}$$ is the reward function.

Decision Making Sequence
At each time period, the environment is in some state $$s \in S$$. The agent does not know the current state with certainty; instead, it maintains a belief $$b$$, where $$b(s)$$ is the probability that the environment is in state $$s$$.

The agent takes an action $$a \in A$$, which causes the environment to transition to state $$s'$$ with probability $$T(s'\mid s,a)$$. The agent then receives an observation $$o \in O$$ with probability $$\Omega(o \mid s',a)$$ and updates its belief $$b$$ based on that observation. The agent receives a reward with expected value $$R(a,s,s')$$, and the process repeats.

It is instructive to compare the above definition with the definition of a Markov Decision Process. In an MDP, the state $$ s $$ is observed directly. Thus, an MDP is the special case of a POMDP in which $$ S = O $$, each observation reveals the new state ($$ o = s' $$), and consequently the belief $$ b(s) $$ and the observation probability $$ \Omega(o \mid s',a) $$ equal 1 for the true states $$ s $$ and $$ s' $$, respectively, and zero for all other states.

Belief update
An agent updates its belief upon taking action $$a$$ and observing $$o$$. Since the state is Markovian, maintaining a belief over the states requires only the previous belief, the action taken, and the current observation. Below we describe how this belief update is computed.

For each state $$ s' $$, the new belief $$ b'(s') $$ is computed using Bayes' rule. Given action $$ a $$, $$ \sum_{s\in S}T(s'\mid s,a)\,b(s) $$ is the prior probability of reaching state $$ s' $$; observation $$ o $$ is then seen with probability $$ \Omega(o\mid s',a) $$, and $$ b'(s') $$ is the resulting posterior probability of being in state $$ s' $$. Applying Bayes' rule gives

$$ b'(s') = \frac{ \Omega(o \mid s',a) \sum_{s \in S} T(s' \mid s,a)\, b(s) }{ \sum_{s'' \in S} \Omega(o \mid s'',a) \sum_{s \in S} T(s'' \mid s,a)\, b(s) }. $$
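The update above can be sketched in a few lines of NumPy. This is purely illustrative and not part of the formalism; the array layout ($$T$$ indexed as `T[a, s, s']`, $$\Omega$$ as `Omega[a, s', o]`) is an assumption of the sketch.

```python
import numpy as np

def belief_update(b, a, o, T, Omega):
    """Bayes' rule belief update for a discrete POMDP.

    b     : length-|S| belief vector
    a, o  : action and observation indices
    T     : T[a, s, s'] = Pr(s' | s, a)
    Omega : Omega[a, s', o] = Pr(o | s', a)
    """
    prior = T[a].T @ b                  # prior[s'] = sum_s T(s'|s,a) b(s)
    unnormalized = Omega[a, :, o] * prior
    norm = unnormalized.sum()           # Pr(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / norm
```

For instance, with two states, identity transitions, and an observation that matches the true state 85% of the time, a uniform belief updates to (0.85, 0.15) after one observation suggesting the first state.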

Reward Function
Since the state is not directly observable, the reward $$ R(a,s,s') $$ cannot be computed directly. Instead, the reward is computed as an expectation over possible states, weighted by the belief probabilities:

$$ r(b,a) = \sum_{s \in S} \sum_{s' \in S} b(s) T(s' \mid s,a) R(a,s,s'). $$
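This expectation translates directly into code. The following sketch assumes the same illustrative `T[a, s, s']` and `R[a, s, s']` array layout as above; it is an example, not a prescribed implementation.

```python
import numpy as np

def belief_reward(b, a, T, R):
    """Expected immediate reward r(b,a) = sum_{s,s'} b(s) T(s'|s,a) R(a,s,s').

    b : length-|S| belief vector
    a : action index
    T : T[a, s, s'] = Pr(s' | s, a)
    R : R[a, s, s'] = reward for transition (s, a, s')
    """
    return float(np.einsum("s,st,st->", b, T[a], R[a]))
```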

Representing POMDPs as Continuous State MDPs
In an MDP, an optimal policy $$\pi$$ is Markovian with respect to the core state. For a POMDP, a policy that is Markovian on the core states would be unusable, since the core state is not known by the agent. A policy that is Markovian with respect to observations is well-defined, but can be arbitrarily suboptimal. Fortunately, there is no need to keep track of the entire history: the belief state is a sufficient statistic for the history of actions and observations.

In a POMDP, a policy maps a belief state into the action space. The optimal policy can be understood as the solution of a continuous-state so-called belief Markov Decision Process (MDP). It is defined as a tuple $$(B,A,\tau,r)$$ where
 * $$B$$ is the set of belief states over the POMDP states,
 * $$A$$ is the same finite set of actions as for the original POMDP,
 * $$\tau$$ is the belief state transition function,
 * $$r: B \times A \to \mathbb{R}$$ is the reward function on belief states defined in the previous section.

This MDP is defined over a continuous state space: the space of all probability mass (or density) functions over $$S$$. When $$S$$ is finite with $$n$$ states, each belief is a probability mass function over $$S$$, so the belief space is the $$(n-1)$$-dimensional standard simplex $$\Delta^{n-1}$$, where

$$\Delta^n = \left\{(b_0,\ldots,b_n) \in \mathbb{R}^{n+1} | \sum_{i=0}^n b_i = 1, b_i \geq 0, i=0,\ldots,n \right\}$$
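As a small illustration of this constraint, checking that a vector is a valid belief (i.e., lies in the standard simplex) amounts to verifying nonnegativity and unit sum; the tolerance parameter below is an implementation detail of this sketch.

```python
import numpy as np

def in_belief_simplex(b, tol=1e-9):
    """Return True if b is a valid belief: nonnegative entries summing to 1."""
    b = np.asarray(b, dtype=float)
    return bool(np.all(b >= -tol) and abs(b.sum() - 1.0) <= tol)
```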

Policy and Value Function
The agent's policy $$\pi$$ specifies an action $$a=\pi(b)$$ for any belief $$b$$. Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon. When $$R$$ defines a cost, the objective becomes the minimization of the expected cost.

The expected reward for policy $$\pi$$ starting from belief $$b$$ is defined as

$$ J^\pi(b) = E\Bigl[ \sum_{t=0}^\infty \gamma^t r(b_t,a_t) \mid b_0 = b, \pi \Bigr], $$ where $$0 \le \gamma < 1$$ is the discount factor. The optimal policy $$\pi^*$$ is obtained by optimizing the long-term reward:

$$ \pi^* = \underset{\pi}{\operatorname{argmax}}\; J^\pi(b_0), $$ where $$b_0$$ is the initial belief.

The optimal policy, denoted $$\pi^*$$, yields the highest expected reward for each belief state, compactly represented by the optimal value function, denoted $$V^*$$. This value function is a solution to the Bellman optimality equation:

$$ V^*(b) = \max_{a\in A}\Bigl[ r(b,a) + \gamma\sum_{o\in O} \Omega(o\mid b,a)\, V^*(\tau(b,a,o)) \Bigr], $$ where $$\tau(b,a,o)$$ is the updated belief after taking action $$a$$ and observing $$o$$, and $$\Omega(o\mid b,a) = \sum_{s' \in S} \Omega(o \mid s',a) \sum_{s \in S} T(s' \mid s,a)\, b(s)$$ is the probability of observing $$o$$ from belief $$b$$.

Solving POMDPs
Unlike for a discrete-state MDP, the value function cannot be stored in a look-up table with one entry per state, since the belief space is continuous. In general, the value function of a continuous-state MDP need not have any specific structure. For finite-horizon POMDPs, however, the optimal value function is piecewise-linear and convex, and it can be represented as a finite set of vectors. In the infinite-horizon formulation, a finite vector set can approximate the convex $$V^*$$ arbitrarily closely. Value iteration applies a dynamic programming update to gradually improve the value until convergence to an $$\epsilon$$-optimal value function, and this update preserves piecewise linearity and convexity. By improving the value, the policy is implicitly improved; policy iteration instead represents and improves the policy explicitly.
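The piecewise-linear convex representation stores the value function as a finite set of vectors (often called $$\alpha$$-vectors), with $$V(b) = \max_i \alpha_i \cdot b$$. A minimal sketch of evaluating such a representation (the conventional association of one action per vector is omitted here):

```python
import numpy as np

def pwlc_value(b, alpha_vectors):
    """Evaluate a piecewise-linear convex value function at belief b.

    alpha_vectors : array of shape (k, |S|); V(b) = max_i alpha_i . b
    """
    return float(np.max(alpha_vectors @ b))
```

Each $$\alpha$$-vector is the value, at every state, of committing to some conditional plan; taking the maximum over vectors yields a function that is linear on pieces of the simplex and convex overall.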

Approximate POMDP solutions
In practice, POMDPs are often computationally intractable to solve exactly, so researchers have developed methods that approximate solutions for POMDPs.

Grid-based algorithms comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action for other belief states that are encountered but are not in the set of grid points. More recent work makes use of sampling techniques, generalization techniques, and exploitation of problem structure. For example, the Symbolic Perseus solver has been used to approximate an application with 207,360 states, 198 observations, and 25 actions. Approximate methods often take the problem structure into account when sampling from the belief space: in a given problem, some belief states may be unlikely ever to be reached, and for these states an accurate estimate of the value function contributes little to defining an optimal policy. Methods such as point-based value iteration therefore sample a smaller set of likely reachable belief points and compute accurate value estimates for those relevant belief states. Dimensionality reduction using PCA has also been explored.
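As a concrete toy illustration of the grid-based idea, the following sketch runs value iteration over a fixed set of belief points, using nearest-neighbor lookup in place of proper interpolation. The array layouts and the nearest-neighbor shortcut are simplifying assumptions of this example, not features of any particular solver.

```python
import numpy as np

def grid_value_iteration(grid, T, Omega, R, gamma=0.95, iters=100):
    """Approximate POMDP value iteration on a fixed set of belief points.

    grid  : (m, |S|) array of belief vectors
    T     : T[a, s, s'],  Omega : Omega[a, s', o],  R : R[a, s, s']
    Values at off-grid successor beliefs are read from the nearest
    grid point (a crude stand-in for interpolation).
    """
    nA = T.shape[0]
    nO = Omega.shape[2]
    V = np.zeros(len(grid))
    for _ in range(iters):
        V_new = np.empty_like(V)
        for i, b in enumerate(grid):
            q = np.full(nA, -np.inf)
            for a in range(nA):
                prior = T[a].T @ b                       # predicted next-state dist.
                total = float(np.einsum("s,st,st->", b, T[a], R[a]))  # r(b, a)
                for o in range(nO):
                    p_o = float(Omega[a, :, o] @ prior)  # Pr(o | b, a)
                    if p_o == 0.0:
                        continue
                    b_next = Omega[a, :, o] * prior / p_o
                    # Nearest grid point stands in for V(b_next).
                    j = int(np.argmin(np.linalg.norm(grid - b_next, axis=1)))
                    total += gamma * p_o * V[j]
                q[a] = total
            V_new[i] = q.max()
        V = V_new
    return V
```

With fully observable dynamics this reduces to ordinary MDP value iteration on the corner beliefs, which provides a simple sanity check.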

POMDP Applications
POMDPs model many kinds of real-world problems. Below is a selection of applications and references.


 * Assistance for persons with dementia.
 * Cancer screening.
 * Conservation of the critically endangered and difficult-to-detect Sumatran tiger.
 * Inventory management.
 * Manipulating multi-fingered robotic hands for grasping.
 * Robot navigation.
 * Selecting relays to transmit wireless data.
 * Spoken dialog systems, including speech recognition and response.

POMDP Software
Below is a listing of some POMDP software. All are free to download, although licenses on use vary among them. Many implement multiple types of algorithms, and the methods implemented vary significantly from one package to another.