User:Ob7/sandbox/Off-policy learning

In reinforcement learning, off-policy learning refers to the problem of learning about a policy while data is generated using a different policy. A typical case is policy evaluation, where the objective is to estimate the value function for a given stationary policy. In off-policy policy evaluation, the objective is then to estimate that function while a different policy is followed.