Draft:Policy-Space Response Oracles

In multi-agent learning, reinforcement learning, and game theory, Policy-Space Response Oracles (PSRO) is a family of multi-agent learning algorithms for training agents in two-player, zero-sum, imperfect-information (partially observable), extensive-form games, using deep reinforcement learning as an approximate best-response operator. Knowledge of all players' payoffs is required (complete information). PSRO unifies, and is heavily influenced by, earlier algorithms such as Double Oracle (DO) and fictitious play (FP); under certain parameterizations it is equivalent to these predecessors, so it can be considered a framework of algorithms. PSRO is closely related to Empirical Game-Theoretic Analysis (EGTA).

The multi-agent problem setting involves agents learning to interact with others in a shared environment. PSRO works by iteratively training new policies against past opponent policies (so-called "self-play"). A key property of PSRO is that the resulting distribution over policies provably converges to a normal-form Nash equilibrium (NE) under certain parameterizations. For two-player, zero-sum games an NE cannot be exploited by any other policy, which makes it a particularly suitable solution concept in this setting.

Many games of interest are two-player and zero-sum ("purely competitive"). Notable projects such as AlphaZero and AlphaStar make use of this family of algorithms. For other classes of games, such as those with more than two players or general-sum payoffs, PSRO does not provably converge. Extensions such as JPSRO are more suitable there, but they target different solution concepts.

History
PSRO builds on a long line of earlier algorithms and analysis frameworks, including:

Double Oracle (DO), an iterative algorithm for two-player, zero-sum normal-form games that repeatedly solves a restricted game over the strategies found so far and adds each player's best response to the opponent's equilibrium strategy (McMahan et al., 2003).

Empirical game-theoretic analysis (EGTA), a methodology for studying large games by constructing and analysing empirical "meta-games" whose payoffs are estimated from simulation.

Algorithm
PSRO works by iteratively training a policy against a distribution over all previous opponent policies found so far. This step of the algorithm is called the best response (BR) and is commonly estimated using reinforcement learning (RL) and function approximation (typically a neural network).
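In small normal-form games this operator can be computed exactly. The following minimal sketch, assuming a payoff matrix for the row player and an opponent mixture, illustrates the exact best response that deep RL approximates when the game is too large to enumerate; the function name is illustrative.

import numpy as np

def exact_best_response(payoff_row: np.ndarray, opponent_dist: np.ndarray) -> int:
    """Return the row policy maximising expected payoff against the opponent's mixture.
    In PSRO this exact argmax is replaced by an RL-trained approximate best response."""
    expected = payoff_row @ opponent_dist  # expected payoff of each row policy
    return int(np.argmax(expected))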

The distribution over opponent policies is determined by a meta-solver (MS), which in turn determines many of the properties of PSRO. For example, if a uniform distribution is used, PSRO is similar to fictitious self-play (FSP), and if the Nash distribution is used, PSRO is similar to Double Oracle.
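A minimal sketch of these two meta-solvers for a two-player, zero-sum meta-game is given below. The function names are illustrative assumptions; the Nash meta-solver uses a standard maximin linear program solved with scipy.optimize.linprog.

import numpy as np
from scipy.optimize import linprog

def uniform_meta_solver(payoffs: np.ndarray):
    """Uniform distribution over each player's policies (fictitious-play-style PSRO)."""
    m, n = payoffs.shape
    return np.full(m, 1.0 / m), np.full(n, 1.0 / n)

def _maximin(payoff: np.ndarray) -> np.ndarray:
    """Maximin mixed strategy for the row player of a zero-sum payoff matrix."""
    m, n = payoff.shape
    # Variables: m strategy weights followed by the game value v; minimise -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent column j: v - sum_i x_i * payoff[i, j] <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The strategy weights must sum to one.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m]

def nash_meta_solver(payoff1: np.ndarray):
    """Nash meta-distributions of a zero-sum meta-game (Double-Oracle-style PSRO).
    payoff1 is player 1's payoff matrix; player 2's payoffs are -payoff1."""
    return _maximin(payoff1), _maximin(-payoff1.T)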

The meta-solver determines a distribution from a meta-game: an empirical normal-form game whose strategies are the policies found so far and whose payoffs are estimated by simulating those policies against one another.
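The sketch below shows how such a meta-game might be estimated empirically. It assumes a hypothetical simulate(policy1, policy2) helper that plays one episode and returns both players' returns; averaging over several episodes accounts for stochasticity in the game and the policies.

import numpy as np

def estimate_meta_game(policies1, policies2, simulate, episodes: int = 100):
    """Fill the meta-game payoff matrices by simulating every pair of policies.
    `simulate` is an assumed helper returning (return_to_player_1, return_to_player_2)."""
    payoffs1 = np.zeros((len(policies1), len(policies2)))
    payoffs2 = np.zeros((len(policies1), len(policies2)))
    for i, pi1 in enumerate(policies1):
        for j, pi2 in enumerate(policies2):
            results = [simulate(pi1, pi2) for _ in range(episodes)]
            payoffs1[i, j] = np.mean([r[0] for r in results])
            payoffs2[i, j] = np.mean([r[1] for r in results])
    return payoffs1, payoffs2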


function expected_return(policy policy1, policy policy2) → (float, float) is
    // Estimate each player's expected payoff by simulating policy1 against policy2.
    return payoff1, payoff2

function meta_solver(matrix payoff1, matrix payoff2) → (dist, dist) is
    // Compute a distribution over each player's policies from the meta-game payoffs
    // (e.g. uniform, Nash or α-Rank).
    return σ1, σ2

function PSRO(game g) → (dist, dist), (list[policy], list[policy]) is
    // Initialize each population with a random policy.
    Π1 := {πrandom}
    Π2 := {πrandom}
    // Iterate until convergence.
    for i in 1, ... do
        // Estimate the meta-game: expected_return for every pair of policies in Π1 × Π2.
        M1, M2 := expected_return(π1, π2) for all π1 ∈ Π1, π2 ∈ Π2
        // Solve the meta-game for distributions over the populations.
        σ1, σ2 := meta_solver(M1, M2)
        // Train approximate best responses against the opponent meta-distributions.
        π′1 := best_response(g, Π2, σ2)
        π′2 := best_response(g, Π1, σ1)
        // The gap measures how much the new policies improve on σ1 and σ2 (exploitability).
        gap := improvement of π′1 against σ2 + improvement of π′2 against σ1
        if gap == 0 then break
        Π1 := Π1 ∪ {π′1}
        Π2 := Π2 ∪ {π′2}
    return (σ1, σ2), (Π1, Π2)
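As a concrete illustration of the loop above, the following self-contained sketch runs a PSRO-style loop on rock–paper–scissors. It is a simplification made for illustration: the "policies" are the pure strategies of the matrix game, the best response is exact rather than RL-trained, payoffs are read from the matrix rather than simulated, and the meta-solver is uniform (giving fictitious-play-like behaviour). All names are illustrative.

import numpy as np

# Payoff matrix for player 1 in rock-paper-scissors (zero-sum); rows/columns = R, P, S.
RPS = np.array([[0., -1., 1.],
                [1., 0., -1.],
                [-1., 1., 0.]])

def best_response(payoff, opponent_dist):
    # Exact best response: the pure strategy with the highest expected payoff.
    return int(np.argmax(payoff @ opponent_dist))

pop1, pop2 = [0], [0]  # both populations start with the "rock" policy
for _ in range(100):
    # Uniform meta-solver over each population; a Nash meta-solver would instead
    # solve the restricted meta-game RPS[np.ix_(pop1, pop2)].
    sigma1 = np.full(len(pop1), 1.0 / len(pop1))
    sigma2 = np.full(len(pop2), 1.0 / len(pop2))
    # Express the meta-distributions as mixtures over the full strategy space.
    mix1, mix2 = np.zeros(3), np.zeros(3)
    np.add.at(mix1, pop1, sigma1)
    np.add.at(mix2, pop2, sigma2)
    # Best responses against the opponent's meta-distribution.
    br1 = best_response(RPS, mix2)
    br2 = best_response(-RPS.T, mix1)
    # Exploitability gap: total improvement the best responses achieve.
    gap = (RPS @ mix2)[br1] + (-RPS.T @ mix1)[br2]
    if gap <= 1e-6:
        break
    pop1.append(br1)
    pop2.append(br2)

print(f"iterations: {len(pop1) - 1}, final exploitability gap: {gap:.3f}")

The gap printed at the end shrinks as the populations grow, reflecting convergence of the meta-distributions towards the uniform Nash equilibrium of rock–paper–scissors.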

Performance extensions
Pipeline PSRO (P2SRO) parallelizes PSRO by training multiple best-response policies simultaneously in a hierarchical pipeline, improving wall-clock efficiency while maintaining convergence guarantees in two-player, zero-sum games.

Other solution concepts
XDO is a variant of DO that converges to an extensive-form Nash equilibrium.

Joint Policy-Space Response Oracles (JPSRO) extends PSRO to converge to normal-form correlated equilibria (CEs) and normal-form coarse correlated equilibria (CCEs) in n-player, general-sum extensive-form games. (C)CEs are a more suitable solution concept for n-player, general-sum games, where there can be many NEs. For two-player, zero-sum games the marginals of any (C)CE form an NE, and therefore PSRO can be viewed as a special case of JPSRO in this setting.

AlphaRank (α-Rank) [CITE] can also be used as a meta-solver.

TODO
Double Oracle citation:

McMahan, H. Brendan, Geoffrey J. Gordon, and Avrim Blum. "Planning in the presence of cost functions controlled by an adversary." Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003.

(Include more citations)
