Talk:Multi-armed bandit/Archives/2015

The multi-armed bandit model
The sentence stating "The regret $$\rho$$ after T rounds is defined as the difference between the reward sum associated with an optimal strategy and the sum of the collected rewards:" is incorrect. It should say "...is defined as the expected difference between..." not "the difference between" since $$T\mu^*$$ represents an expected value. Chafe66 (talk) 20:11, 30 November 2012 (UTC)


 * In fact, it is my position that defining regret in many cases is non-trivial. The regret at time $$t$$ may be defined one of a few different ways and indeed varies between papers. Especially important is the difference in strong and weak or pseudo-regret, but also the difference in expected value regret (taking expectation of the arm pulled, versus sampling from it) and experimental regret (using the best trace in an experimental context as the oracle for defining correctness). On Wikipedia, however, the aim is for verifiability, so the definition should only consider the published forms of regret definitions for now. It should acknowledge how they differ though. Giuseppe Burtini (talk) 23:35, 11 December 2014 (UTC)


 * I can't see how anything but an expected value makes any sense at all here. Any one experimental case is irrelevant. What matters is how an algorithm does *on average*, and that means taking expectations.Chafe66 (talk) 07:54, 29 May 2015 (UTC)


 * For one possible counterexample, in some application areas you care about other moments of regret or percentiles of regret rather than expectation. This is important (in my humble opinion) when we discuss variants such as risk-adjusted bandits, applications in medical trials (maximum allowed risk) and financial market applications (maximum allowable drawdown). Giuseppe Burtini (talk) 14:40, 1 June 2015 (UTC)

NeuralBandit and adversarial vs. contextual categorization
With only a cursory exploration at this point, it appears to me that NeuralBandit (and perhaps its parent algorithm, Banditron, currently unmentioned) belongs in the adversarial section, or perhaps contextual should be restructured to be a sub-type of adversarial. An adversarial bandit is one with no constraints on its underlying reward distribution, while a contextual bandit is one where the reward is expected to be generated by some other stochastic model. I think the distinction and hierarchy of sub-bandit problems requires some discussion, so I haven't gone ahead and made the change yet. Giuseppe Burtini (talk) 14:52, 13 December 2014 (UTC)


 * I retract the above after substantial further research. The authors of the NB paper call it contextual, it should be listed as contextual. Giuseppe Burtini (talk) 03:17, 6 August 2015 (UTC)