In less recent times (circa the 1960s), this problem was posed and considered in the case where the payoff mechanisms had a very simple structure: each slot machine is a coin flip with a different probability $p$ of winning, and the player's goal is to find the best machine as quickly as possible. We called this the "stochastic" setting, and last time we saw a modern strategy called UCB1, which maintained statistical estimates of the payoffs of the actions and chose the action with the highest optimistic estimate. The underlying philosophy was "optimism in the face of uncertainty," and it gave us something provably close to optimal. Unfortunately, payoff structures in the real world are more complex than coin flips.

Herbert Robbins, one of the first to study bandit learning algorithms.
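
As a reminder of what UCB1 looks like in the stochastic setting, here is a minimal sketch. It is an illustration, not the exact code from last time: the names `ucb1` and `make_coin_machines` are hypothetical, payoffs are assumed to lie in $[0, 1]$, the number of rounds is assumed to be at least the number of actions, and the confidence radius $\sqrt{2 \ln t / n_i}$ is the standard textbook choice.

```python
import math
import random


def ucb1(num_actions, reward, num_rounds):
    """Sketch of UCB1. reward(action) is assumed to return a payoff in [0, 1]."""
    counts = [0] * num_actions   # number of times each action was played
    sums = [0.0] * num_actions   # total payoff collected from each action

    # Play every action once so that each empirical mean is defined.
    for action in range(num_actions):
        sums[action] += reward(action)
        counts[action] += 1

    for t in range(num_actions, num_rounds):
        # Optimism in the face of uncertainty: empirical mean plus a
        # confidence radius that shrinks as the action is played more often.
        def upper_bound(i):
            mean = sums[i] / counts[i]
            return mean + math.sqrt(2 * math.log(t + 1) / counts[i])

        action = max(range(num_actions), key=upper_bound)
        sums[action] += reward(action)
        counts[action] += 1

    return counts, sums


def make_coin_machines(biases, rng):
    """Hypothetical slot machines: machine i pays 1 with probability biases[i]."""
    return lambda i: 1.0 if rng.random() < biases[i] else 0.0
```

For example, `ucb1(3, make_coin_machines([0.2, 0.5, 0.8], random.Random(0)), 5000)` returns the play counts and total payoffs per machine, and one would expect the third machine to receive the large majority of the plays.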