Overview / Usage
Reinforcement learning is learning what to do - how to map situations to actions - so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards.
The multi-armed bandit problem
Maximize the reward obtained by successively playing gambling machines (the "arms" of the bandit). The problem was introduced in the early 1950s by Robbins to model decision making under uncertainty when the environment is unknown: the lotteries are unknown ahead of time.
Assumptions
Each machine i has a different (unknown) reward distribution with (unknown) expectation μ_i. Successive plays of the same machine yield rewards that are independent and identically distributed, and independence also holds for rewards across machines.

The reward is a random variable X_{i,n}, with 1 ≤ i ≤ K and n ≥ 1, where i is the index of the gambling machine, n is the number of plays, and μ_i is the expected reward of machine i.

A policy, or allocation strategy, A is an algorithm that chooses the next machine to play based on the sequence of past plays and obtained rewards.
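The setting above can be sketched in code. The snippet below simulates K Bernoulli arms and an ε-greedy policy; the source does not name a specific policy, so ε-greedy, the arm means, and all parameter values here are illustrative assumptions, not part of the original notes.

```python
import random

def epsilon_greedy(true_means, n_plays, epsilon=0.1, seed=0):
    """Play K Bernoulli arms for n_plays rounds with an epsilon-greedy policy.

    true_means[i] plays the role of the unknown expectation mu_i; the policy
    only sees the sampled rewards X_{i,n}, never true_means itself.
    """
    rng = random.Random(seed)
    K = len(true_means)
    counts = [0] * K        # number of times each arm was played
    estimates = [0.0] * K   # running average reward per arm
    total_reward = 0.0
    for _ in range(n_plays):
        if rng.random() < epsilon:
            i = rng.randrange(K)                           # explore: random arm
        else:
            i = max(range(K), key=lambda a: estimates[a])  # exploit: best estimate
        # Bernoulli reward with (unknown to the policy) mean true_means[i]
        reward = 1.0 if rng.random() < true_means[i] else 0.0
        counts[i] += 1
        estimates[i] += (reward - estimates[i]) / counts[i]  # incremental mean
        total_reward += reward
    return total_reward, counts

reward, counts = epsilon_greedy([0.2, 0.5, 0.8], n_plays=5000)
```

With enough plays, the arm with the highest true mean ends up played most often, while the ε fraction of random plays keeps estimating the other arms.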
Many applications have been studied:
- Clinical trials
- Adaptive routing in networks
- Advertising: which ad to put on a web page?
- Economics: auctions
- Computation of Nash equilibria