The Exploration-Exploitation Trade-off



An Introduction to Reinforcement Learning

Jan 28 · 3 min read

The ideas of exploration and exploitation are central to designing an expedient reinforcement learning system. The word “expedient” is a term adopted from the theory of Learning Automata, where it describes a system in which the Agent (or Automaton) learns the dynamics of the stochastic Environment. In other words, the Agent learns a policy for selecting actions in a random Environment that performs better than pure chance.

In training an Agent to learn in a random Environment, the challenges of exploration and exploitation immediately arise. An Agent receives rewards as it interacts with an Environment in a feedback framework. To maximize its rewards, the Agent will typically repeat actions it tried in the past that produced “favourable” rewards. However, to find the actions that lead to rewards in the first place, the Agent has to sample from the set of actions and try out actions it has not previously selected. Notice how this idea develops nicely from the “law of effect” in behavioural psychology, where an Agent strengthens mental bonds on actions that produced a reward. In doing so, the Agent must also try out previously unselected actions; otherwise, it will fail to discover better actions.

Figure: Reinforcement learning feedback framework. An Agent iteratively interacts with an Environment and learns a policy for maximizing long-term rewards from the Environment.

Exploration is when an Agent samples actions from the set of available actions in order to discover better rewards. Exploitation, on the other hand, is when an Agent takes advantage of what it already knows and repeats actions that have led to “favourable” long-term rewards. The key challenge in designing reinforcement learning systems is balancing the trade-off between exploration and exploitation. In a stochastic Environment, actions have to be sampled sufficiently often to obtain reliable estimates of their expected rewards. An Agent that pursues exploration or exploitation exclusively is bound to be less than expedient; in the worst case, it performs no better than pure chance (i.e. a randomized agent).
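To make the feedback framework concrete, here is a minimal sketch of the Agent-Environment interaction loop in Python. The `Environment` and `Agent` interfaces (`step`, `select_action`, `update`) are my own illustrative placeholders, not part of any specific library.

```python
# Minimal sketch of the reinforcement learning feedback loop.
# The agent/environment interfaces are hypothetical, for illustration only.
def run_episode(agent, environment, num_steps=1000):
    total_reward = 0.0
    for _ in range(num_steps):
        action = agent.select_action()     # explore or exploit
        reward = environment.step(action)  # Environment returns a reward
        agent.update(action, reward)       # Agent refines its estimates
        total_reward += reward
    return total_reward
```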

Multi-armed Bandits

In a multi-armed bandit (MAB) problem (also called an n-armed bandit problem), an Agent makes a choice from a set of actions. This choice results in a numeric reward from the Environment based on the selected action. In this specific case, the Environment is characterized by a stationary probability distribution over rewards. By stationary, we mean that the reward distribution for each action is fixed: it does not change over time or with the state of the Environment as the Agent interacts with it. The goal of the Agent in a MAB problem is to maximize the rewards it receives from the Environment over a specified period.
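As an illustration, a stationary k-armed bandit Environment can be sketched as below. The class name and the Gaussian reward distribution are assumptions made for the sketch; the essential point is that each arm's true expected reward is fixed over time.

```python
import random

class StationaryBandit:
    """k-armed bandit with fixed (stationary) reward distributions.

    Each arm i has a true mean reward q_star[i]; pulling an arm returns
    that mean plus unit-variance Gaussian noise. Illustrative sketch only.
    """
    def __init__(self, k=10, seed=None):
        self.rng = random.Random(seed)
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def step(self, action):
        # The reward distribution does not change over time: stationary.
        return self.rng.gauss(self.q_star[action], 1.0)
```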

The MAB problem is an extension of the “one-armed bandit” problem, the name given to a slot machine in a casino. In the MAB setting, instead of a slot machine with one lever, we have a machine with multiple levers. Each lever corresponds to an action the Agent can play. The goal of the Agent is to make plays that maximize its winnings (i.e. rewards) from the machine. The Agent has to figure out which levers are best (exploration) and then concentrate on the levers (exploitation) that maximize its returns (i.e. the sum of the rewards).

Figure. Left: One-armed bandit. The slot machine has one lever that returns a numerical reward when played. Right: Multi-armed bandits. The slot machine has multiple (n) arms, each returning a numerical reward when played. In a MAB problem, the reinforcement learning Agent must balance exploration and exploitation to maximize returns.

For each action (i.e. lever) on the machine, there is an expected reward. If this expected reward were known to the Agent, the problem would degenerate into a trivial one: simply pick the action with the highest expected reward. But since the expected rewards of the levers are not known, the Agent has to build estimates of the desirability of each action. For this, the Agent has to explore to obtain the average of the rewards received for each action. Afterwards, it can exploit its knowledge and choose the action with the highest estimated reward (this is also called selecting the greedy action). As we can see, the Agent has to balance exploring and exploiting actions to maximize the overall long-term reward.
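One standard way to strike this balance is the epsilon-greedy rule described in Sutton and Barto: with a small probability the Agent explores a random arm, and otherwise it exploits the greedy action. The sketch below keeps incremental sample-average estimates of each arm's reward; the class and parameter names are illustrative, not taken from the article.

```python
import random

class EpsilonGreedyAgent:
    """Sample-average action-value estimates with epsilon-greedy selection."""
    def __init__(self, k=10, epsilon=0.1, seed=None):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.q_estimates = [0.0] * k  # estimated value of each arm
        self.counts = [0] * k         # number of times each arm was pulled

    def select_action(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q_estimates))  # explore
        best = max(self.q_estimates)
        return self.q_estimates.index(best)                   # exploit (greedy)

    def update(self, action, reward):
        # Incremental sample-average update: Q <- Q + (r - Q) / n
        self.counts[action] += 1
        self.q_estimates[action] += (reward - self.q_estimates[action]) / self.counts[action]
```

Combined with the interaction loop and bandit Environment sketched earlier, a call such as `run_episode(EpsilonGreedyAgent(k=10), StationaryBandit(k=10))` lets you vary epsilon and observe the exploration-exploitation trade-off directly.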

Bibliography

  • Narendra, K. S., & Thathachar, M. A. (2012). Learning automata: An introduction. Courier Corporation.
  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
