The Exploration Exploitation Trade-off



An Introduction to Reinforcement Learning

Jan 28 · 3 min read

The ideas of exploration and exploitation are central to designing an expedient reinforcement learning system. The word “expedient” is a term adopted from the theory of Learning Automata to describe a system in which the Agent (or Automaton) learns the dynamics of the stochastic Environment. In other words, the Agent learns a policy for selecting actions in a random Environment that is better than pure chance.

In training an Agent to learn in a random Environment, the challenges of exploration and exploitation immediately arise. An Agent receives rewards as it interacts with an Environment in a feedback framework. To maximize its rewards, it is typical for the Agent to repeat actions that produced “favourable” rewards in the past. However, to find the actions that lead to rewards in the first place, the Agent has to sample from the set of actions and try out actions it has not previously selected. Notice how this idea follows nicely from the “law of effect” in behavioural psychology, where an Agent strengthens the mental bonds to actions that produced a reward. In doing so, the Agent must also try out previously unselected actions; otherwise, it will fail to discover better actions.

Figure: The reinforcement learning feedback framework. An Agent iteratively interacts with an Environment and learns a policy for maximizing long-term rewards from the Environment.

Exploration is when an Agent samples actions from the set of available actions in order to discover better rewards. Exploitation, on the other hand, is when an Agent takes advantage of what it already knows and repeats actions that lead to “favourable” long-term rewards. The key challenge in designing reinforcement learning systems is balancing the trade-off between exploration and exploitation. In a stochastic Environment, actions have to be sampled sufficiently often to obtain a reliable estimate of their expected rewards. An Agent that pursues exploration or exploitation exclusively is bound to be less than expedient; it can even end up worse than pure chance (i.e. a randomized agent).
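One simple and widely used way to strike this balance is an epsilon-greedy rule: exploit the best-looking action most of the time, but explore a random action with a small probability. The sketch below is only illustrative; the function name `choose_action`, the `value_estimates` list, and the value of `epsilon` are assumptions for this example, not details given in the article.

```python
import random

def choose_action(value_estimates, epsilon=0.1):
    """Epsilon-greedy action selection.

    With probability epsilon the Agent explores by picking a random
    action; otherwise it exploits by picking the action with the
    highest current value estimate.
    """
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))  # explore
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])  # exploit
```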

Multi-armed Bandits

In a multi-armed bandit (MAB) problem (also called the n-armed bandit problem), an Agent makes a choice from a set of actions. This choice results in a numeric reward from the Environment based on the selected action. In this specific case, the Environment is characterized by a stationary probability distribution over rewards. By stationary, we mean that the reward distribution associated with each action is fixed: it does not change over time as the Agent interacts with the Environment. The goal of the Agent in a MAB problem is to maximize the rewards received from the Environment over a specified period.

The MAB problem is an extension of the “one-armed bandit” problem, which is represented by a slot machine in a casino. In the MAB setting, instead of a slot machine with a single lever, we have a machine with multiple levers. Each lever corresponds to an action the Agent can play. The goal of the Agent is to make plays that maximize its winnings (i.e. rewards) from the machine. The Agent has to figure out which levers are best (exploration) and then concentrate on those levers (exploitation) to maximize its return (i.e. the sum of the rewards).

Figure: Left: one-armed bandit. The slot machine has one lever that returns a numerical reward when played. Right: multi-armed bandit. The slot machine has multiple (n) levers, each returning a numerical reward when played. In a MAB problem, the reinforcement learning agent must balance exploration and exploitation to maximize its return.
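To make the setting concrete, here is a minimal sketch of such a stationary bandit, assuming each lever pays out a reward drawn from a fixed Gaussian around its own hidden mean. The class name, the number of arms, and the Gaussian reward model are illustrative assumptions rather than part of the original problem statement.

```python
import random

class MultiArmedBandit:
    """A toy stationary n-armed bandit.

    Each lever has a fixed mean reward (unknown to the Agent). Pulling
    a lever returns that mean plus Gaussian noise, so the reward
    distribution of every lever stays the same over time.
    """

    def __init__(self, n_arms=10, seed=None):
        self._rng = random.Random(seed)
        self.true_means = [self._rng.gauss(0.0, 1.0) for _ in range(n_arms)]

    def pull(self, arm):
        """Play the given lever and return a sampled reward."""
        return self._rng.gauss(self.true_means[arm], 1.0)
```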

For each action (i.e. lever) on the machine, there is an expected reward. If this expected reward were known to the Agent, the problem would degenerate into a trivial one: simply pick the action with the highest expected reward. But since the expected rewards of the levers are not known, the Agent has to build estimates of how desirable each action is. To do this, the Agent has to explore and average the rewards it receives for each action. Afterwards, it can exploit its knowledge and choose the action with the highest estimated reward (this is also called selecting the greedy action). As we can see, the Agent has to balance exploring and exploiting actions to maximize the overall long-term reward.
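As a sketch of how these estimates might be collected in practice, the snippet below keeps a running sample average of the reward for each action and selects the greedy action most of the time while still exploring occasionally. The function name, step count, and epsilon value are assumptions for illustration, and it reuses the toy MultiArmedBandit sketched above.

```python
import random

def run_bandit(bandit, n_arms=10, steps=1000, epsilon=0.1):
    """Estimate each action's value with a running sample average and
    act epsilon-greedily on those estimates."""
    estimates = [0.0] * n_arms  # Q(a): average reward observed so far
    counts = [0] * n_arms       # N(a): how many times each action was tried
    total_reward = 0.0

    for _ in range(steps):
        if random.random() < epsilon:
            action = random.randrange(n_arms)  # explore
        else:
            # exploit: select the greedy action under the current estimates
            action = max(range(n_arms), key=lambda a: estimates[a])

        reward = bandit.pull(action)
        counts[action] += 1
        # incremental sample-average update: Q <- Q + (reward - Q) / N
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, total_reward


# Usage (with the MultiArmedBandit sketch above):
# estimates, total = run_bandit(MultiArmedBandit(seed=0))
```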

Bibliography

  • Narendra, K. S., & Thathachar, M. A. (2012). Learning automata: An introduction. Courier Corporation.
  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
