The Exploration Exploitation Trade-off



An Introduction to Reinforcement Learning

Jan 28 · 3 min read

The ideas of exploration and exploitation are central to designing an expedient reinforcement learning system. The word “expedient” is a term adopted from the theory of Learning Automata to describe a system in which the Agent (or Automaton) learns the dynamics of the stochastic Environment. In other words, the Agent learns a policy for selecting actions in a random Environment that is better than pure chance.

In training an Agent to learn in a random Environment, the challenges of exploration and exploitation immediately arise. An Agent receives rewards as it interacts with an Environment in a feedback framework. To maximize its rewards, it is typical for the Agent to repeat actions that produced “favourable” rewards in the past. However, to find the actions that lead to rewards in the first place, the Agent has to sample from the set of actions and try out actions it has not previously selected. Notice how this idea follows nicely from the “law of effect” in behavioural psychology, where an Agent strengthens the mental bonds to actions that produced a reward. In doing so, the Agent must also try out previously unselected actions; otherwise, it will fail to discover better actions.

Figure: The reinforcement learning feedback framework. An Agent iteratively interacts with an Environment and learns a policy for maximizing long-term rewards from the Environment.

Exploration is when an Agent samples actions from the set of available actions in order to discover better rewards. Exploitation, on the other hand, is when an Agent takes advantage of what it already knows and repeats actions that lead to “favourable” long-term rewards. The key challenge in designing reinforcement learning systems is balancing the trade-off between exploration and exploitation. In a stochastic Environment, actions have to be sampled sufficiently often to obtain a reliable estimate of their expected rewards. An Agent that pursues exploration or exploitation exclusively is bound to be less than expedient; it can even end up worse than pure chance (i.e. a randomized agent).
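One simple and widely used way to strike this balance is an epsilon-greedy rule: exploit the best-looking action most of the time, but explore a random action with a small probability. The sketch below is only illustrative; the function name `choose_action`, the `value_estimates` list, and the value of `epsilon` are assumptions for this example, not details given in the article.

```python
import random

def choose_action(value_estimates, epsilon=0.1):
    """Epsilon-greedy action selection.

    With probability epsilon the Agent explores by picking a random
    action; otherwise it exploits by picking the action with the
    highest current value estimate.
    """
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))  # explore
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])  # exploit
```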

Multi-armed Bandits

In a multi-armed bandit (MAB) problem (also called the n-armed bandit problem), an Agent makes a choice from a set of actions. This choice results in a numeric reward from the Environment based on the selected action. In this specific case, the Environment is characterized by a stationary probability distribution over rewards. By stationary, we mean that the reward distribution associated with each action is fixed: it does not change over time as the Agent interacts with the Environment. The goal of the Agent in a MAB problem is to maximize the rewards received from the Environment over a specified period.

The MAB problem is an extension of the “one-armed bandit” problem, which is represented by a slot machine in a casino. In the MAB setting, instead of a slot machine with a single lever, we have a machine with multiple levers. Each lever corresponds to an action the Agent can play. The goal of the Agent is to make plays that maximize its winnings (i.e. rewards) from the machine. The Agent has to figure out which levers are best (exploration) and then concentrate on those levers (exploitation) to maximize its return (i.e. the sum of the rewards).

Figure: Left: one-armed bandit. The slot machine has one lever that returns a numerical reward when played. Right: multi-armed bandit. The slot machine has multiple (n) levers, each returning a numerical reward when played. In a MAB problem, the reinforcement learning agent must balance exploration and exploitation to maximize its return.
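To make the setting concrete, here is a minimal sketch of such a stationary bandit, assuming each lever pays out a reward drawn from a fixed Gaussian around its own hidden mean. The class name, the number of arms, and the Gaussian reward model are illustrative assumptions rather than part of the original problem statement.

```python
import random

class MultiArmedBandit:
    """A toy stationary n-armed bandit.

    Each lever has a fixed mean reward (unknown to the Agent). Pulling
    a lever returns that mean plus Gaussian noise, so the reward
    distribution of every lever stays the same over time.
    """

    def __init__(self, n_arms=10, seed=None):
        self._rng = random.Random(seed)
        self.true_means = [self._rng.gauss(0.0, 1.0) for _ in range(n_arms)]

    def pull(self, arm):
        """Play the given lever and return a sampled reward."""
        return self._rng.gauss(self.true_means[arm], 1.0)
```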

For each action (i.e. lever) on the machine, there is an expected reward. If this expected reward were known to the Agent, the problem would degenerate into a trivial one: simply pick the action with the highest expected reward. But since the expected rewards of the levers are not known, the Agent has to build estimates of how desirable each action is. To do this, the Agent has to explore and average the rewards it receives for each action. Afterwards, it can exploit its knowledge and choose the action with the highest estimated reward (this is also called selecting the greedy action). As we can see, the Agent has to balance exploring and exploiting actions to maximize the overall long-term reward.
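As a sketch of how these estimates might be collected in practice, the snippet below keeps a running sample average of the reward for each action and selects the greedy action most of the time while still exploring occasionally. The function name, step count, and epsilon value are assumptions for illustration, and it reuses the toy MultiArmedBandit sketched above.

```python
import random

def run_bandit(bandit, n_arms=10, steps=1000, epsilon=0.1):
    """Estimate each action's value with a running sample average and
    act epsilon-greedily on those estimates."""
    estimates = [0.0] * n_arms  # Q(a): average reward observed so far
    counts = [0] * n_arms       # N(a): how many times each action was tried
    total_reward = 0.0

    for _ in range(steps):
        if random.random() < epsilon:
            action = random.randrange(n_arms)  # explore
        else:
            # exploit: select the greedy action under the current estimates
            action = max(range(n_arms), key=lambda a: estimates[a])

        reward = bandit.pull(action)
        counts[action] += 1
        # incremental sample-average update: Q <- Q + (reward - Q) / N
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, total_reward


# Usage (with the MultiArmedBandit sketch above):
# estimates, total = run_bandit(MultiArmedBandit(seed=0))
```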

Bibliography

  • Narendra, K. S., & Thathachar, M. A. (2012). Learning automata: An introduction. Courier Corporation.
  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
