Cracking Blackjack — Part 3


Outline for this Article

In this article, I will explain the key building blocks of the Reinforcement Learning algorithm we will use to maximize Blackjack returns. These building blocks appear in many other Reinforcement Learning algorithms as well, so it is worthwhile to understand them in a context we know and love: Blackjack!

Always keep this diagram of the Reinforcement Learning Cycle from Part 2 in your head as you read this article!

Image made by Author

The Building Blocks of Our RL Algorithm

In short, the only thing our Reinforcement Learning algorithm will do is define what the agent should do with state → action → reward tuples (explained in Part 2 ) after each episode. The building blocks described below help facilitate the updates our agent will make during the learning process to end up with the optimal policy for playing Blackjack.

Key Data Structures our Algorithm will Use and Update

  • Q Table: A table to keep track of the value (or Q-value) of choosing an action given some state. A Q-table is built by taking a cross-product of the observation_space and action_space defined in our Blackjack environment from Part 2 . The initial Q-value for all state/action pairs is 0.
  • Prob Table: A table created in the same way as the Q Table: a cross-product of the observation_space and action_space . This table contains the probability the agent will choose an action given some state. This reinforces the stochastic approach to policies described in Part 1 . The initial probabilities for actions for each state will be 50% hit / 50% stand.
  • Together, the Q table and Prob table define a living, breathing, stochastic policy that our agent will constantly use to make decisions and will update after getting rewards back from the environment. A minimal sketch of how these two tables might be initialized follows this list.
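To make these two tables concrete, here is a minimal initialization sketch in Python. It assumes a gym-style Blackjack state of (player hand value, dealer upcard, usable ace) and a two-action encoding; the names n_actions, Q, probs, and state are illustrative, not taken from the series' code.

```python
from collections import defaultdict

import numpy as np

n_actions = 2  # assumed encoding: 0 = stand, 1 = hit

# Q-table: every (state, action) pair starts with a Q-value of 0.
Q = defaultdict(lambda: np.zeros(n_actions))

# Prob-table: every state starts at 50% hit / 50% stand.
probs = defaultdict(lambda: np.full(n_actions, 1.0 / n_actions))

# Example lookup: state = (player hand value, dealer upcard, usable ace)
state = (18, 10, False)
print(Q[state], probs[state])   # -> [0. 0.] [0.5 0.5]
```

Using a defaultdict avoids materializing the full observation_space × action_space cross-product up front: a row with the 0 / 50-50 defaults is created lazily the first time a state is encountered.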

Important Variables that Impact the Agent’s Learning Process

  • Alpha (α): This can be thought of as the learning rate . After our agent gets rewards from the environment for an action in some state, it will update the Q-value of the corresponding state-action pair in our Q-table. α is the weight (or coefficient) given to that change in Q-value. α must be > 0 and ≤ 1. A lower α means that each round of Blackjack has a smaller impact on the policy, and facilitates more accurate learning over a larger number of episodes.
Image Made by Author
  • Epsilon (ε): This can be thought of as an analogous “learning rate” for the probabilities in the Prob table. When our agent gets a reward for some state + action, it will also tweak the probability of taking that same action in the future. ε applies a weight/coefficient to each of these changes, much like α does for Q-values. ε must be ≥ 0 and ≤ 1. A higher ε yields a smaller change in the probability of taking an action, keeping the policy closer to its 50/50 starting point (i.e. more exploratory).
Image Made by Author
  • Epsilon Decay (ε-decay): This is the rate at which ε decays after each episode. At the beginning of the agent’s learning process, we want ε to start high so that changes to the Prob table stay small, because we want our agent to keep exploring new actions. This helps ensure the final policy isn’t skewed heavily by randomness early in the learning process. For example, we don’t want a few lucky “hit” actions at player-hand-value = 18 early on to convince our agent that hitting is correct in that position in the long run. As learning goes on, ε-decay gradually reduces ε so the agent can exploit the accurate insights it gained during its earlier exploration phase.
Image Made by Author
  • Epsilon Minimum (ε-min): The explore-vs-exploit dynamic is delicate; the transition from exploring to exploiting can be very sudden if you are not careful. The ε-min variable sets a floor below which ε cannot decay, which in turn caps how much any single episode can alter the probability of an action for some state in the Prob table.
  • Gamma (γ): In a given episode (or round) of Blackjack, the AI agent will sometimes make more than one decision. Let’s say our AI agent hits when the player hand value = 4, and then makes 2 more decisions after that. The agent only receives a reward at the very end of this episode. How much is the initial “hit” action responsible for the final reward? γ helps answer this: we use it as a discount rate on the episode’s final reward to approximate the reward attributable to the initial “hit” action. γ must be > 0 and ≤ 1. A sketch of how γ and the other levers enter the post-episode update follows this list.
Image Made by Author
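
To see how all of these levers fit together, below is a hedged sketch of a Monte-Carlo-style post-episode update built on the Q / probs tables sketched above. The names update_tables, episode, epsilon_decay, and epsilon_min are illustrative assumptions, not the series’ exact implementation.

```python
import numpy as np

def update_tables(episode, Q, probs, alpha, epsilon, gamma):
    """One learning pass over a finished round of Blackjack.

    `episode` is the list of (state, action, reward) tuples the agent
    collected during the round (an illustrative structure).
    """
    G = 0.0  # discounted return, accumulated from the end of the episode backwards
    for state, action, reward in reversed(episode):
        # gamma discounts the final reward back onto earlier decisions
        G = reward + gamma * G

        # alpha weights how far the Q-value moves toward the observed return
        Q[state][action] += alpha * (G - Q[state][action])

        # epsilon-greedy probability update: the action with the highest
        # Q-value gets probability 1 - epsilon + epsilon/2, the other gets
        # epsilon/2, so a higher epsilon leaves the policy closer to 50/50.
        n_actions = len(Q[state])
        best_action = int(np.argmax(Q[state]))
        probs[state] = np.full(n_actions, epsilon / n_actions)
        probs[state][best_action] += 1.0 - epsilon


# After each episode, epsilon decays toward its floor so the agent shifts
# from exploring to exploiting without ever becoming fully deterministic:
# epsilon = max(epsilon * epsilon_decay, epsilon_min)
```

With this framing, each bullet above falls out of the code: a smaller α makes every round nudge the Q-values less; a larger ε leaves the policy closer to 50/50; ε-decay with an ε-min floor governs the explore-to-exploit transition; and γ decides how much of the final reward is credited to the earlier decisions in the round.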

The variables above should be thought of as levers: they can be increased or decreased to experiment with the agent’s learning process. Later, we will go over which combination of these levers yields the best policy and highest returns in Blackjack.
