Cracking Blackjack — Part 3


Outline for this Article

In this article, I will explain the key building blocks of the Reinforcement Learning algorithm we will use to maximize Blackjack returns. These building blocks appear in many other Reinforcement Learning algorithms as well, so it is worthwhile to understand them in a context we know and love: Blackjack!

Always keep this diagram of the Reinforcement Learning Cycle from Part 2 in your head as you read this article!

Image made by Author

The Building Blocks of Our RL Algorithm

In short, the only thing our Reinforcement Learning algorithm will do is define what the agent should do with state → action → reward tuples (explained in Part 2 ) after each episode. The building blocks described below help facilitate the updates our agent will make during the learning process to end up with the optimal policy for playing Blackjack.

Key Data Structures our Algorithm will Use and Update

  • Q Table: A table to keep track of the value (or Q-value) of choosing an action given some state. A Q-table is built by taking a cross-product of the observation_space and action_space defined in our Blackjack environment from Part 2 . The initial Q-value for all state/action pairs is 0.
  • Prob Table: A table created in the same way as the Q Table: a cross-product of the observation_space and action_space . This table contains the probability the agent will choose an action given some state. This reinforces the stochastic approach to policies described in Part 1 . The initial probabilities for actions for each state will be 50% hit / 50% stand.
  • Together, the Q table and Prob table define a living, breathing, stochastic policy that our agent will constantly use to make decisions and will update after getting rewards back from the environment. A minimal sketch of how these two tables might be initialized follows this list.
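To make these two tables concrete, here is a minimal initialization sketch in Python. It assumes a gym-style Blackjack state of (player hand value, dealer upcard, usable ace) and a two-action encoding; the names n_actions, Q, probs, and state are illustrative, not taken from the series' code.

```python
from collections import defaultdict

import numpy as np

n_actions = 2  # assumed encoding: 0 = stand, 1 = hit

# Q-table: every (state, action) pair starts with a Q-value of 0.
Q = defaultdict(lambda: np.zeros(n_actions))

# Prob-table: every state starts at 50% hit / 50% stand.
probs = defaultdict(lambda: np.full(n_actions, 1.0 / n_actions))

# Example lookup: state = (player hand value, dealer upcard, usable ace)
state = (18, 10, False)
print(Q[state], probs[state])   # -> [0. 0.] [0.5 0.5]
```

Using a defaultdict avoids materializing the full observation_space × action_space cross-product up front: a row with the 0 / 50-50 defaults is created lazily the first time a state is encountered.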

Important Variables that Impact the Agent’s Learning Process

  • Alpha (α): This can be thought of as the learning rate . After our agent gets rewards from the environment for an action in some state, it will update the Q-value of the corresponding state-action pair in our Q-table. α is the weight (or coefficient) given to that change in Q-value. α must be > 0 and ≤ 1. A lower α means that each round of Blackjack has a smaller impact on the policy, and facilitates more accurate learning over a larger number of episodes.
Image Made by Author
  • Epsilon (ε): This can be thought of as an analogous “learning rate” for the probabilities in the Prob table. When our agent gets a reward for some state + action, it will also tweak the probability of taking that same action in the future. ε applies a weight/coefficient to each of these changes, much like α does for Q-values. ε must be ≥ 0 and ≤ 1. A higher ε yields a smaller change in the probability of taking an action, keeping the policy closer to its 50/50 starting point (i.e. more exploratory).
Image Made by Author
  • Epsilon Decay (ε-decay): This is the rate at which ε decays after each episode. At the beginning of the agent’s learning process, we want ε to start high so that changes to the Prob table stay small, because we want our agent to keep exploring new actions. This helps ensure the final policy isn’t skewed heavily by randomness early in the learning process. For example, we don’t want a few lucky “hit” actions at player-hand-value = 18 early on to convince our agent that hitting is correct in that position in the long run. As learning goes on, ε-decay gradually reduces ε so the agent can exploit the accurate insights it gained during its earlier exploration phase.
Image Made by Author
  • Epsilon Minimum (ε-min): The explore-vs-exploit dynamic is delicate; the transition from exploring to exploiting can be very sudden if you are not careful. The ε-min variable sets a floor below which ε cannot decay, which in turn caps how much any single episode can alter the probability of an action for some state in the Prob table.
  • Gamma (γ): In a given episode (or round) of Blackjack, the AI agent will sometimes make more than one decision. Let’s say our AI agent hits when the player hand value = 4, and then makes 2 more decisions after that. The agent only receives a reward at the very end of this episode. How much is the initial “hit” action responsible for the final reward? γ helps answer this: we use it as a discount rate on the episode’s final reward to approximate the reward attributable to the initial “hit” action. γ must be > 0 and ≤ 1. A sketch of how γ and the other levers enter the post-episode update follows this list.
Image Made by Author
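
To see how all of these levers fit together, below is a hedged sketch of a Monte-Carlo-style post-episode update built on the Q / probs tables sketched above. The names update_tables, episode, epsilon_decay, and epsilon_min are illustrative assumptions, not the series’ exact implementation.

```python
import numpy as np

def update_tables(episode, Q, probs, alpha, epsilon, gamma):
    """One learning pass over a finished round of Blackjack.

    `episode` is the list of (state, action, reward) tuples the agent
    collected during the round (an illustrative structure).
    """
    G = 0.0  # discounted return, accumulated from the end of the episode backwards
    for state, action, reward in reversed(episode):
        # gamma discounts the final reward back onto earlier decisions
        G = reward + gamma * G

        # alpha weights how far the Q-value moves toward the observed return
        Q[state][action] += alpha * (G - Q[state][action])

        # epsilon-greedy probability update: the action with the highest
        # Q-value gets probability 1 - epsilon + epsilon/2, the other gets
        # epsilon/2, so a higher epsilon leaves the policy closer to 50/50.
        n_actions = len(Q[state])
        best_action = int(np.argmax(Q[state]))
        probs[state] = np.full(n_actions, epsilon / n_actions)
        probs[state][best_action] += 1.0 - epsilon


# After each episode, epsilon decays toward its floor so the agent shifts
# from exploring to exploiting without ever becoming fully deterministic:
# epsilon = max(epsilon * epsilon_decay, epsilon_min)
```

With this framing, each bullet above falls out of the code: a smaller α makes every round nudge the Q-values less; a larger ε leaves the policy closer to 50/50; ε-decay with an ε-min floor governs the explore-to-exploit transition; and γ decides how much of the final reward is credited to the earlier decisions in the round.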

The variables above should be thought of as levers: they can be increased or decreased to experiment with the agent’s learning process. Later, we will go over which combination of these levers yields the best policy and highest returns in Blackjack.
