Cracking Blackjack — Part 3


Outline for this Article

In this article, I will explain the key building blocks of the Reinforcement Learning algorithm we will use to maximize Blackjack returns. These building blocks appear in many other Reinforcement Learning algorithms as well, so it is worthwhile to understand them in a context we know and love: Blackjack!

Always keep the diagram of the Reinforcement Learning Cycle from Part 2 in your head as you read this article!


The Building Blocks of Our RL Algorithm

In short, the only thing our Reinforcement Learning algorithm will do is define what the agent should do with the state → action → reward tuples (explained in Part 2) after each episode. The building blocks described below facilitate the updates our agent makes during the learning process, so that it ends up with the optimal policy for playing Blackjack.
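To make that concrete, here is a small, purely illustrative example of what one finished episode might hand the agent. The exact state encoding is an assumption on my part: I am assuming the Part 2 environment reports states as (player hand value, dealer's visible card, usable ace) and encodes the actions as 0 = stand, 1 = hit.

```python
# Purely illustrative episode: a list of (state, action, reward) tuples.
# Assumptions (not spelled out in this article): states look like
# (player_hand_value, dealer_showing_card, has_usable_ace) and actions are
# encoded as 0 = stand, 1 = hit.
episode = [
    ((13, 7, False), 1, 0.0),  # hit on 13 vs. a dealer 7 -> no reward yet
    ((19, 7, False), 0, 1.0),  # stand on 19 -> win the round, reward +1
]
```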

Key Data Structures our Algorithm will Use and Update

  • Q Table: A table that keeps track of the value (or Q-value) of choosing an action given some state. The Q-table is built by taking the cross-product of the observation_space and action_space defined in our Blackjack environment from Part 2. The initial Q-value for every state/action pair is 0.
  • Prob Table: A table created the same way as the Q table: a cross-product of the observation_space and action_space. This table contains the probability that the agent will choose an action given some state, which reinforces the stochastic approach to policies described in Part 1. The initial action probabilities for every state are 50% hit / 50% stand.
  • Together, the Q table and Prob table define a living, breathing, stochastic policy that our agent will constantly use to make decisions and will update after getting rewards back from the environment. A minimal sketch of both tables follows this list.
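Below is that minimal sketch of how both tables could be represented in code. The container choices (dictionaries keyed by state, NumPy arrays with one entry per action) are my assumption; the article only pins down the initial values of 0 for Q-values and 50/50 for probabilities.

```python
from collections import defaultdict

import numpy as np

N_ACTIONS = 2  # size of the Blackjack action_space from Part 2: 0 = stand, 1 = hit

# Q table: every (state, action) pair starts with a Q-value of 0.
Q_table = defaultdict(lambda: np.zeros(N_ACTIONS))

# Prob table: every state starts at 50% hit / 50% stand.
prob_table = defaultdict(lambda: np.full(N_ACTIONS, 1.0 / N_ACTIONS))

# The agent samples its next action directly from this stochastic policy.
state = (13, 7, False)  # (player hand value, dealer's visible card, usable ace)
action = np.random.choice(N_ACTIONS, p=prob_table[state])
```

Using a defaultdict means a row is only materialized the first time a state is actually visited, which behaves like the full cross-product table without allocating all of it up front.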

Important Variables that Impact the Agent’s Learning Process

  • Alpha (α): This can be thought of as the learning rate . After our agent gets rewards from the environment for an action in some state, it will update the Q-value of the corresponding state-action pair in our Q-table. α is the weight (or coefficient) given to that change in Q-value. α must be > 0 and ≤ 1. A lower α means that each round of Blackjack has a smaller impact on the policy, and facilitates more accurate learning over a larger number of episodes.
  • Epsilon (ε): This can be thought of as an analogous “learning rate” for the probabilities in the Prob table. When our agent gets a reward for some state + action, it will also tweak the probability of taking that same action in the future. ε acts as a weight/coefficient on each of these changes, much like α does for Q-values. ε must be ≥ 0 and ≤ 1. A higher ε yields a smaller change in the probability of taking an action.
  • Epsilon Decay (ε-decay): This is the rate at which ε decays after each episode. At the beginning of the agent’s learning process, we want ε to start high so that each episode makes only small changes to the Prob table, because we want our agent to explore new actions. This helps ensure the final policy isn’t skewed heavily by randomness early in the learning process. For example, we don’t want a few lucky “hit” actions at player-hand-value = 18 early in the learning process to convince our agent that hitting is correct in that position in the long run. We reduce ε as the learning process goes on using ε-decay, because we want the agent to exploit the accurate insights it gained during its earlier exploration phase.
  • Epsilon Minimum (ε-min): The explore vs. exploit dynamic is very delicate; the transition from exploring to exploiting can be very sudden if you are not careful. ε-min is a floor below which ε is never allowed to decay. Because a lower ε produces larger swings in the Prob table, this floor limits how much any one episode can alter the probability of an action for some state, and it keeps the agent from ever abandoning exploration completely.
  • Gamma (γ): In a given episode (or round) of Blackjack, the AI agent will make more than just one decision in some cases. Let’s say our AI agent hits when the player hand value = 4, and also makes 2 more decisions after that. The agent gets a reward at the very end of this episode. How much is the initial “hit” action responsible for the final reward? γ helps explain this. We use γ as a discount rate on the final rewards of an episode to approximate the reward of the initial “hit” action. γ must be > 0 and ≤ 1.
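To show how α, γ, ε, ε-decay, and ε-min fit together, here is a hedged sketch of one plausible end-of-episode update, written in the Monte Carlo style this series builds toward. The exact update rule our agent will use is covered later in the series, so treat the function names and details below as illustrative assumptions, not the author’s implementation.

```python
import numpy as np


def update_after_episode(episode, Q_table, prob_table, alpha, gamma, epsilon):
    """Illustrative Monte Carlo-style update (an assumption, not necessarily
    the exact rule used later in this series). `episode` is the list of
    (state, action, reward) tuples collected during one round of Blackjack."""
    G = 0.0
    # Walk the episode backwards, discounting by gamma so that early actions
    # are only partially credited with the final outcome of the round.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G

        # Alpha weights how far the stored Q-value moves toward the return G.
        Q_table[state][action] += alpha * (G - Q_table[state][action])

        # Epsilon-greedy-style probability update: the currently-best action
        # keeps most of the probability mass and the rest is spread evenly.
        # A higher epsilon leaves the probabilities closer to 50/50 (a smaller
        # change), matching the description of epsilon above.
        n_actions = len(Q_table[state])
        greedy_action = int(np.argmax(Q_table[state]))
        prob_table[state] = np.full(n_actions, epsilon / n_actions)
        prob_table[state][greedy_action] += 1.0 - epsilon


def decay_epsilon(epsilon, epsilon_decay, epsilon_min):
    """Shrink epsilon after each episode, but never below epsilon_min, so the
    agent keeps exploring (a little) even late in training."""
    return max(epsilon * epsilon_decay, epsilon_min)
```

An outer training loop would simply play one round, call update_after_episode on the tuples it collected, then call decay_epsilon before the next round; repeating this for many episodes is the entire learning process.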

The variables above should be thought of as levers: they can be increased or decreased to experiment with the agent’s learning process. Later, we will go over which combination of these levers yields the best policy and highest returns in Blackjack.
