Poke-Agent: Pokemon Battling & Reinforcement Learning


Defining The Problem

Pokemon battling involves choosing the best move each turn, given the current state of both teams. The best move could be to use a super-effective move, or it could be to switch out to another Pokemon (if you’re expecting a super-effective move on your own Pokemon).

Learning to play Pokemon is a complex task even for humans, so we’ll focus on one mechanic in this article: type effectiveness.

The scenario: We’ll give the model, Poke-Agent, a Squirtle and have it try to defeat a Charmander. The Squirtle will know Scratch, Growl, and Water Gun, so the optimal strategy is to just spam Water Gun: as a Water-type move, it is super-effective against a Fire-type Pokemon like Charmander.

There are other strategies that can win, like spamming Scratch or using a combination of all three moves; however, those carry more risk and will result in more losses. The optimal strategy wins in 3 turns.

Learning

As Poke-Agent is playing through Pokemon battles, we can categorize its experiences into states and actions. The state of the game (amount of HP left, what Pokemon are on the field, etc.) will inform what action Poke-Agent will take. As such, we can have Poke-Agent assign a value to each state-action pairing to indicate how good an action is for a particular state: the higher the value, the better that action is for that state. When it needs to make a decision, it can just pick the action with the highest value.
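As a tiny sketch of that decision rule (the Q-values here are made-up numbers, just to show the argmax):

```python
import torch

# Made-up Q-values for the three moves in the scenario above
q_values = torch.tensor([0.12, -0.40, 0.95])   # Scratch, Growl, Water Gun
actions = ["scratch", "growl", "watergun"]

# Greedy policy: pick the action with the highest Q-value
best_action = actions[q_values.argmax().item()]
print(best_action)  # -> watergun
```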

This process is a type of reinforcement learning called Q-Learning; the previously mentioned values are called Q-values. When we want to update these Q-values, we use this function:

Q-Learning Update Function
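For reference, the standard form of the Q-Learning update is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

where α is the learning rate, r is the reward received for the action just taken, and γ discounts the value of the best action available in the next state.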

This might look intimidating at first, but the intuition is quite simple. After an experience, we give the model a reward based on whether it exhibited the desired behavior. We then nudge the Q-value for the state-action pair the model just experienced toward that reward plus the (discounted) value of the best action available in the next state.

So, if the model performed an action that results in a desired outcome, like using water gun to win a Pokemon battle, we expect to gain a reward after performing that action, so the Q-value for that state and action increases. The opposite is true for a negative outcome, like losing a battle: we’ll expect a negative reward, so the resulting Q-value for that state and action will decrease.

We’ll use MSE (mean squared error) as our loss function:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

where Yᵢ is the old Q-value and Ŷᵢ is the new one.
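As a small numerical sketch of how this loss gets used (the Q-values, reward, and discount factor below are made up for illustration):

```python
import torch
import torch.nn.functional as F

gamma = 0.9                                  # discount factor (assumed value)

q_old = torch.tensor([0.10, -0.20, 0.50])    # current Q-values: Scratch, Growl, Water Gun
q_next = torch.tensor([0.00, 0.00, 0.00])    # Q-values for the next state (battle is over)

reward = 1.0                                 # we won, so a positive reward
action = 2                                   # index of the move we used (Water Gun)

# New (target) Q-value: reward plus the discounted best value of the next state
target = reward + gamma * q_next.max()

# MSE between the old Q-value for the chosen action and the new target
loss = F.mse_loss(q_old[action], target)
print(loss.item())                           # (0.5 - 1.0)^2 = 0.25
```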

The Architecture

Normally, the Q-values are kept in a table in memory; however, Pokemon battles have far too many different state-action pairs for that to be tractable. Instead, we’ll use a neural network to learn a representation of the game that can be used to calculate good Q-values.


Poke-Agent Architecture

I decided to keep Poke-Agent’s architecture simple until it seems like it needs to be more complex.

Input: First, events from each turn will be translated into vectors and used as inputs to the model.

Embedding: The turn information will be passed to an embedding layer so that the model can create a representation of the concepts in Pokemon battles (Pokemon names, moves, status conditions, etc.). The hope is that this representation will group similar concepts together: the representations for Squirtle and Water Gun should be similar.

Linear: This layer is where the Q-values are actually calculated; it encapsulates the model’s decision-making process.

Output: The linear layer will produce the Q-values for each action the model can take. We’ll interpret the highest value as the decision it wants to make.

Here’s the PyTorch code:
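A minimal sketch of the architecture described above (an embedding layer feeding a small linear stack that outputs one Q-value per action); the embedding size, hidden size, and the mean-pooling over a turn’s events are illustrative assumptions, not values from the original:

```python
import torch
import torch.nn as nn


class PokeAgent(nn.Module):
    """Embeds a turn's events, then maps them to one Q-value per action."""

    def __init__(self, vocab_size, embedding_dim=32, hidden_dim=64, num_actions=3):
        super().__init__()
        # One learned vector per concept (Pokemon, move, status condition, battle event)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Decision-making layers: produce a Q-value for each move (Scratch, Growl, Water Gun)
        self.layers = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, turn_tokens):
        # turn_tokens: (batch, seq_len) integer IDs for the turn's events
        embedded = self.embedding(turn_tokens)   # (batch, seq_len, embedding_dim)
        pooled = embedded.mean(dim=1)            # average over the turn's events (an assumption)
        return self.layers(pooled)               # (batch, num_actions) Q-values


# Quick shape check with fake inputs
model = PokeAgent(vocab_size=500)                 # 500 is a placeholder
q_values = model(torch.randint(0, 500, (1, 10)))  # one turn of 10 event tokens
print(q_values.shape)                             # torch.Size([1, 3])
```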

vocab_size is the number of concepts in Pokemon. It’s a compilation of Pokemon, move, status condition, and battle event names.

nn.ReLU is a non-linear activation function that allows the model to learn more complex relationships.

The Environment


Pokemon Showdown

Pokemon Showdown is an online Pokemon battle simulator; it’s what we’ll use to run our battles.

Training

We’ll have two agents: the Poke-Agent, and a random agent that just chooses random moves.

Here’s the training process:

  1. Instantiate a battle between the two agents
  2. Let the ensuing battle unfold as both agents make decisions over time
  3. Use the last 2 turns as inputs to Poke-Agent to update its Q-values
  4. Repeat

We use only the last 2 turns because that’s when we can assign a reward based on whether or not the model won.
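Tying this together with the PokeAgent sketch from earlier, one training update might look roughly like this. The reward scheme, discount factor, and the way turns are encoded into token tensors are assumptions for illustration, and the Showdown plumbing that would produce `turns` is omitted (fake data is used instead):

```python
import torch
import torch.nn.functional as F

model = PokeAgent(vocab_size=500)   # reuses the sketch above; 500 is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
gamma = 0.9                          # assumed discount factor


def train_on_battle(turns, won):
    """Q-value update from the last two turns of a finished battle.

    `turns` is a list of (turn_tokens, action_index) pairs, where turn_tokens
    is a 1-D LongTensor of event IDs produced by hypothetical parsing code.
    """
    (state, action), (next_state, _) = turns[-2], turns[-1]
    reward = 1.0 if won else -1.0    # assumed reward scheme

    q_values = model(state.unsqueeze(0)).squeeze(0)          # Q-values for the chosen turn
    with torch.no_grad():
        next_q = model(next_state.unsqueeze(0)).squeeze(0)   # no gradient through the target

    # New Q-value target: reward plus discounted value of the best next action
    target = reward + gamma * next_q.max()
    loss = F.mse_loss(q_values[action], target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example with fake data: two turns of 10 event tokens each, action 2 = Water Gun
fake_turns = [(torch.randint(0, 500, (10,)), 2), (torch.randint(0, 500, (10,)), 2)]
print(train_on_battle(fake_turns, won=True))
```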

After ~80 battles, here’s the training loss:

Training Loss

At first, the battles were pretty even, since both the random agent and Poke-Agent were essentially choosing moves at random. At ~50 battles, the model learned that Water Gun leads to quick victories: the average turn count per battle went from ~10 to ~3, with Poke-Agent winning every time.

Unfortunately, I don’t have an explanation for what happened at the ~58th battle to cause such a spike in training loss. Maybe that was when it learned about Water Gun!

Final Thoughts

It’s really encouraging to see the loss go down and Poke-Agent start winning consistently! There’s still a long way to go, both in terms of game concepts and architecture, before it can play a real Pokemon battle against a human.

Right now, the model only makes a decision based on the current game state, but it might be useful to give it a series of turns on which to base its decisions. It would also be interesting to have it learn from a series of actions instead of just one, as in the current training paradigm.

There’s still lots of research to dive into and implement, and lots of experiments to try. I will be teaching Poke-Agent more advanced strategies like how to switch out — follow me on twitter to stay tuned for the next article.

