Poke-Agent: Pokemon Battling & Reinforcement Learning


Defining The Problem

Pokemon battling involves choosing the best move each turn, given the current state of both teams. The best move could be to use a super-effective move, or it could be to switch out to another Pokemon (if you’re expecting a super-effective move on your own Pokemon).

Learning to play Pokemon is a complex task even for humans, so we’ll focus on one mechanic in this article: type effectiveness.

The scenario: We’ll give the model, Poke-Agent, a Squirtle and have it try to defeat a Charmander. The Squirtle will know Scratch, Growl, and Water Gun, so the optimal strategy is to just spam Water Gun: as a Water-type move, it is super-effective against a Fire-type Pokemon like Charmander.

There are other strategies that can win, like spamming Scratch or using a combination of all three moves; however, those carry more risk and will result in more losses. The optimal strategy wins in 3 turns.

Learning

As Poke-Agent is playing through Pokemon battles, we can categorize its experiences into states and actions. The state of the game (amount of HP left, what Pokemon are on the field, etc.) will inform what action Poke-Agent will take. As such, we can have Poke-Agent assign a value to each state-action pairing to indicate how good an action is for a particular state: the higher the value, the better that action is for that state. When it needs to make a decision, it can just pick the action with the highest value.
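As a tiny sketch of that decision rule (the Q-values here are made-up numbers, just to show the argmax):

```python
import torch

# Made-up Q-values for the three moves in the scenario above
q_values = torch.tensor([0.12, -0.40, 0.95])   # Scratch, Growl, Water Gun
actions = ["scratch", "growl", "watergun"]

# Greedy policy: pick the action with the highest Q-value
best_action = actions[q_values.argmax().item()]
print(best_action)  # -> watergun
```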

This process is a type of reinforcement learning called Q-Learning; the previously mentioned values are called Q-values. When we want to update these Q-values, we use this function:

Q-Learning Update Function
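For reference, the standard form of the Q-Learning update is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

where α is the learning rate, r is the reward received for the action just taken, and γ discounts the value of the best action available in the next state.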

This might look intimidating at first, but the intuition is quite simple. After an experience, we give the model a reward based on whether it exhibited the desired behavior. We then nudge the Q-value for the state-action pair the model just experienced toward that reward plus the (discounted) value of the best action available in the next state.

So, if the model performed an action that results in a desired outcome, like using water gun to win a Pokemon battle, we expect to gain a reward after performing that action, so the Q-value for that state and action increases. The opposite is true for a negative outcome, like losing a battle: we’ll expect a negative reward, so the resulting Q-value for that state and action will decrease.

We’ll use MSE (mean squared error) as our loss function:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

where Yᵢ is the old Q-value and Ŷᵢ is the new one.
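As a small numerical sketch of how this loss gets used (the Q-values, reward, and discount factor below are made up for illustration):

```python
import torch
import torch.nn.functional as F

gamma = 0.9                                  # discount factor (assumed value)

q_old = torch.tensor([0.10, -0.20, 0.50])    # current Q-values: Scratch, Growl, Water Gun
q_next = torch.tensor([0.00, 0.00, 0.00])    # Q-values for the next state (battle is over)

reward = 1.0                                 # we won, so a positive reward
action = 2                                   # index of the move we used (Water Gun)

# New (target) Q-value: reward plus the discounted best value of the next state
target = reward + gamma * q_next.max()

# MSE between the old Q-value for the chosen action and the new target
loss = F.mse_loss(q_old[action], target)
print(loss.item())                           # (0.5 - 1.0)^2 = 0.25
```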

The Architecture

Normally, the Q-values are kept in a table in memory; however, Pokemon battles have far too many different state-action pairs for that to be tractable. Instead, we’ll use a neural network to learn a representation of the game that can be used to calculate good Q-values.


Poke-Agent Architecture

I decided to keep Poke-Agent’s architecture simple until it seems like it needs to be more complex.

Input: First, events from each turn will be translated into vectors and used as inputs to the model.

Embedding: The turn information will be passed to an embedding layer so that the model can create a representation of the concepts in Pokemon battles (Pokemon names, moves, status conditions, etc.). The hope is that this representation will group similar concepts together: the representations for Squirtle and Water Gun should be similar.

Linear: This layer is where the Q-values are actually calculated; it encapsulates the model’s decision-making process.

Output: The linear layer will produce the Q-values for each action the model can take. We’ll interpret the highest value as the decision it wants to make.

Here’s the PyTorch code:
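A minimal sketch of the architecture described above (an embedding layer feeding a small linear stack that outputs one Q-value per action); the embedding size, hidden size, and the mean-pooling over a turn’s events are illustrative assumptions, not values from the original:

```python
import torch
import torch.nn as nn


class PokeAgent(nn.Module):
    """Embeds a turn's events, then maps them to one Q-value per action."""

    def __init__(self, vocab_size, embedding_dim=32, hidden_dim=64, num_actions=3):
        super().__init__()
        # One learned vector per concept (Pokemon, move, status condition, battle event)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Decision-making layers: produce a Q-value for each move (Scratch, Growl, Water Gun)
        self.layers = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, turn_tokens):
        # turn_tokens: (batch, seq_len) integer IDs for the turn's events
        embedded = self.embedding(turn_tokens)   # (batch, seq_len, embedding_dim)
        pooled = embedded.mean(dim=1)            # average over the turn's events (an assumption)
        return self.layers(pooled)               # (batch, num_actions) Q-values


# Quick shape check with fake inputs
model = PokeAgent(vocab_size=500)                 # 500 is a placeholder
q_values = model(torch.randint(0, 500, (1, 10)))  # one turn of 10 event tokens
print(q_values.shape)                             # torch.Size([1, 3])
```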

vocab_size is the number of concepts in Pokemon. It’s a compilation of Pokemon, move, status condition, and battle event names.

nn.ReLU is a non-linear activation function that allows the model to learn more complex relationships.

The Environment


Pokemon Showdown

Pokemon Showdown is an online Pokemon battle simulator; it’s what we’ll use to run our battles.

Training

We’ll have two agents: the Poke-Agent, and a random agent that just chooses random moves.

Here’s the training process:

  1. Instantiate a battle between the two agents
  2. Let the ensuing battle unfold as both agents make decisions over time
  3. Use the last 2 turns as inputs to Poke-Agent to update its Q-values
  4. Repeat

We use only the last 2 turns because that’s when we can assign a reward based on whether or not the model won.
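Tying this together with the PokeAgent sketch from earlier, one training update might look roughly like this. The reward scheme, discount factor, and the way turns are encoded into token tensors are assumptions for illustration, and the Showdown plumbing that would produce `turns` is omitted (fake data is used instead):

```python
import torch
import torch.nn.functional as F

model = PokeAgent(vocab_size=500)   # reuses the sketch above; 500 is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
gamma = 0.9                          # assumed discount factor


def train_on_battle(turns, won):
    """Q-value update from the last two turns of a finished battle.

    `turns` is a list of (turn_tokens, action_index) pairs, where turn_tokens
    is a 1-D LongTensor of event IDs produced by hypothetical parsing code.
    """
    (state, action), (next_state, _) = turns[-2], turns[-1]
    reward = 1.0 if won else -1.0    # assumed reward scheme

    q_values = model(state.unsqueeze(0)).squeeze(0)          # Q-values for the chosen turn
    with torch.no_grad():
        next_q = model(next_state.unsqueeze(0)).squeeze(0)   # no gradient through the target

    # New Q-value target: reward plus discounted value of the best next action
    target = reward + gamma * next_q.max()
    loss = F.mse_loss(q_values[action], target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example with fake data: two turns of 10 event tokens each, action 2 = Water Gun
fake_turns = [(torch.randint(0, 500, (10,)), 2), (torch.randint(0, 500, (10,)), 2)]
print(train_on_battle(fake_turns, won=True))
```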

After ~80 battles, here’s the training loss:

Training Loss

At first, the battles were pretty even, since both the random agent and Poke-Agent were essentially choosing moves at random. At ~50 battles, the model learned that Water Gun leads to quick victories: the average turn count per battle went from ~10 to ~3, with Poke-Agent winning every time.

Unfortunately, I don’t have an explanation for what happened at the ~58th battle to cause such a spike in training loss. Maybe that was when it learned about Water Gun!

Final Thoughts

It’s really encouraging to see the loss go down and Poke-Agent start winning consistently! There’s still a long way to go, both in terms of game concepts and architecture, before it can play a real Pokemon battle against a human.

Right now, the model only makes a decision based on the current game state, but it might be useful to give it a series of turns on which to base its decisions. It would also be interesting to have it learn from a series of actions instead of just one, as in the current training paradigm.

There’s still lots of research to dive into and implement, and lots of experiments to try. I will be teaching Poke-Agent more advanced strategies like how to switch out — follow me on twitter to stay tuned for the next article.

