Diving deeper into Unity-ML Agents


Train a curious agent to destroy Pyramids.


This article is the second chapter of a new free course on Deep Reinforcement Learning with Unity, where we'll create agents with TensorFlow that learn to play video games using the Unity game engine. Check the syllabus here.

If you have never studied Deep Reinforcement Learning before, you should first check out the free course Deep Reinforcement Learning with TensorFlow.

Last time, we learned how Unity ML-Agents works and trained an agent that learned to jump over walls.


This was a nice experience, but we want to create agents that can solve more complex tasks. So today we’ll train a smarter one that needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top.


To train this new agent, which must find the button and then the pyramid to destroy, we'll use a combination of two types of rewards: an extrinsic reward given by the environment, and an intrinsic reward called curiosity. This second reward pushes our agent to be curious or, in other terms, to better explore its environment.

So today we’ll learn about the theory behind this powerful idea of curiosity in deep reinforcement learning and we’ll train this curious agent.

Let’s get started!

What is Curiosity in Deep RL?

I've already covered curiosity in detail in two other articles, here and here, if you want to dive into the mathematical and implementation details.

Two Major Problems in Modern RL

To understand what curiosity is, we first need to understand the two major problems with RL:

First, the sparse rewards problem: that is, most rewards do not contain information, and hence are set to zero.

Remember that RL is based on the reward hypothesis, the idea that any goal can be described as the maximization of expected rewards. Rewards act as feedback for RL agents: if they don't receive any, their knowledge of which action is appropriate (or not) cannot change.

Thanks to the reward, our agent knows that the action it took in that state was good.
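Formally, the reward hypothesis says the agent's objective is to find a policy $\pi$ that maximizes the expected (discounted) sum of rewards:

$$\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right], \qquad 0 \le \gamma \le 1$$

where $r_t$ is the reward received at step $t$ and $\gamma$ is the discount factor.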

For instance, in VizDoom "DoomMyWayHome", your agent is only rewarded if it finds the vest. However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and it can spend time turning around without ever finding the goal.

A big thanks to Felix Steger for this illustration.
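Schematically (ignoring the exact values VizDoom uses), the reward function in this kind of environment looks like:

$$r_t = \begin{cases} +1 & \text{if the agent reaches the vest at step } t \\ 0 & \text{otherwise} \end{cases}$$

so almost every transition returns zero and carries no learning signal.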

The second big problem is that the extrinsic reward function is handmade: in each environment, a human has to implement a reward function. But how can we scale that to big and complex environments?

So what is curiosity?

A solution to these problems is to develop a reward function that is intrinsic to the agent, i.e., generated by the agent itself. The agent will act as a self-learner, since it is both the student and its own feedback master.

This intrinsic reward mechanism is known as curiosity, because it pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent will receive a high reward when exploring new trajectories.
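Roughly speaking, the agent then optimizes the sum of the two signals (in ML-Agents, each reward signal additionally has a configurable strength that weights its contribution):

$$r_t = r_t^{\,e} + \beta\, r_t^{\,i}$$

where $r_t^{e}$ is the extrinsic reward from the environment, $r_t^{i}$ is the intrinsic curiosity reward, and $\beta$ controls how much exploration we encourage.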

This curiosity reward is in fact modeled on how humans act: we naturally have an intrinsic desire to explore our environment and discover new things.

There are different ways to calculate this intrinsic reward; Unity ML-Agents implements curiosity through the next-state prediction method.

Curiosity Through Prediction-Based Surprise (or Next-State Prediction)

I've already covered this method here if you want to dive into the mathematical details.

We just said that curiosity is high when we are in unfamiliar/novel states. But how can we calculate this "unfamiliarity"?

We can calculate curiosity as the error our agent makes when predicting the next state, given the current state and the action taken. More formally, we can define it as:

$$r_t^{\,i} = \big\lVert \hat{s}_{t+1} - s_{t+1} \big\rVert^2$$

where $\hat{s}_{t+1}$ is the agent's prediction of the next state, given the current state $s_t$ and the action $a_t$.

Why? Because the idea of curiosity is to encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its own actions (uncertainty will be higher in areas where the agent has spent less time, or in areas with complex dynamics).

If the agent spends a lot of time in these states, it will be good at predicting the next state (low curiosity); on the other hand, if it is in a new, unexplored state, it will be bad at predicting the next state (high curiosity).

Let’s break it down further. Say you play Super Mario Bros:

  • If you spend a lot of time at the beginning of the game (which is not new to the agent), the agent will be able to accurately predict what the next state will be, so the reward will be low.
  • On the other hand, if you discover a new room, our agent will be very bad at predicting the next state, so the agent will be pushed to explore this room.


Using curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and consequently better explore our environment.

But because we can’t predict the next state by predicting the next frame (too complicated to predict pixels directly), we use a better feature representation that will keep only elements that can be controlled by our agent or affect our agent.


To calculate curiosity, we will use the module introduced in the paper Curiosity-driven Exploration by Self-supervised Prediction, called the Intrinsic Curiosity Module (ICM).


If you want to know how it works in detail, check our detailed article.
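To make this concrete, here is a minimal, simplified sketch of an ICM-style curiosity reward written with tf.keras. It is not the actual ML-Agents implementation (which has its own encoder sizes, loss weights, and integration with the trainer); the IntrinsicCuriosityModule class and the layer sizes below are illustrative assumptions. The core logic is: encode both states, train an inverse model to predict the action (which shapes the features), train a forward model to predict the next encoded state, and use the forward model's error as the intrinsic reward.

```python
import tensorflow as tf

class IntrinsicCuriosityModule(tf.keras.Model):
    """Simplified ICM sketch: curiosity = forward-model error in feature space."""

    def __init__(self, n_actions, feat_dim=128):
        super().__init__()
        self.n_actions = n_actions
        # phi(s): feature encoder shared by the inverse and forward models
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(feat_dim),
        ])
        # Inverse model: (phi(s_t), phi(s_t+1)) -> logits over the action taken
        self.inverse_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(n_actions),
        ])
        # Forward model: (phi(s_t), one-hot a_t) -> predicted phi(s_t+1)
        self.forward_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(feat_dim),
        ])

    def call(self, obs, next_obs, actions):
        # obs, next_obs: float tensors [batch, obs_dim]; actions: int tensor [batch]
        phi = self.encoder(obs)
        phi_next = self.encoder(next_obs)
        a_onehot = tf.one_hot(actions, self.n_actions)
        phi_next_pred = self.forward_model(tf.concat([phi, a_onehot], axis=-1))
        action_logits = self.inverse_model(tf.concat([phi, phi_next], axis=-1))

        # Intrinsic reward: squared prediction error of the next features
        curiosity = 0.5 * tf.reduce_sum(tf.square(phi_next_pred - phi_next), axis=-1)
        # Losses trained alongside the policy
        forward_loss = tf.reduce_mean(curiosity)
        inverse_loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=actions, logits=action_logits))
        return curiosity, forward_loss, inverse_loss
```

During training, the forward and inverse losses are minimized jointly with the policy loss, and the per-step curiosity value is scaled by a strength coefficient and added to the extrinsic reward before the policy update (total_reward = extrinsic_reward + strength * curiosity).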

Train an agent to destroy pyramids

Now that we understand what curiosity through next-state prediction is and how it works, let's train this new agent.

We published our trained models on GitHub; you can download them here.

The Pyramid Environment

The goal in this environment is to train our agent to get the gold brick on top of the pyramid. In order to do that, it needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top.


The reward system is a positive reward for reaching the gold brick plus a small negative reward at each step, which pushes the agent to reach the brick as quickly as possible.

In terms of observations, we use the raycast version: 148 raycasts, detecting the switch, bricks, the golden brick, and walls.

We also use a boolean variable indicating the switch state.

The action space is discrete, with 4 possible actions (moving forward/backward and rotating left/right).

Our goal is to hit the benchmark with a mean reward of 1.75.
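If you want to inspect these observation and action spaces yourself, you can connect to the running scene from Python with the low-level mlagents_envs API and send random actions. Note that this API has changed a lot between ML-Agents releases; the names below (behavior_specs, get_steps, ActionSpec.random_action) are those of more recent versions and are given as an illustrative sketch, not as the exact API used at the time of this article.

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to the Unity Editor: the call blocks until you press Play in the Editor.
env = UnityEnvironment(file_name=None)
env.reset()

# One behavior ("Brain") is shared by every Pyramids agent in the scene.
behavior_name = list(env.behavior_specs.keys())[0]
spec = env.behavior_specs[behavior_name]
print("Behavior:", behavior_name)
print("Discrete action branches:", spec.action_spec.discrete_branches)

# Step the environment a few times with random actions.
for _ in range(100):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    random_actions = spec.action_spec.random_action(len(decision_steps))
    env.set_actions(behavior_name, random_actions)
    env.step()

env.close()
```

Run the script with the Pyramids scene open, then press Play in the Editor when it starts waiting for a connection.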

Let’s destroy some pyramids!

First of all, let’s open the UnitySDK project.

In the examples, search for Pyramids and open the scene.

Like in WallJump, you can see a lot of Agents in the scene; each of them comes from the same Prefab, and they all share the same Brain (policy).

Multiple copies of the same Agent Prefab.

In fact, just as in classical Deep Reinforcement Learning we launch multiple instances of a game (for instance 128 parallel environments), here we copy and paste the agents in order to gather more varied states.

So first, because we want to train our agent from scratch, we need to remove the brain from the agent Prefab. Go to the Prefabs folder and open the Prefab.

Now in the Prefab hierarchy, select the Agent and go into the inspector.

In Behavior Parameters, we need to remove the Model. If you have a GPU, you can change Inference Device from CPU to GPU.


For this first training, we'll just modify the total number of training steps, since the default is higher than we need and we can hit the benchmark in only 500k training steps. To do that, go to config/trainer_config.yaml and change max_steps to 5.0e5 in the Pyramids section:

(The Pyramids section of config/trainer_config.yaml, with max_steps set to 5.0e5.)
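You can make this change in any text editor. As a quick illustration, the same edit can be scripted with PyYAML; the snippet below assumes your working directory is ml-agents-master and that the file has a Pyramids section (which, in recent ML-Agents versions, is also where the curiosity reward signal is enabled under reward_signals).

```python
import yaml

# Load the trainer configuration, shorten the Pyramids training run, save it back.
with open("config/trainer_config.yaml") as f:
    config = yaml.safe_load(f)

print("Current Pyramids hyperparameters:", config["Pyramids"])
config["Pyramids"]["max_steps"] = 5.0e5  # 500k steps is enough to hit the 1.75 benchmark

with open("config/trainer_config.yaml", "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)
```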

To train this agent, we will use PPO (Proximal Policy Optimization). If you don't know about it or you need to refresh your knowledge, check my article.
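As a quick reminder (this is standard PPO, nothing specific to ML-Agents), PPO limits the size of each policy update by clipping the probability ratio between the new and old policies:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping range.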

We saw that to train this agent, we need to call our External Communicator using the Python API. This External Communicator will then ask the Academy to start the agents.

So, open your terminal, go to where ml-agents-master is, and type this:

mlagents-learn config/trainer_config.yaml --run-id="Pyramids_FirstTrain" --train

It will then ask you to run the Unity scene:

Press the ▶ (Play) button at the top of the Editor.


You can monitor your training by launching TensorBoard with this command:

tensorboard --logdir=summaries

Watching your agent destroy pyramids

You can watch your agent during the training by looking at the game window.

When the training is finished, you need to move the saved model files contained in ml-agents-master/models to UnitySDK/Assets/ML-Agents/Examples/Pyramids/TFModels.

Then open the Unity Editor again and select the Pyramids scene.

Select the Pyramids prefab object and open it.

Select the Agent.

In the Agent's Behavior Parameters, drag the Pyramids.nn file into the Model placeholder.


Then, press the ▶ (Play) button at the top of the Editor.

Time for some experiments

We’ve just trained our agents to learn to jump over walls. Now that we have good results we can try some experiments.

Remember that the best way to learn is to be active by experimenting. So you should try to make some hypotheses and verify them.

By the way, there is an amazing video by Immersive Limit about hyperparameter tuning for the Pyramids environment that you should definitely watch.

Increasing the time horizon to 256

The time horizon, as explained in the documentation, is the number of steps of experience to collect per agent before adding it to the experience buffer. This trades off between a long time horizon (a less biased, but higher-variance estimate) and a short time horizon (a more biased, but lower-variance estimate).
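One way to see this trade-off: with a time horizon of $n$ steps, the return used for the update is (roughly) the truncated, bootstrapped estimate

$$G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(s_{t+n})$$

A larger $n$ relies less on the (possibly inaccurate) value estimate $V$, which lowers bias, but sums more noisy reward terms, which raises variance; a shorter horizon does the opposite.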

In this experiment, we doubled the time horizon from 128 to 256. Increasing it allows our agent to capture longer behaviors in its sequences of actions than before.

However, this didn’t have an impact on the training of our new agent.Indeed, they share quite the same results.

We published our trained models on GitHub; you can download them here.


That’s all for today!

You’ve just trained a smarter agent than last time. And you’ve also learned about Curiosity in Deep Reinforcement Learning. That’s awesome!

Now that we’ve done that, you might want to go deeper with Unity ML-Agents . Don’t worry, next time we’ll create our own environments and the article next we’ll create our own reinforcement learning implementations.

So in the next article, we'll create our first environment from scratch. What will this environment be? I don't want to spoil everything now, but here's a hint:

Say hello to Mr. Bacon 🐽

See you next time!

If you have any thoughts, comments, or questions, feel free to comment below, send me an email at hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep learning, stay awesome!

