Unreliable? The Problem with Deep Deterministic Policy Gradients (DDPG)



The Deadlock Cycle


But it doesn’t stop there. We can see this saturation as a doorway to deadlock. Once our agent’s actor stabilizes to a suboptimal policy, DDPG perpetuates a cycle that is difficult to recover from. Here, we take a look at each of the cycle’s components; if you would like to see the rigorous mathematical derivations, feel free to take a look here.

Deadlock Cycle as Shown in [1]

1. Q Tends to Q Conditioned on Policy

As our critic continually updates its parameters, its output doesn’t converge to the true, optimal Q-value, but rather to the Q-value conditioned on our policy.

True Q-value vs. Q-value conditioned on the policy:

$$Q^*(s,a) = \mathbb{E}\left[\, r(s,a) + \gamma \max_{a'} Q^*(s',a') \,\right] \quad (1)$$

$$Q^{\pi}(s,a) = \mathbb{E}\left[\, r(s,a) + \gamma\, Q^{\pi}\!\left(s', \pi(s')\right) \,\right] \quad (2)$$

This intuitively makes sense. Looking at the critic update equation, we directly feed in our policy’s actions to calculate the target value. But taken by itself, this doesn’t seem to be much of an issue. There are many methods like SARSA that use on-policy updates similar to this, so what’s wrong?

Q-network update equations, as shown in [1]: the critic regresses toward a bootstrapped target built from the policy’s own action.

$$y = r(s,a) + \gamma\, Q_{\theta}\!\left(s', \pi_{\phi}(s')\right), \qquad \theta \leftarrow \theta - \alpha_{\text{critic}}\, \nabla_{\theta}\left(Q_{\theta}(s,a) - y\right)^2 \quad (3)$$

This part of the cycle is problematic because our actor is already saturated: our policy is stagnant. As a result, the algorithm keeps feeding our critic the same actions at every update, making the estimated Q-value stray from its true value.
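To make this concrete, here is a minimal PyTorch-style sketch of the critic step, assuming `critic`, `critic_target`, and `actor_target` networks and a sampled batch of transitions (all names are illustrative, not from [1]):

```python
import torch
import torch.nn.functional as F

def critic_update(critic, critic_target, actor_target, optimizer,
                  s, a, r, s_next, done, gamma=0.99):
    """One DDPG critic step: regress Q toward the bootstrapped target."""
    with torch.no_grad():
        # The target action comes from the policy itself, so a
        # saturated actor keeps feeding the same actions here.
        a_next = actor_target(s_next)
        y = r + gamma * (1.0 - done) * critic_target(s_next, a_next)

    q = critic(s, a)
    loss = F.mse_loss(q, y)  # drives Q toward Q^pi, not toward Q*

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```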

2. Estimated Q is Piecewise Constant

Looking at equation (2), we notice that, in sparse environments, the reward term very often takes on a constant value. Without loss of generality, we can set that constant to zero, since all rewards can be shifted accordingly.

Q-value conditioned on the policy with the value function substituted:

$$Q^{\pi}(s,a) = \gamma\, \mathbb{E}\left[\, Q^{\pi}\!\left(s', \pi(s')\right) \,\right] = \gamma\, \mathbb{E}\left[\, V^{\pi}(s') \,\right]$$

So, we’re left with the second term. Notice that this term is exactly the value function conditioned on our policy, since $Q^{\pi}(s', \pi(s')) = V^{\pi}(s')$. In sparse environments, this value function depends on only two things: the number of steps until a rewarded state and the value of that reward. Since the number of steps is an integer, the value function is piecewise constant, making the overall Q-value piecewise constant as well.
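A toy numeric sketch makes the piecewise structure visible. Assume a hypothetical 1-D chain where a fixed policy walks one unit per step toward a single goal reward (this example is illustrative, not from [1]): states that share the same number of remaining steps share the same value.

```python
import math

gamma, R = 0.99, 1.0  # discount factor, goal reward

def steps_to_goal(x, goal=10.0, step=1.0):
    # Number of unit steps the policy needs to reach the goal.
    return max(0, math.ceil((goal - x) / step))

for x in [0.0, 0.4, 0.9, 1.0, 1.5, 2.0]:
    k = steps_to_goal(x)
    print(f"x={x}: k={k:2d}, V^pi = {gamma**k * R:.4f}")

# x=0.0, 0.4, and 0.9 all share k=10 and hence the same value:
# V^pi is constant on each region of states with the same k.
```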

3. Critic Gradients Approach Zero

As our Q-value tends towards the Q-value conditioned on our policy, it becomes increasingly piecewise constant.

This is an issue.

Because of this, the critic’s local gradients become mostly flat, roughly equal to zero. A neural-network critic is a continuous function approximator, so it never becomes truly discontinuous and its gradients never perfectly vanish. Nevertheless, the critic is being trained to match this piecewise-constant function, so near-zero gradients are a valid approximation. Most importantly, the flatness prevents our agent from receiving any information on how to improve its policy.
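A finite-difference check on the same kind of toy value function shows the flatness directly: away from the jump points the slope is exactly zero, so a smooth critic fitted to it inherits near-zero gradients. (Again, a hypothetical sketch rather than code from [1].)

```python
import math

gamma = 0.99

def v_pi(x, goal=10.0, step=1.0):
    # Value under a fixed policy walking toward the goal:
    # gamma^(steps remaining), with reward R = 1 at the goal.
    k = max(0, math.ceil((goal - x) / step))
    return gamma ** k

eps = 1e-3
for x in [0.4, 0.9, 1.4, 1.0]:
    grad = (v_pi(x + eps) - v_pi(x - eps)) / (2 * eps)
    print(f"x={x}: dV/dx ~= {grad:.4f}")

# Prints ~0 everywhere except x=1.0, which straddles a jump in the
# number of remaining steps: the function is flat almost everywhere.
```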

4. Our Agent’s Policy Barely Changes

Then, we come full circle. Because DDPG is a deterministic algorithm, the Q-value is always differentiated exactly at state s and the policy’s own action π(s). Coupled with the fact that our Q-value gradients are very close to zero, this prevents the actor from properly updating its policy, even when rewarded transitions appear regularly in the replay buffer. And so we loop back to step one.

Policy update, as shown in [1]:

$$\phi \leftarrow \phi + \alpha_{\text{actor}}\, \mathbb{E}\left[\, \nabla_{a} Q_{\theta}(s,a)\big|_{a=\pi_{\phi}(s)}\; \nabla_{\phi}\, \pi_{\phi}(s) \,\right]$$
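In code, the actor step looks roughly like the following PyTorch-style sketch (names illustrative). The backward pass multiplies ∇_a Q by ∇_φ π, so a flat critic leaves the actor’s parameters essentially frozen:

```python
import torch

def actor_update(actor, critic, actor_optimizer, s):
    """One DDPG actor step: ascend Q(s, pi(s)) by minimizing its negation."""
    actor_loss = -critic(s, actor(s)).mean()

    actor_optimizer.zero_grad()
    # Chain rule: dQ/da, evaluated at a = pi(s), times dpi/dphi.
    # If dQ/da is ~0 (flat critic), the parameter update is ~0 too.
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```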
