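Parameter-Exploring Policy Gradient Algorithm Based on Value Function Estimation

2023-12-31 Zhao Tingting, Yang Mengnan, Chen Yarui, Wang Yuan, Yang Jucheng

Application Research of Computers, 2023, Issue 8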

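Abstract: Large variance in the gradient estimate is a common problem of policy gradient algorithms. Policy gradients with parameter-based exploration (PGPE) mitigates this problem effectively by using a deterministic policy. However, PGPE estimates the policy gradient with the Monte Carlo method and needs a large number of samples to keep the estimate reasonably stable, so the large variance of the gradient estimate hinders its application to real-world problems. To further reduce the variance of the PGPE gradient estimate, this paper proposes function approximation for policy gradients with parameter-based exploration (PGPE-FA), which introduces the Actor-Critic framework into PGPE. Specifically, the proposed method estimates the policy gradient with a value function instead of the trajectory samples used by PGPE, which reduces the variance of the gradient estimate. Finally, experiments verify that the proposed algorithm reduces the variance of the gradient estimate.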

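Keywords: reinforcement learning; value function; policy gradients with parameter-based exploration; variance of gradient estimates

CLC number: TP181    Document code: A

Article ID: 1001-3695(2023)08-025-2404-07

doi: 10.19734/j.issn.1001-3695.2022.11.0781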

Function approximation for policy gradients with parameter-based exploration

Zhao Tingting, Yang Mengnan, Chen Yarui, Wang Yuan, Yang Jucheng

(College of Artificial Intelligence, Tianjin University of Science & Technology, Tianjin 300457, China)

Abstract: Policy gradient algorithms suffer from large variance in gradient estimation. Policy gradients with parameter-based exploration (PGPE) mitigates this problem to some extent. However, PGPE estimates its gradient with the Monte Carlo method, which requires a large number of samples to achieve a fairly stable policy update and thus hinders its application to real-world problems. To further reduce the variance of the policy gradient, this paper proposes function approximation for policy gradients with parameter-based exploration (PGPE-FA), which implements PGPE in the Actor-Critic framework. More specifically, the proposed method uses a value function to estimate the policy gradient, instead of the trajectory samples that PGPE uses, thereby reducing the variance of gradient estimation. Finally, experiments verify that the proposed algorithm reduces the variance of gradient estimation.

Key words: reinforcement learning; value function; policy gradients with parameter-based exploration; variance of gradient estimates

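0 Introduction

Reinforcement learning (RL) [1] is a learning paradigm in which an agent learns by interacting with the environment through trial and error [2,3]. Its goal is to find an optimal policy that maximizes the agent's expected cumulative reward. With the introduction of deep neural networks, deep reinforcement learning has achieved breakthroughs in business [4], games [5-9], control [10,11], and other fields.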

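The main purpose of reinforcement learning is to learn an optimal policy that obtains the maximum cumulative reward. According to how the policy is learned, reinforcement learning algorithms fall into two classes: value-based methods [12], which mainly handle problems with discrete spaces, and policy-based methods [13], which mainly handle problems with continuous action spaces. Value-based algorithms are traditional reinforcement learning algorithms that were proposed as early as the late 1980s and have been widely used. The most representative ones include Q-Learning proposed by Watkins et al. [14], SARSA proposed by Rummery et al. [15], and the deep Q-network (DQN) proposed by DeepMind [16,17]. These methods first perform policy evaluation to obtain the state value function or the action value function, and then use the value function to improve the current policy. Because they improve the policy by finding the action that maximizes the value function, they have difficulty handling continuous actions. Value-based algorithms are therefore not directly applicable to intelligent control systems such as robots. Policy-based algorithms address this limitation by learning the policy directly, which makes them suitable for complex decision-making tasks with continuous action spaces [1]. So far, the most representative policy search algorithms include REINFORCE [18], trust region policy optimization (TRPO) [19], and proximal policy optimization (PPO) [20].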

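Among policy-based methods, policy gradient algorithms are the most practical, the easiest to implement, and the most widely used. Because the policy changes gradually under their updates, they keep the system stable, which makes them particularly suitable for decision and control problems in complex intelligent systems such as robots [21]. However, REINFORCE, the classical policy gradient algorithm proposed by Williams [22], has an excessively large gradient estimation variance, which makes the algorithm unstable and slow to converge. REINFORCE estimates the policy gradient by the Monte Carlo (MC) method from sampled real trajectories. Because both the environment and the policy are stochastic, a single policy can produce many different trajectories and trajectory returns. REINFORCE therefore needs a large number of real trajectory samples to obtain an accurate and stable gradient estimate. Collecting that many samples, however, is a bottleneck for reinforcement learning in practical applications, and insufficient interaction introduces large variance into the trajectory returns, which in turn leads to a large variance in the gradient estimate.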

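To address the large variance of the gradient estimate in policy gradient algorithms, Sehnke et al. [23] proposed policy gradients with parameter-based exploration (PGPE). PGPE produces low-variance gradient estimates by removing unnecessary randomness from the policy and introducing useful randomness through a prior distribution over the policy parameters. Specifically, PGPE learns the prior distribution of the policy parameters, samples policy parameters from it, and then uses a deterministic policy. This alleviates, to some extent, the large gradient estimation variance that REINFORCE incurs by using a stochastic policy. However, PGPE still computes the policy gradient from real trajectory returns: it has to sample many policy parameters from the prior distribution and generate many trajectories and their returns to keep the gradient estimate stable. Like REINFORCE, PGPE thus estimates the policy gradient with the MC method, and this kind of update usually requires continual interaction with the environment and a large number of samples to keep the gradient estimate accurate. In real applications sampling is usually expensive and time-consuming, so algorithms that update the gradient with the MC method commonly suffer from large gradient estimation variance caused by insufficient samples. In addition, a complete trajectory must be sampled under a policy before its cumulative return can be computed and used in the gradient, so MC-based policy gradient estimation usually has low sample efficiency.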
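The Monte Carlo PGPE estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 1-D task in `rollout_return`, the Gaussian prior with mean `eta` and standard deviation `tau`, the mean baseline, and the helper `gradient_std` are all assumptions introduced here. The two gradient expressions are the standard log-derivatives of a Gaussian with respect to its mean and standard deviation.

```python
import numpy as np

def rollout_return(theta, horizon=20):
    """Return of one rollout on a hypothetical 1-D task (not from the paper).

    The policy a = theta * s is deterministic, so all randomness comes
    from how theta itself was sampled.
    """
    s, total = 1.0, 0.0
    for _ in range(horizon):
        a = theta * s                      # deterministic linear policy
        s = s + a                          # assumed toy dynamics
        total += -(s ** 2) - 0.1 * a ** 2  # penalise distance and effort
    return total

def pgpe_gradient(eta, tau, n_samples, rng):
    """Monte Carlo estimate of the return gradient w.r.t. the Gaussian
    prior's mean eta and standard deviation tau."""
    theta = rng.normal(eta, tau, size=n_samples)   # one parameter draw per rollout
    returns = np.array([rollout_return(t) for t in theta])
    adv = returns - returns.mean()                 # mean baseline (an assumption)
    grad_eta = np.mean(adv * (theta - eta) / tau ** 2)
    grad_tau = np.mean(adv * ((theta - eta) ** 2 - tau ** 2) / tau ** 3)
    return grad_eta, grad_tau

def gradient_std(n_samples, n_trials=50, seed=0):
    """Spread of the eta-gradient estimate across independent repetitions."""
    rng = np.random.default_rng(seed)
    grads = [pgpe_gradient(-0.5, 0.1, n_samples, rng)[0] for _ in range(n_trials)]
    return float(np.std(grads))
```

Comparing `gradient_std(10)` with `gradient_std(200)` on this toy task gives a rough feel for how strongly the spread of the estimate depends on the number of complete rollouts per update, which is the sample cost the paragraph describes.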

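Actor-Critic (AC) methods, on the other hand, incorporate a value function to mitigate the large estimation variance and slow learning of the policy gradient methods above [24,25]. In essence, the AC framework introduces a value function into a policy-based method. As the analysis of REINFORCE and PGPE above shows, limits on sampling time and cost make it impractical to sample enough trajectories to estimate the expected trajectory return accurately. In reinforcement learning, the state-action value function represents the expected cumulative return obtained when the agent starts from state s, takes an action, and then follows the policy. Sutton et al. [24] therefore proposed learning a value function and using it in the gradient computation to reduce the variance of the gradient estimate. The value function in the AC framework is usually learned by temporal difference (TD) learning, which estimates the current value from the value of the successor state. The value function can thus be updated at every step instead of waiting until the end of an episode as the MC method does, which greatly speeds up learning [26]. Moreover, the AC framework can be combined with deep learning: thanks to their strong representation ability, deep neural networks can fit both the value function over different states and actions and the policy itself, so reinforcement learning algorithms that incorporate deep neural networks achieve even better performance [27].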
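The difference between TD and MC value updates can be illustrated on a small tabular problem. The sketch below assumes a five-state random walk (a textbook example, not an environment from the paper), a discount factor of 1, and a constant step size; the names `run_episode`, `td0`, and `mc` are hypothetical. TD(0) updates from each single transition using the successor state's current value, while the MC update needs the return that is known only once the episode has ended.

```python
import numpy as np

N_STATES = 5                                          # non-terminal states 0..4
TRUE_V = np.arange(1, N_STATES + 1) / (N_STATES + 1)  # exact values 1/6 .. 5/6

def run_episode(rng):
    """Random walk from the middle state; reward 1 only when exiting on the right.
    Returns a list of (state, reward, next_state), next_state None on termination."""
    s, steps = N_STATES // 2, []
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next < 0:
            steps.append((s, 0.0, None))
            return steps
        if s_next >= N_STATES:
            steps.append((s, 1.0, None))
            return steps
        steps.append((s, 0.0, s_next))
        s = s_next

def td0(n_episodes, alpha, rng):
    """TD(0): every transition yields an update that bootstraps from V[s_next]."""
    V = np.full(N_STATES, 0.5)
    for _ in range(n_episodes):
        for s, r, s_next in run_episode(rng):
            target = r if s_next is None else r + V[s_next]   # gamma = 1
            V[s] += alpha * (target - V[s])
    return V

def mc(n_episodes, alpha, rng):
    """Every-visit MC: updates use the return, available only at episode end."""
    V = np.full(N_STATES, 0.5)
    for _ in range(n_episodes):
        steps = run_episode(rng)
        G = sum(r for _, r, _ in steps)   # all reward arrives on the last step
        for s, _, _ in steps:
            V[s] += alpha * (G - V[s])
    return V
```

Because the walk's transitions do not depend on `V`, the TD updates here could equally be applied online as each transition arrives; they are batched per episode only to keep the code short.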

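The AC framework combines the advantages of policy-based and value-based methods: value-based methods estimate with a value function, which gives smaller variance and higher sample efficiency, while policy-based methods can handle continuous spaces and converge well [28]. In the AC framework, the Actor plays the role of the policy and controls how the agent generates actions, while the Critic evaluates the agent's actions with the value function and guides the Actor in improving the policy. Because the Critic estimates the expected return, the Actor's gradient updates have smaller variance, which speeds up learning. AC methods are usually regarded as policy-based methods that can solve decision problems with either discrete or continuous action spaces; what distinguishes them is that they use the value as a baseline for the policy gradient, which is how policy-based methods reduce estimation variance [28]. Owing to these advantages, many improved AC algorithms have been developed in recent years. The most representative ones include deterministic policy gradient (DPG) [29] and its improved variant deep deterministic policy gradient (DDPG) [30], asynchronous advantage Actor-Critic (A3C) [31], and twin delayed deep deterministic policy gradient (TD3) [32]. However, all of these methods essentially estimate the policy gradient on the basis of REINFORCE.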

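In summary, this paper uses the strong representation ability of deep neural networks to learn the hyperparameters of PGPE and to fit the value function, and uses the learned function to guide the policy update, which yields an AC framework based on PGPE with better performance. Specifically, on the one hand, the paper estimates the policy gradient with a value function instead of the trajectory samples used by conventional PGPE, which reduces the variance of the gradient estimate and speeds up learning. On the other hand, it adopts the reparameterization trick used in the variational auto-encoder (VAE) [33], which makes it possible to learn the PGPE hyperparameters with a neural network and further improves the performance of PGPE. Finally, extensive experiments verify the effectiveness and accuracy of the proposed algorithm.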
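The reparameterization trick mentioned above can be illustrated with a toy example. The sketch assumes a Gaussian prior with mean `eta` and standard deviation `tau`, and replaces the learned critic with a hypothetical quadratic `critic(theta) = -(theta - 3)^2` whose derivative is written by hand; none of this is the paper's network. Writing a sampled parameter as `theta = eta + tau * eps` with `eps` drawn from a standard normal lets the gradient pass through the critic, and the per-sample terms can be compared with those of the score-function (likelihood-ratio) estimator that MC-based PGPE relies on.

```python
import numpy as np

def critic(theta):
    """Hypothetical differentiable stand-in for a learned value function."""
    return -(theta - 3.0) ** 2

def critic_grad(theta):
    """Hand-written derivative of the toy critic with respect to theta."""
    return -2.0 * (theta - 3.0)

def reparam_terms(eta, tau, eps):
    """Per-sample pathwise gradient terms for (eta, tau)."""
    theta = eta + tau * eps            # reparameterized sample
    dq = critic_grad(theta)
    return dq, dq * eps                # chain rule: d theta/d eta = 1, d theta/d tau = eps

def score_terms(eta, tau, eps):
    """Per-sample score-function gradient terms for (eta, tau)."""
    theta = eta + tau * eps
    q = critic(theta)
    # Gaussian log-density derivatives, written in terms of eps.
    return q * eps / tau, q * (eps ** 2 - 1.0) / tau
```

For this critic the exact gradients at `eta = 0`, `tau = 1` are 6 and -2, since the expected value is `-((eta - 3)^2 + tau^2)`. Averaging either set of terms over many draws of `eps` should approach those values, and comparing their per-sample standard deviations shows how much less noisy the pathwise terms are on this particular function.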

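1 Background

1.1 Reinforcement learning formulation

1.2 Conventional policy gradient methods

1.3 Policy gradients with parameter-based exploration

2 Policy gradients with parameter-based exploration based on value function estimation

3 Experimental results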

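This paper first verifies the effectiveness of the proposed algorithm on a continuous chain-walk task for a robot. It then further explores the performance advantages of the algorithm on the classic double inverted pendulum balancing problem and analyzes the experimental results.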

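3.1 Continuous chain-walk experiment

3.1.1 Environment setup

3.1.2 Performance comparison

3.1.3 Variance

3.1.4 Hyperparameter update trajectories

3.1.5 Estimated gradient direction

3.2 Double inverted pendulum balancing problem

3.2.1 Environment setup

3.2.2 Performance comparison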

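The results show that DPG converges fastest but performs worst. It converges quickly because it adopts the AC framework and uses a deterministic policy, which greatly reduces the variance of the gradient estimate. However, DPG performs no exploration and easily gets trapped in a local optimum, which explains its poor final performance. PGPE converges relatively slowly, but by introducing a prior distribution over the policy parameters it adds the necessary exploration and therefore outperforms DPG. The proposed PGPE-FA performs best and also converges faster than PGPE. The main reason is that it adopts the AC framework and uses a Q-function to estimate the expected return that guides the policy update, which gives smaller variance and faster convergence, while the prior distribution over the policy parameters adds the necessary exploration. As a result, it achieves better performance and faster convergence.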

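4 Conclusion

To address the large variance of the policy gradient estimate in PGPE, this paper proposed a parameter-exploring policy gradient algorithm based on value function estimation. Specifically, it introduced the Actor-Critic framework, that is, a value function, into PGPE. Estimating the policy gradient with the value function reduces the variance of the PGPE gradient estimate and speeds up the convergence of PGPE. Experiments showed that introducing the value function effectively alleviates the large gradient estimation variance of PGPE. In future work, the authors will study how to add target networks to the Actor and Critic networks of PGPE-FA to make the algorithm more stable.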

Appendix

Proof under the average-reward setting:

This completes the proof.

References:

[1] Zhao Tingting, Wu Shuai, Yang Mengnan, et al. Intention based reinforcement learning by information maximization[J]. Application Research of Computers, 2022, 39(11): 3327-3332, 3364. (in Chinese)

[2] He Li, Shen Liang, Li Hui, et al. The policy reuse in reinforcement learning: research progress[J]. Systems Engineering and Electronics, 2022, 44(3): 884-899. (in Chinese)

[3] Kong Songtao, Liu Chichi, Shi Yong, et al. A survey on the application of deep reinforcement learning in intelligent manufacturing[J]. Computer Engineering and Applications, 2021, 57(2): 49-59. (in Chinese)

[4] Silver D, Newnham L, Barker D, et al. Concurrent reinforcement learning from customer interactions[C]//Proc of International Conference on Machine Learning. 2013: 924-932.

[5] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.

[6] Silver D, Hubert T, Schrittwieser J, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play[J]. Science, 2018, 362(6419): 1140-1144.

[7] Ye Deheng, Chen Guibin, Zhao Peilin, et al. Supervised learning achieves human-level performance in MOBA games: a case study of honor of kings[J]. IEEE Trans on Neural Networks and Learning Systems, 2022, 33(3): 908-918.

[8] Ye Deheng, Liu Zhao, Sun Mingfei, et al. Mastering complex control in MOBA games with deep reinforcement learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(4): 6672-6679.

[9] Vinyals O, Babuschkin I, Czarnecki W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.

[10] Levine S, Pastor P, Krizhevsky A, et al. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection[J]. The International Journal of Robotics Research, 2018, 37(4-5): 421-436.

[11] Levine S, Finn C, Darrell T, et al. End-to-end training of deep visuomotor policies[J]. The Journal of Machine Learning Research, 2016, 17(1): 1334-1373.

[12] Liu Quan, Zhai Jianwei, Zhang Zongchang, et al. A survey on deep reinforcement learning[J]. Chinese Journal of Computers, 2018, 41(1): 1-27. (in Chinese)

[13] Liu Jianwei, Gao Feng, Luo Xionglin, et al. Survey of deep reinforcement learning based on value function and policy gradient[J]. Chinese Journal of Computers, 2019, 42(6): 1406-1438. (in Chinese)

[14] Watkins C J C H, Dayan P. Q-learning[J]. Machine Learning, 1992, 8(3): 279-292.

[15] Rummery G A, Niranjan M. On-line Q-learning using connectionist systems[M]. Cambridge, UK: University of Cambridge, 1994.

[16] Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning[C]//Proc of Workshops at the 26th Neural Information Processing Systems. 2013: 201-220.

[17] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.

[18] Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3): 229-256.

[19] Schulman J, Levine S, Abbeel P, et al. Trust region policy optimization[C]//Proc of International Conference on Machine Learning. 2015: 1889-1897.

[20] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[EB/OL]. (2017). https://arxiv.org/abs/1707.06347.

[21] Peters J, Schaal S. Policy gradient methods for robotics[C]//Proc of IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, NJ: IEEE Press, 2006: 2219-2225.

[22] Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3): 229-256.

[23] Sehnke F, Osendorfer C, Rückstieß T, et al. Parameter-exploring policy gradients[J]. Neural Networks, 2010, 23(4): 551-559.

[24] Sutton R S, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation[J]. Advances in Neural Information Processing Systems, 1999, 12(1): 1057-1063.

[25] Konda V R, Tsitsiklis J N. On actor-critic algorithms[J]. SIAM Journal on Control and Optimization, 2003, 42(4): 1143-1166.

[26] Zhao Tingting. Statistical policy search reinforcement learning methods and applications[M]. Beijing: Publishing House of Electronics Industry, 2021. (in Chinese)

[27] Yang Siming, Shan Zheng, Ding Yu, et al. A review of deep reinforcement learning[J]. Computer Engineering, 2021, 47(12): 19-29. (in Chinese)

[28] Li Ruyang, Peng Huimin, Li Rengang, et al. Overview on algorithms and applications for reinforcement learning[J]. Computer Systems & Applications, 2020, 29(12): 13-25. (in Chinese)

[29] Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms[C]//Proc of International Conference on Machine Learning. 2014: 387-395.

[30] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[EB/OL]. (2015). https://arxiv.org/abs/1509.02971.pdf.

[31] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]//Proc of International Conference on Machine Learning. 2016: 1928-1937.

[32] Fujimoto S, Hoof H, Meger D. Addressing function approximation error in actor-critic methods[C]//Proc of International Conference on Machine Learning. 2018: 1587-1596.

[33] Kingma D P, Welling M. Auto-encoding variational Bayes[EB/OL]. (2014). https://arxiv.org/abs/1312.6114.

[34] Kingma D P, Ba J. Adam: a method for stochastic optimization[EB/OL]. (2014). https://arxiv.org/abs/1412.6980.

[35] Cheng G, Hyon S H, Morimoto J, et al. CB: a humanoid research platform for exploring neuroscience[J]. Advanced Robotics, 2007, 21(10): 1097-1114.
