A Random Skill Discovery Algorithm in Continuous Spaces

2016-04-12  LUAN Yonghong, LIU Quan, ZHANG Peng

Modern Electronics Technique, 2016, Issue 10

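Abstract: To address the "curse of dimensionality" that arises in large-scale continuous spaces as the number of states grows exponentially with the state dimension, an improved random skill discovery algorithm built on the Option framework for hierarchical reinforcement learning is proposed. Random Options are defined to generate a random skill tree, and a set of such random skill trees is constructed. The task goal is divided into subgoals, and learning low-level Option policies reduces the exponential growth in learned parameters that occurs as the agent scales up. The algorithm is evaluated in simulation on the task of planning the shortest path between two points in a two-dimensional continuous grid space with obstacles. The results show that, because the Options are defined randomly, the algorithm's initial performance is intermittently unstable; however, as the set of random skill trees grows, it converges fairly quickly to a near-optimal solution and effectively overcomes the difficulty of finding an optimal policy, or the overly slow convergence, caused by the curse of dimensionality.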

Keywords: reinforcement learning; Option; continuous space; random skill discovery

CLC number: TN911-34; TP18    Document code: A    Article ID: 1004-373X(2016)10-0014-04

A random skill discovery algorithm in continuous spaces

LUAN Yonghong 1，2， LIU Quan2，3， ZHANG Peng2

（1. Suzhou Institute of Industrial Technology， Suzhou 215104， China； 2. Institute of Computer Science and Technology， Soochow University， Suzhou 215006， China； 3. MOE Key Laboratory of Symbolic Computation and Knowledge Engineering， Jilin University， Changchun 130012， China）

Abstract: To address the "curse of dimensionality" that arises in large, continuous spaces as the state space grows exponentially with the state dimension, an improved random skill discovery algorithm based on the Option framework for hierarchical reinforcement learning is proposed. Random Options are defined to generate a random skill tree, and a set of random skill trees is constructed. The task goal is divided into several sub-goals, and learning low-level Option policies reduces the exponential growth in learning parameters caused by the growth of the agent. Simulation experiments and analysis take as the task the planning of a shortest path between two points in a two-dimensional continuous grid space with obstacles. The experimental results show that the algorithm has some intermittent instability in its initial performance because the Options are defined randomly, but that it converges quickly to a near-optimal solution as the set of random skill trees grows, which effectively overcomes the difficulty of obtaining the optimal policy and the slow convergence caused by the curse of dimensionality.

Keywords： reinforcement learning； Option； continuous space； random skill discovery

0 Introduction

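Reinforcement learning (RL) [1-2] is a method in which an agent learns a policy mapping states to actions through direct interaction with its environment. Classical RL algorithms try to find one optimal policy over the entire domain. This works well in small or discrete environments, but in large and continuous state spaces it runs into the "curse of dimensionality". To address this and related problems, researchers have proposed state clustering, search over restricted policy spaces, value function approximation, and hierarchical reinforcement learning, among other methods [3]. The hierarchy in hierarchical reinforcement learning is essentially built by adding abstraction mechanisms on top of RL; that is, it uses both the primitive actions of RL and higher-level skill actions [3] (also called Options).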
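To make the "curse of dimensionality" concrete, the following minimal sketch counts the entries a tabular Q-function would need if each state dimension were discretized into a fixed number of bins. The function name `tabular_q_entries` and the settings used (10 bins per dimension, 4 primitive actions) are illustrative assumptions, not values taken from the paper.

```python
def tabular_q_entries(bins_per_dim: int, n_dims: int, n_actions: int) -> int:
    """Number of Q(s, a) entries for a uniform grid over the state space.

    Hypothetical helper for illustration: a grid with `bins_per_dim` cells
    along each of `n_dims` dimensions has bins_per_dim ** n_dims states,
    and each state needs one entry per action.
    """
    return (bins_per_dim ** n_dims) * n_actions


if __name__ == "__main__":
    # Assumed settings: 10 bins per dimension, 4 primitive actions.
    for n_dims in (2, 4, 6):
        print(n_dims, tabular_q_entries(10, n_dims, 4))
```

Each additional state dimension multiplies the table size by the number of bins, which is why purely tabular methods stop being practical in large or continuous spaces.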

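One of the main research goals of hierarchical reinforcement learning is the automatic discovery of hierarchical skills. Although many hierarchical RL methods have been studied in recent years, most of them look for hierarchical skills in relatively small, discrete domains. For example, Simsek, Osentoski, and others find subgoals by partitioning a local state-transition graph built from recent experience [4-5]. McGovern and Barto select subgoals according to how frequently states occur [6]. Matthew proposed taking states that are visited frequently on successful paths as subgoals, and Jong and Stone proposed selecting subgoals based on the irrelevance of state variables [7]. All of these methods, however, target relatively small, discrete RL domains. In 2009, Konidaris and Barto proposed a skill discovery method for continuous RL spaces, called skill chaining [8]. In 2010, Konidaris further proposed the CST algorithm, which uses change-point detection [9] to segment each solution trajectory into skills; this approach is limited to cases in which the trajectories are not too long and can actually be obtained.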

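This paper presents a random skill discovery algorithm for continuous RL domains. It exploits the adaptive and hierarchically optimal character of Option-based hierarchical reinforcement learning: each high-level skill is defined as an Option, and these Options are defined randomly, so the complexity of the method is proportional to the number of Options constructed for a complex learning domain. A randomly chosen Option may not be the most suitable one, but because the Options form not just a single skill tree but a set of skill trees, this shortcoming is compensated for.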
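One possible way to picture a set of random skill trees is sketched below. It is not the authors' implementation: the sampling scheme (uniform random subgoals in the unit square, each attached to the nearest existing node, in the spirit of a rapidly-exploring random tree) and all names (`SkillNode`, `build_random_skill_tree`, `build_skill_tree_set`, `chain_to_goal`) are assumptions made purely for illustration.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass(eq=False)
class SkillNode:
    """One randomly defined option, identified here only by its subgoal point."""
    subgoal: Point
    parent: Optional["SkillNode"] = None
    children: List["SkillNode"] = field(default_factory=list)


def build_random_skill_tree(goal: Point, n_options: int,
                            rng: random.Random) -> List[SkillNode]:
    """Grow one tree rooted at the task goal (assumed construction)."""
    nodes = [SkillNode(goal)]
    for _ in range(n_options):
        # Assumption: subgoals are sampled uniformly in the unit square.
        point = (rng.random(), rng.random())
        # Assumption: attach the new subgoal to the nearest existing node.
        nearest = min(nodes, key=lambda n: math.dist(n.subgoal, point))
        child = SkillNode(point, parent=nearest)
        nearest.children.append(child)
        nodes.append(child)
    return nodes


def build_skill_tree_set(goal: Point, n_trees: int, n_options: int,
                         seed: int = 0) -> List[List[SkillNode]]:
    """Build several independent random trees that share the same goal."""
    rng = random.Random(seed)
    return [build_random_skill_tree(goal, n_options, rng) for _ in range(n_trees)]


def chain_to_goal(node: SkillNode) -> List[Point]:
    """Subgoals visited when following parent links from `node` to the root."""
    path = []
    current: Optional[SkillNode] = node
    while current is not None:
        path.append(current.subgoal)
        current = current.parent
    return path


trees = build_skill_tree_set(goal=(0.9, 0.9), n_trees=3, n_options=5)
```

In this sketch the cost of construction grows with the number of options per tree and the number of trees, and a poorly placed random subgoal in one tree can be offset by a better-placed one in another, which mirrors the motivation given above for using a set of trees rather than a single one.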

1 Hierarchical reinforcement learning and the Option framework

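The core idea of hierarchical reinforcement learning (HRL) is to introduce abstraction mechanisms that decompose the overall learning task. In HRL methods, the agent can work not only with the given set of primitive actions but also with higher-level skills.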
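An Option is commonly described by three components: an initiation set, an internal policy over primitive actions, and a termination condition. The sketch below is a minimal illustration of that triple in a toy two-dimensional continuous space. The class `Option`, the helper `run_option`, the toy dynamics `step`, and the step size 0.125 are assumptions for this example rather than part of the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]


@dataclass
class Option:
    initiation: Callable[[State], bool]    # I: may the option start in this state?
    policy: Callable[[State], Action]      # pi: primitive action chosen in this state
    termination: Callable[[State], float]  # beta: probability of stopping in this state


def step(state: State, action: Action) -> Tuple[State, float]:
    """Toy dynamics (assumed): move inside the unit square, reward -1 per step."""
    x = min(1.0, max(0.0, state[0] + action[0]))
    y = min(1.0, max(0.0, state[1] + action[1]))
    return (x, y), -1.0


def run_option(option: Option, state: State, rng: random.Random,
               max_steps: int = 100) -> Tuple[State, float, int]:
    """Follow the option's policy until its termination condition fires."""
    if not option.initiation(state):
        raise ValueError("option cannot be initiated in this state")
    total_reward = 0.0
    for n_steps in range(1, max_steps + 1):
        state, reward = step(state, option.policy(state))
        total_reward += reward
        if rng.random() < option.termination(state):
            return state, total_reward, n_steps
    return state, total_reward, max_steps


# Example high-level skill: starting in the left half, move right until x >= 0.5.
go_right = Option(
    initiation=lambda s: s[0] < 0.5,
    policy=lambda s: (0.125, 0.0),
    termination=lambda s: 1.0 if s[0] >= 0.5 else 0.0,
)

final_state, total_reward, n_steps = run_option(go_right, (0.0, 0.5), random.Random(0))
```

From the point of view of a higher-level policy, `go_right` behaves like a single extended action: it is selected once and returns only when its termination condition is met.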

4 Conclusion

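The experimental results show that the random skill discovery (RSD) algorithm, by using a set of random skill trees and learning a low-level Option policy for each leaf, significantly improves performance on RL problems in continuous domains. Compared with other skill discovery methods, the advantage of RSD is that it handles continuous RL domains better within the Option framework and creates Options automatically without analyzing the graph or the values of a training set. It therefore reduces the burden of searching for specific Options, is better suited to large-scale or continuous state spaces, and can be applied to more difficult domains.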

References

[1] SUTTON R S, BARTO A G. Reinforcement learning: An introduction [M]. Cambridge, MA: MIT Press, 1998.

[2] KAELBLING L P, LITTMAN M L, MOORE A W. Reinforcement learning: A survey [EB/OL]. [1996-05-01]. http://www.cs.cmu.edu/afs/cs...vey.html.

[3] BARTO A G, MAHADEVAN S. Recent advances in hierarchical reinforcement learning [J]. Discrete Event Dynamic Systems, 2003, 13(4): 341-379.

[4] SIMSEK O, WOLFE A P, BARTO A G. Identifying useful subgoals in reinforcement learning by local graph partitioning [C]// Proceedings of the 22nd International Conference on Machine Learning. USA: ACM, 2005, 8: 816-823.

[5] OSENTOSKI S, MAHADEVAN S. Learning state-action basis functions for hierarchical MDPs [C]// Proceedings of the 24th International Conference on Machine Learning. USA: ACM, 2007, 7: 705-712.

[6] MCGOVERN A, BARTO A. Automatic discovery of subgoals in reinforcement learning using diverse density [C]// Proceedings of the 18th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 2001: 361-368.

[7] JONG N K, STONE P. State abstraction discovery from irrelevant state variables [J]. IJCAI, 2005, 8: 752-757.

[8] KONIDARIS G, BARTO A G. Skill discovery in continuous reinforcement learning domains using skill chaining [J]. NIPS, 2009, 8: 1015-1023.

[9] KONIDARIS G, KUINDERSMA S, BARTO A G, et al. Constructing skill trees for reinforcement learning agents from demonstration trajectories [J]. NIPS, 2010, 23: 1162-1170.

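[10] LIU Quan, YAN Qicui, FU Yuchen, et al. A hierarchical reinforcement learning method based on a heuristic reward function [J]. Journal of Computer Research and Development, 2011, 48(12): 2352-2358.

[11] SHEN Jing, LIU Haibo, ZHANG Rubo, et al. Multi-robot hierarchical reinforcement learning based on semi-Markov games [J]. Journal of Shandong University (Engineering Science), 2010, 40(4): 1-7.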

[12] KONIDARIS G, BARTO A. Efficient skill learning using abstraction selection [C]// Proceedings of the 21st International Joint Conference on Artificial Intelligence. Pasadena, CA, USA: [S.l.], 2009: 1107-1113.

[13] XIAO Ding, LI Yitong, SHI Chuan. Autonomic discovery of subgoals in hierarchical reinforcement learning [J]. Journal of China Universities of Posts and Telecommunications, 2014, 21(5): 94-104.

[14] CHEN Chunlin, DONG Daoyi, LI Hanxiong, et al. Hybrid MDP based integrated hierarchical Q-learning [J]. Science China (Information Sciences), 2011, 54(11): 2279-2294.
