Abstract: In unmanned aerial vehicle (UAV) formation tracking missions, false data injection (FDI) attackers can inject misleading data into the control commands, preventing the UAVs from forming the specified formation configuration, so a secure formation tracking controller needs to be designed. To this end, this paper models the attack-defense process as a zero-sum graphical game in which the FDI attacker and the secure controller are the game players: the attacker aims to maximize a prescribed cost function, while the secure controller pursues the opposite objective. Solving the game and obtaining the optimal secure control policy depend on solving the Hamilton-Jacobi-Isaacs (HJI) equation, which is a coupled partial differential equation and difficult to solve directly; therefore, a finite-time convergent online reinforcement learning algorithm combined with an experience replay mechanism is introduced, a critic-only neural network is designed to approximate the value function, and the optimal secure control policy is obtained. Finally, simulations verify the effectiveness of the algorithm.
Keywords: FDI attack; multi-UAVs; online reinforcement learning; optimal control; zero-sum graphical game
CLC number: V249.1    Document code: A    DOI: 10.19452/j.issn1007-5453.2024.04.004
Foundation items: National Natural Science Foundation of China (62073269); Aeronautical Science Foundation of China (2020Z034053002); Key Research and Development Program of Shaanxi Province (2022GY-244); Natural Science Foundation of Chongqing (CSTB2022NSCQ-MSX0963); Guangdong Basic and Applied Basic Research Foundation (2023A1515011220)
As a typical unmanned system, unmanned aerial vehicles (UAVs) have been widely applied in agriculture and forestry operations, power-line inspection, post-disaster search and rescue, target reconnaissance, cooperative combat, and other fields [1]. Compared with a single UAV, which has limited onboard equipment, a small sensing range, and poor fault tolerance in task execution, a multi-UAV system integrating information fusion, target assignment, cooperative control, and other technologies can accomplish diverse tasks in complex environments. However, information exchange among UAVs relies on communication networks, so multi-UAV systems face the threat of cyber attacks. An attacker can inject deceptive data into the UAV transmission channels to degrade system performance or even cause mission failure, so it is essential to design secure control schemes against false data injection (FDI) attacks.
At present, there are two main types of secure control schemes against FDI attacks, distinguished by whether an attack detection mechanism is introduced [2-3]. For UAV systems, Lin Hong et al. [4] designed an attack detector and a linear quadratic Gaussian controller to resist FDI attacks. Xiao Jiaping et al. [5] proposed a novel attack detector based on sliding innovation sequences. To save network communication resources, Yin Tingting et al. [6] studied secure formation control of UAVs under an event-triggered mechanism.
With the development of artificial intelligence, reinforcement learning has attracted considerable attention for its decision-optimization and real-time policy-selection capabilities, and an increasing number of researchers have applied it to control problems [7-10]. For secure control problems, Wu Chengwei et al. [11] used a Q-learning algorithm to study secure controller design when the control signals are subject to FDI attacks. Zhou Yuanqiang et al. [12] designed a detection mechanism based on a threat-detection level function and solved for the optimal secure controller with an off-policy algorithm. For multi-agent systems, Moghadam et al. [13] solved the non-homogeneous game Riccati equation with an off-policy algorithm and designed a resilient controller. Considering the event-triggered mechanism, Xu Yuanyuan et al. [14] combined a preselector and an observer to study the design of optimal control laws when sensors are subject to FDI attacks.
Existing studies have paid little attention to the fast convergence of reinforcement learning. Kokolakis et al. [15] designed a finite-time convergent reinforcement learning algorithm, but did not consider the multi-UAV case. This paper studies the secure formation tracking control problem of multiple UAVs subject to cyber attacks under a leader-follower multi-agent framework. Considering the adversarial relationship between the FDI attacker and the secure controller, zero-sum graphical game theory is introduced to model the attack-defense process, and the optimal attack and the optimal secure controller are located at the Nash equilibrium. To solve the game and further obtain the optimal secure control law, a finite-time convergent reinforcement learning method is introduced: the value function is approximated by a critic-only neural network architecture, an experience replay mechanism is adopted to maintain the persistent excitation condition, and the convergence of the algorithm as well as an upper bound on the settling time are analyzed.
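To make the game formulation above concrete, a minimal sketch in standard zero-sum differential game notation is given below, where the local formation tracking error $\delta_i$, control input $u_i$, attack input $a_i$, weighting matrices $Q_i$, $R_i$, and attenuation level $\gamma$ are assumed symbols for illustration rather than the exact definitions introduced in the following sections:
\[
J_i(u_i, a_i) = \int_{t}^{\infty} \left( \delta_i^{\top} Q_i \delta_i + u_i^{\top} R_i u_i - \gamma^{2} a_i^{\top} a_i \right) \mathrm{d}\tau ,
\]
\[
J_i(u_i^{*}, a_i) \le J_i(u_i^{*}, a_i^{*}) \le J_i(u_i, a_i^{*}) .
\]
Here the secure controller $u_i$ minimizes $J_i$ while the FDI attacker $a_i$ maximizes it; the saddle-point pair $(u_i^{*}, a_i^{*})$ constitutes the Nash equilibrium, and the corresponding value function satisfies the associated HJI equation.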
1 代數圖論







6 Conclusion
This paper has studied the optimal secure tracking control problem for UAVs whose control signals are subject to FDI attacks. Zero-sum graphical game theory was used to model the attack-defense interaction between the attacker and the secure controller, and a finite-time convergent reinforcement learning method with a critic-only neural network architecture was introduced to solve for the optimal secure controller online. In future work, the design of the optimal secure formation tracking controller could be realized with a model-free online reinforcement learning method.

[1]Wang Haijun, Zhao Haitao, Zhang Jiao, et al. Survey on unmanned aerial vehicle networks: A cyber physical system perspective [J]. IEEE Communications Surveys & Tutorials, 2020, 22(2): 1027-1070.
[2]Li Xiaomeng, Zhou Qi, Li Panshuo, et al. Event-triggered consensus control for multi-agent systems against false data-injection attacks [J]. IEEE Transactions on Cybernetics, 2020, 50(5): 1856-1866.
[3]Tan Yushun, Liu Qingyi, Liu Jinliang, et al. Observer-based security control for interconnected semi-Markovian jump systems with unknown transition probabilities [J]. IEEE Transactions on Cybernetics, 2022, 52(9): 9013-9025.
[4]Lin Hong, Sun Pei, Cai Chenxiao, et al. Secure LQG control for a quadrotor under false data injection attacks [J]. IET Control Theory & Applications, 2022, 16(9): 925-934.
[5]Xiao Jiaping, Feroskhan M. Cyber attack detection and isolation for a quadrotor UAV with modified sliding innovation sequences [J]. IEEE Transactions on Vehicular Technology, 2022, 71(7): 7202-7214.
[6]Yin Tingting, Gu Zhou, Park J H, et al. Event-based intermittent formation control of multi-UAV systems under deception attacks [J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 12: 1-12.
[7]Lewis F L, Vrabie D, Vamvoudakis K G, et al. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers [J]. IEEE Control Systems Magazine, 2012, 32(6): 76-105.
[8]Peng Zhinan, Luo Rui, Hu Jiangping, et al. Optimal tracking control of nonlinear multiagent systems using internal reinforce Q-learning [J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(8): 4043-4055.
[9]Xie Kedi, Yu Xiao, Lan Weiyao. Optimal output regulation for unknown continuous-time linear systems by internal model and adaptive dynamic programming [J]. Automatica, 2022, 146: 1-7.
[10]Wei Qinglai, Zhu Liao, Li Tao, et al. A new approach to finite-horizon optimal control for discrete-time affine nonlinear systems via a pseudolinear method [J]. IEEE Transactions on Automatic Control, 2022, 67(5): 2610-2617.
[11]Wu Chengwei, Li Xiaolei, Pan Wei, et al. Zero-sum game based optimal secure control under actuator attacks[J]. IEEE Transactions on Automatic Control, 2021, 66(8): 3773-3780.
[12]Zhou Yuanqiang, Vamvoudakis K G, Haddad W M, et al. A secure control learning framework for cyber-physical systems under sensor and actuator attacks [J]. IEEE Transactions on Cybernetics, 2021, 51(9): 4648-4660.
[13]Moghadam R, Modares H. Resilient autonomous control of distributed multiagent systems in contested environments [J]. IEEE Transactions on Cybernetics, 2019, 49(11): 3957-3967.
[14]Xu Yuanyuan, Li Tieshan, Yang Yue, et al. Simplified ADP for event-triggered control of multiagent systems against FDI attacks [J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(8): 4672-4683.
[15]Kokolakis N-M T, Vamvoudakis K G. Safety-aware pursuit-evasion games in unknown environments using Gaussian processes and finite-time convergent reinforcement learning [J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3130-3143.
Optimal Secure Tracking Control in Multi-UAVs Based on Online Reinforcement Learning
Gong Zhenyu, Yang Feisheng
Northwestern Polytechnical University, Xi’an 710072, China
Abstract: In unmanned aerial vehicle (UAV) formation tracking missions, false data injection (FDI) attackers can inject misleading data into the control commands, preventing the UAVs from forming the specified formation configuration, so a secure formation tracking controller needs to be designed. The attack-defense process is modeled as a zero-sum graphical game in which the FDI attacker and the secure controller are the game players: the attacker aims to maximize the cost function, while the secure controller pursues the opposite objective. Solving the game and acquiring the optimal secure control policy rely on solving the Hamilton-Jacobi-Isaacs (HJI) equation, a coupled partial differential equation that is difficult to solve directly. Therefore, a finite-time convergent online reinforcement learning algorithm with an experience replay mechanism is introduced, and a critic-only neural network is used to approximate the value function and obtain the optimal secure control policy. A numerical simulation demonstrates the effectiveness of the proposed scheme.
Key Words: FDI attack; multi-UAVs; online reinforcement learning; optimal control; zero-sum graphical game