CLC number: TP391   Document code: A
Article ID: 1001-3695(2023)08-007-2286-06
doi:10.19734/j.issn.1001-3695.2023.01.0002
Speech emotion recognition based on multi-modal fusion of graph neural network
Li Zijing, Chen Ning
(School of Information Science & Engineering, East China University of Science & Technology, Shanghai 200237, China)
Abstract: At present, speech emotion recognition models based on multi-modal fusion generally suffer from the inability to make full use of the commonality and complementarity between multi-modal features, the inability to effectively optimize and aggregate sample features by exploiting the topological structure among them, and high model complexity. Therefore, this paper introduced a graph neural network. On the one hand, in the feature optimization stage, it used the text features optimized by the graph neural network as a shared representation to reconstruct the adjacency matrix based on acoustic features, so that the topological structure of the acoustic features contained text information, thus achieving multi-modal fusion. On the other hand, in the label prediction stage, it used the graph neural network to fully aggregate the similarity information contained in the neighbors of the current node so as to optimize the features of the current node globally and improve the accuracy of emotion recognition. At the same time, in order to prevent the over-smoothing problem that might occur during the training of the graph neural network, it performed graph augmentation before training. The experimental results on the public datasets IEMOCAP and RAVDESS show that the proposed model achieves higher recognition accuracy and lower model complexity than the baseline models, and that each component of the model contributes to the improvement of model performance.
Key words: speech emotion recognition; multi-modal feature; graph neural network; graph augmentation
0 Introduction
Speech emotion recognition (SER) is a hot topic in affective computing [1]. It aims to use computers to simulate human thinking by extracting features from speech signals and building a mapping between these features and specific emotions, thereby recognizing the emotion carried by speech. SER technology has a wide range of applications in intelligent systems. For example, in healthcare, the emotional state of patients with affective disorders can be recognized by analyzing their speech, assisting psychiatrists in diagnosis [2]; in distance education, a teacher's in-class performance can be evaluated by analyzing the emotion in his or her speech [3]; in intelligent customer service, the results of customer speech emotion recognition can be used to adjust agents' service strategies and improve customer satisfaction [4]; and in driving, the driver's emotional state can be monitored in real time through speech so that a voice alert can be issued as soon as a negative emotion appears, promoting safe driving [5]. Traditional machine-learning-based SER models mostly extract acoustic features from the raw audio, such as Mel-frequency cepstral coefficients (MFCC) and filter bank features (fbanks), and feed them into a label predictor such as a Gaussian mixture model (GMM) [6], a hidden Markov model (HMM) [7] or a support vector machine (SVM) [8] for emotion recognition. With the growth of computing power, deep learning has become the mainstream approach: researchers have successively applied convolutional neural networks (CNN) [9~11], long short-term memory networks (LSTM) [12,13] and hybrid CNN-LSTM models [14,15] to the SER task, significantly improving recognition accuracy.
In recent years, graph convolutional networks (GCN) [16] have developed rapidly, since they can exploit the topological structure among node features to optimize those features and improve classification accuracy, while having few parameters and being easy to train. They have recently been introduced into the SER task. For example, Shirian et al. [17] built a graph over the extracted acoustic features and used a GCN for graph classification to recognize speech emotion. Liu et al. [18] built a graph over the high-level acoustic features extracted by an LSTM and used a graph isomorphism network (GIN) for speech emotion recognition.
Since human emotion can be expressed through multiple modalities such as speech, facial expressions and text, speech emotion recognition models based on multi-modal information fusion have recently been proposed to fully exploit the commonality and complementarity of different modalities in expressing emotion [19~22]. Yoon et al. [19] built separate deep neural networks to learn acoustic and text features and concatenated them as a multi-modal feature for emotion recognition. Li et al. [20] further proposed the FG-CME model, which introduces a cross-modal feature fusion module that cross-fuses the high-level acoustic and text features to improve recognition performance.
Although the FG-CME model [20] outperforms single-modality speech emotion recognition models, there is still considerable room for improvement, mainly in two respects: a) the cross-modal feature fusion module of FG-CME applies linear transformations to the extracted acoustic and text features separately, computes cross dot-products between the transformed features of one modality and the untransformed features of the other to obtain two features that fuse speech and text information, and finally concatenates these two features into the fused feature; this fusion method (sketched below) has high computational complexity, which is unfavorable for real-time operation, and it does not further optimize the multi-modal features to fully exploit their commonality and complementarity; b) FG-CME uses an LSTM network as the label predictor, which ignores the role that similarity relations among samples in the sample set can play in optimizing sample features, making it difficult to achieve ideal recognition performance.
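For concreteness, the following is a rough sketch of the kind of pairwise cross dot-product fusion described in point a). The projection layers, the softmax normalization, and the assumption that the two feature sequences are aligned to the same length T are illustrative choices rather than the exact FG-CME implementation; the sketch mainly shows why the cross dot-product incurs a cost quadratic in the sequence length.

```python
import torch
import torch.nn as nn

def cross_dot_product_fusion(acoustic: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """Illustrative cross-modal fusion: linearly transform each modality, take a cross
    dot-product with the untransformed features of the other modality, and concatenate
    the two resulting fused features. acoustic: (T, d_a); text: (T, d_t), assumed aligned."""
    d_a, d_t = acoustic.size(-1), text.size(-1)
    proj_a = nn.Linear(d_a, d_t, bias=False)            # linear transform of acoustic features
    proj_t = nn.Linear(d_t, d_a, bias=False)            # linear transform of text features
    att_a2t = torch.softmax(proj_a(acoustic) @ text.T, dim=-1)    # (T, T) interaction matrix
    att_t2a = torch.softmax(proj_t(text) @ acoustic.T, dim=-1)    # (T, T) interaction matrix
    fused_text = att_a2t @ text                         # text features weighted by acoustic cues
    fused_acoustic = att_t2a @ acoustic                 # acoustic features weighted by text cues
    return torch.cat([fused_acoustic, fused_text], dim=-1)        # (T, d_a + d_t) fused feature

# Example with random frame-level features
fused = cross_dot_product_fusion(torch.randn(120, 40), torch.randn(120, 300))
```

The two (T, T) interaction matrices are the source of the quadratic cost criticized above.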
To address these problems, this paper proposes GO-GAGCN, a model that combines graph optimization (GO) with a graph-augmentation GCN (GAGCN). The GO module optimizes the text features and uses the optimized text features to reconstruct the adjacency matrix of the acoustic features, so that the topological structure of the acoustic features carries text information, exploiting the commonality and complementarity among multi-modal features more fully than FG-CME does. The GAGCN module then performs label prediction: it exploits the topological structure among node features to optimize them globally, and uses graph augmentation to alleviate the overfitting problem [23,24] that may occur during GCN training, further improving recognition accuracy. In summary, the main contributions of this paper are as follows:
a) A speech emotion recognition model based on multi-modal fusion with graph neural networks is proposed, which makes better use of multi-modal feature information and has good real-time performance.
b) The GO module is used to optimize and reconstruct the multi-modal features, so as to fully exploit the commonality and complementarity among them.
c) The GAGCN module is used for label prediction: the GCN optimizes the node features globally, and graph augmentation is applied to alleviate the overfitting problem of the GCN.
d) Comparative experiments are conducted on two emotion datasets, IEMOCAP [25] and RAVDESS [26], together with ablation experiments on each component of GO-GAGCN. The results show that GO-GAGCN is more competitive and that each of its components is reasonable and effective.
1 Proposed model
As shown in Fig. 1, the proposed GO-GAGCN model can be roughly divided into four stages: feature extraction, graph construction, graph optimization and label prediction.
1.1 Feature extraction
1.1.1 Acoustic feature extraction
1.1.2 Text feature extraction
1.2 Graph construction
1.3 Graph optimization
The acoustic features of speech and the corresponding text features exhibit strong commonality and complementarity in expressing the emotion contained in speech. This paper uses the topological structure of the graph built from one modality to optimize the other. Specifically, since GloVe is a pre-trained model and the text features it extracts are robust and have low ambiguity, the topological structure of the graph built from the text features is used to optimize the graph structure of the acoustic modality. The basic idea of the proposed graph optimization is to use the text features optimized by a GCN to reconstruct both the adjacency matrix based on the acoustic features and the adjacency matrix based on the text features, and to constrain the GCN training with a dual reconstruction loss so that the adjacency matrix based on the acoustic features carries text information. The two reconstructed adjacency matrices are then fused, and likewise the optimized text features are fused with the original acoustic features, yielding the final fused graph G=(V,A).
This module consists of two parts: text feature optimization with adjacency matrix reconstruction, and multi-modal feature fusion; a minimal sketch of the overall idea is given below.
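The sketch assumes a single GCN layer to refine the text features, two inner-product decoders to reconstruct the acoustic-based and text-based adjacency matrices from the refined text features, a dual reconstruction loss, and fusion by simple averaging and concatenation. The layer sizes, decoders and fusion operators are illustrative choices, not the exact formulation of Sections 1.3.1 and 1.3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCNLayer(nn.Module):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        adj = adj + torch.eye(adj.size(0), device=adj.device)       # add self-loops
        d_inv_sqrt = adj.sum(-1).clamp(min=1e-6).pow(-0.5)          # D^-1/2
        adj_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]  # symmetric normalization
        return F.relu(adj_norm @ self.lin(x))

class GraphOptimization(nn.Module):
    """GCN-refined text features act as a shared representation; two decoders
    reconstruct the acoustic-based and text-based adjacency matrices from it,
    and a dual reconstruction loss constrains training."""
    def __init__(self, d_text: int, d_hidden: int):
        super().__init__()
        self.gcn = SimpleGCNLayer(d_text, d_hidden)
        self.dec_acoustic = nn.Linear(d_hidden, d_hidden, bias=False)
        self.dec_text = nn.Linear(d_hidden, d_hidden, bias=False)

    def forward(self, x_text, x_acoustic, a_text, a_acoustic):
        z = self.gcn(x_text, a_text)                              # optimized text features
        a_rec_a = torch.sigmoid(self.dec_acoustic(z) @ z.T)       # reconstructed acoustic adjacency
        a_rec_t = torch.sigmoid(self.dec_text(z) @ z.T)           # reconstructed text adjacency
        recon_loss = F.mse_loss(a_rec_a, a_acoustic) + F.mse_loss(a_rec_t, a_text)
        a_fused = 0.5 * (a_rec_a + a_rec_t)                       # fused adjacency A of graph G
        x_fused = torch.cat([x_acoustic, z], dim=-1)              # fused node features V of graph G
        return x_fused, a_fused, recon_loss
```

Minimizing recon_loss ties the text-derived representation to both topologies, which is how the adjacency built from the acoustic modality comes to carry text information.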
1.3.1 Text feature optimization and adjacency matrix reconstruction
1.3.2 Multi-modal feature fusion
1.4 Label prediction
To predict the node labels of the fused graph G=(V,A) more accurately, this paper uses the GAGCN module for label prediction. GAGCN consists of two parts: graph augmentation (GA) and a GCN. First, graph augmentation is applied to G; then, a GCN exploits the topological structure of each augmented graph to optimize the node features globally; finally, the node-label predictions over the augmented graphs are fused to obtain the final emotion recognition result. A minimal sketch of this prediction step follows.
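The sketch below assumes that graph augmentation is realized by random node-feature dropping in the style of GRAND [23]; the concrete augmentation and classifier used in this paper are described in Sections 1.4.1 and 1.4.2, so the function and parameter names here are illustrative only.

```python
import torch
import torch.nn.functional as F

def gagcn_predict(x, adj, gcn, num_aug: int, drop_prob: float = 0.2, training: bool = True):
    """Run a GCN classifier on several augmented copies of the fused graph G=(V, A) and
    fuse the per-copy label predictions by averaging. `gcn` is any callable mapping
    (node features, adjacency) to class logits; augmentation is random node-feature
    dropping (an assumed, GRAND-style choice) applied only during training."""
    preds = []
    for _ in range(max(num_aug, 1)):
        if training and num_aug > 0:
            keep = (torch.rand(x.size(0), 1, device=x.device) > drop_prob).float()
            x_aug = x * keep / (1.0 - drop_prob)   # drop whole node features, rescale the rest
        else:
            x_aug = x
        logits = gcn(x_aug, adj)                   # global optimization of node features + prediction
        preds.append(F.softmax(logits, dim=-1))    # class probabilities for this augmented graph
    return torch.stack(preds, dim=0).mean(dim=0)   # fused emotion prediction over the augmentations
```

Averaging the per-augmentation predictions means that the number of augmentations S trades a small amount of extra computation for a more stable prediction, which matches the behavior observed in Section 2.3.1.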
1.4.1 Graph augmentation
1.4.2 Label prediction
2 Experiments
2.1 Datasets and evaluation metrics
2.1.1 Datasets
2.1.2 Evaluation metrics
2.2 Experimental setup
2.3 Experimental results
The experiments first study, on both datasets, the number of graph augmentations used in the label prediction stage, which may affect model performance, and select the best value; ablation experiments are then conducted; finally, the proposed model and the baseline models are compared in terms of recognition accuracy and model complexity.
2.3.1 Effect of the number of graph augmentations on model performance
To study the effect of the number of graph augmentations S in the label prediction stage on model performance, the recognition accuracies obtained with different numbers of augmentations are compared on the two datasets; the results are shown in Fig. 3.
The experimental results show that: a) the model performs better with graph augmentation (S > 0) than without it (S = 0), which indicates that the graph augmentation mechanism effectively improves recognition accuracy, possibly because it alleviates to some extent the overfitting problem that arises during GCN training; b) on IEMOCAP and RAVDESS, across different numbers of augmentations, WA varies by no more than 1.10% and 2.30% respectively, and UA by no more than 1.10% and 2.80%, which indicates that the number of augmentations has little influence on performance, so a small value can be chosen to limit model complexity; c) since larger WA and UA are both better, S=5 is set for IEMOCAP and S=4 for RAVDESS, which yields the best results on the two datasets, and all subsequent experiments use these values.
2.3.2 Ablation experiments
2.3.3 Model performance comparison
To verify the advantages of GO-GAGCN over the baseline models in recognition accuracy and model complexity, this experiment compares model performance on the two datasets.
1) Comparison of recognition accuracy
The experiments involve three types of baseline models: a) single-modality models, including speech-only and text-only models; b) multi-modal models, namely state-of-the-art (SOTA) speech-text multi-modal models of recent years; c) graph-neural-network-based models: to analyze the effectiveness of the graph augmentation in GAGCN, two other graph neural networks are used in the label prediction stage for comparison, (a) the conventional GCN [16], denoted GO-GCN, and (b) SelfSAGCN [24], denoted GO-SelfSAGCN, which extracts semantic features from labeled nodes via identity aggregation and obtains node features from different aspects via class-center similarity, thereby improving GCN recognition performance. The comparison results on the two datasets are shown in Tables 3 and 4, respectively.
First, for the single-modality models, Tables 3 and 4 both show that the proposed multi-modal fusion model achieves higher recognition accuracy. Taking IEMOCAP as an example, GO-GAGCN improves WA and UA by 11.05% and 11.13% over the speech-only SER model of Ref. [30], and by 8.45% and 11.63% over the text-only SER model of Ref. [31], which demonstrates the necessity of fully exploiting the commonality and complementarity of multi-modal information in the SER task.
Second, among the multi-modal models, the proposed model performs best. On IEMOCAP, GO-GAGCN improves WA and UA by 2.32% and 2.97% over the FG-CME baseline, and on RAVDESS by 2.34% and 2.80%. A likely reason is that the introduction of the GCN makes full use of the topological structure among node features to optimize the multi-modal features; compared with FG-CME, which only applies linear transformations to the multi-modal features, this exploits the commonality and complementarity between modalities more fully and thus improves performance.
Finally, compared with the other graph-neural-network models, the proposed model also performs best. Compared with GO-GCN, WA and UA improve by 1.09% and 1.11% on IEMOCAP, and by 1.94% and 1.16% on RAVDESS. Compared with GO-SelfSAGCN, WA and UA improve by 0.87% and 0.39% on IEMOCAP, and by 0.77% and 0.71% on RAVDESS. A likely reason is that the graph augmentation mechanism introduced in the label prediction stage of GO-GAGCN better alleviates the over-smoothing problem that multi-layer graph neural networks may suffer from, thereby improving performance.
2) Comparison of model complexity
To verify the advantage of the proposed model in terms of complexity, the three GCN-based models and the FG-CME baseline are compared in parameter count and time complexity on IEMOCAP and RAVDESS, as shown in Figs. 4 and 5.
The experimental results show that, first, the parameter counts and time complexities of the three GCN-based models are lower than those of FG-CME on both datasets, which demonstrates that introducing graph neural networks does reduce model complexity and facilitates real-time operation. Second, although the proposed GO-GAGCN has a larger parameter count and higher time complexity than GO-GCN and GO-SelfSAGCN on both datasets, it is still lower than FG-CME. The main reason is that GO-GAGCN introduces the graph augmentation mechanism in the label prediction stage and must be trained on the augmented graph data, which inevitably increases the parameter count and time complexity. Overall, the proposed GO-GAGCN model still holds a clear advantage.
3 Conclusion
This paper proposes GO-GAGCN, a speech emotion recognition model based on multi-modal feature fusion with graph neural networks. Graph neural networks are introduced, on the one hand, to further optimize the text features and then use the optimized text features to reconstruct the adjacency matrix of the acoustic features so that the acoustic features carry text information; on the other hand, a graph-augmented neural network optimizes the node features globally to improve emotion recognition accuracy, while the graph neural network also reduces model complexity. The effectiveness of the algorithm is verified on the public datasets IEMOCAP and RAVDESS. In addition, ablation experiments verify the positive contribution of each component of the model to its performance.
In future work, we will first consider introducing an attention mechanism into the multi-modal feature fusion to make fuller use of the complementarity among multi-modal features; second, study graph-classification-based models to address the problem that node-classification-based models must be retrained when new data are added; and finally, implement the model in hardware and embed it in embedded or wearable devices to suit different application scenarios.
References:
[1]Luo Dehu, Ran Qiwu, Yang Chao, et al. Review on speech emotion recognition research[J].Computer Engineering and Applications,2022,58(21):40-52.(in Chinese)
[2]Hansen L, Zhang Yanping, Wolf D, et al. A generalizable speech emotion recognition model reveals depression and remission[J].Acta Psychiatrica Scandinavica,2022,145(2):186-199.
[3]Tanko D, Dogan S, Demir F B, et al. Shoelace pattern-based speech emotion recognition of the lecturers in distance education:ShoePat23[J].Applied Acoustics,2022,190:108637.
[4]Hsieh Y H, Chen S C. A decision support system for service recovery in affective computing: an experimental investigation[J].Knowledge and Information Systems,2020,62:2225-2256.
[5]Tan Liang, Yu Keping, Lin Long, et al. Speech emotion recognition enhanced traffic efficiency solution for autonomous vehicles in a 5G-enabled space-air-ground integrated intelligent transportation system[J].IEEE Trans on Intelligent Transportation Systems,2021,23(3):2830-2842.
[6]Neiberg D, Elenius K, Laskowski K. Emotion recognition in spontaneous speech using GMMs[C]//Proc of the 9th International Conference on Spoken Language Processing.2006:809-812.
[7]Nwe T L, Foo S W, De Silva L C. Speech emotion recognition using hidden Markov models[J].Speech Communication,2003,41(4):603-623.
[8]Schuller B, Reiter S, Muller R, et al. Speaker independent speech emotion recognition by ensemble classification[C]//Proc of IEEE International Conference on Multimedia and Expo.Piscataway,NJ:IEEE Press,2005:864-867.
[9]Mao Shuiyang, Ching P C, Lee T. Deep learning of segment-level feature representation with multiple instance learning for utterance-level speech emotion recognition[C]//Proc of the 20th Annual Conference of International Speech Communication Association.2019:1686-1690.
[10]Issa D, Demirci M F, Yazici A. Speech emotion recognition with deep convolutional neural networks[J].Biomedical Signal Processing and Control,2020,59:101894.
[11]Zeng Yuni, Mao Hua, Peng Dezhong, et al. Spectrogram based multi-task audio classification[J].Multimedia Tools and Applications,2019,78(3):3705-3722.
[12]Feng Han, Ueno S, Kawahara T. End-to-end speech emotion recognition combined with acoustic-to-word ASR model[C]//Proc of the 21st Annual Conference of the International Speech Communication Association.2020:501-505.
[13]Sarma M, Ghahremani P, Povey D, et al. Emotion identification from raw speech signals using DNNs[C]//Proc of the 19th Annual Conference of the International Speech Communication Association.2018:3097-3101.
[14]Satt A, Rozenberg S, Hoory R. Efficient emotion recognition from speech using deep learning on spectrograms[C]//Proc of the 18th Annual Conference of the International Speech Communication Association.2017:1089-1093.
[15]Krishna D N, Patil A. Multimodal emotion recognition using cross-modal attention and 1D convolutional neural networks[C]//Proc of the 21st Annual Conference of the International Speech Communication Association.2020:4243-4247.
[16]Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[C]//Proc of the 5th International Conference on Learning Representations.2017.
[17]Shirian A, Guha T. Compact graph architecture for speech emotion recognition[C]//Proc of IEEE International Conference on Acoustics,Speech and Signal Processing.Piscataway,NJ:IEEE Press,2021:6284-6288.
[18]Liu Jiawang, Wang Haoxiang. Graph isomorphism network for speech emotion recognition[C]//Proc of the 22nd Annual Conference of the International Speech Communication Association.2021:3405-3409.
[19]Yoon S, Byun S, Jung K. Multimodal speech emotion recognition using audio and text[C]//Proc of IEEE Spoken Language Technology Workshop.Piscataway,NJ:IEEE Press,2018:112-118.
[20]Li Hang, Ding Wenbiao, Wu Zhongqin, et al. Learning fine-grained cross modality excitement for speech emotion recognition[C]//Proc of the 22nd Annual Conference of the International Speech Communication Association.2021:3375-3379.
[21]Xu Haiyang, Zhang Hui, Han Kun, et al. Learning alignment for multimodal emotion recognition from speech[C]//Proc of the 20th Annual Conference of International Speech Communication Association.2019:3569-3573.
[22]Liu Pengfei, Li Kun, Meng Helen. Group gated fusion on attention-based bidirectional alignment for multimodal emotion recognition[C]//Proc of the 21st Annual Conference of the International Speech Communication Association.2020:379-383.
[23]Feng Wenzheng, Zhang Jie, Dong Yuxiao, et al. Graph random neural networks for semi-supervised learning on graphs[J].Advances in Neural Information Processing Systems,2020,33:22092-22103.
[24]Yang Xu,Deng Cheng,Dang Zhiyuan,et al. SelfSAGCN: self-supervised semantic alignment for graph convolution network[C]//Proc of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway,NJ:IEEE Press,2021:16775-16784.
[25]Busso C,Bulut M,Lee C C,et al.IEMOCAP: interactive emotional dyadic motion capture database[J].Language Resources and Evaluation,2008,42(4):335-359.
[26]Livingstone S R,Russo F A. The Ryerson audio-visual database of emotional speech and song (RAVDESS):a dynamic,multimodal set of facial and vocal expressions in North American English[J].PLoS One,2018,13(5):e0196391.
[27]Pennington J,Socher R,Manning C D. GloVe: global vectors for word representation[C]//Proc of Conference on Empirical Methods in Natural Language Processing.2014:1532-1543.
[28]Mirsamadi S,Barsoum E,Zhang Cha. Automatic speech emotion recognition using recurrent neural networks with local attention[C]//Proc of IEEE International Conference on Acoustics,Speech and Signal Processing.Piscataway,NJ:IEEE Press,2017:2227-2231.
[29]Ma Xi,Wu Zhiyong,Jia Jia,et al. Speech emotion recognition with emotion-pair based framework considering emotion distribution information in dimensional emotion space[C]//Proc of the 18th Annual Conference of the International Speech Communication Association.2017:1238-1242.
[30]Liu Jiawang,Wang Haoxiang,Sun Mingze,et al. Graph based emotion recognition with attention pooling for variable-length utterances[J].Neurocomputing,2022,496:46-55.
[31]Makiuchi M R,Uto K,Shinoda K. Multimodal emotion recognition with high-level speech and text features[C]//Proc of IEEE Automatic Speech Recognition and Understanding Workshop.Piscataway,NJ:IEEE Press,2021:350-357.