Pig continuous cough sound recognition based on continuous speech recognition technology

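2019-05-11 06:13:44 Liu Wanghong, Lei Minggang, Tan Hequn

Transactions of the Chinese Society of Agricultural Engineering, 2019, Issue 6

Keywords: model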

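Li Xuan1,2, Zhao Jian1,2, Gao Yun1,2, Liu Wanghong2,3, Lei Minggang2,3, Tan Hequn1,2

(1. College of Engineering, Huazhong Agricultural University, Wuhan 430070; 2. Cooperative Innovation Center for Sustainable Pig Production, Wuhan 430070; 3. College of Animal Science and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070)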

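Existing pig cough recognition methods are based on isolated-word recognition, which handles only a limited set of sound types and cannot capture the continuous coughing of diseased pigs. To address this, this paper proposes building a pig sound acoustic model with a bidirectional long short-term memory-connectionist temporal classification (BLSTM-CTC) model to recognize continuous pig coughs in the pig-farm environment, in support of early warning and diagnosis of pig respiratory diseases. The time-domain characteristics (duration and energy) of individual cough samples from Landrace pigs with a body mass of about 75 kg were studied, and a threshold range was established: sample duration of 0.24 to 0.74 s and energy greater than 40.15 V²·s. Within this threshold range, a single-parameter double-threshold endpoint detection algorithm was applied to 30 h of pig-farm sound that had been processed by a multitaper-spectrum-based psychoacoustic speech enhancement algorithm, yielding 222 experimental utterances. Sounds in the pig-farm environment were divided into pig coughs and non-pig-coughs, and these two classes served as the acoustic modeling units for labeling the corpus. 26-dimensional Mel frequency cepstral coefficients (MFCC) were extracted as the feature parameters of the experimental utterances. A BLSTM network learned the temporal patterns of continuous pig sounds, and CTC was used to realize an end-to-end continuous pig sound recognition system. In 5-fold cross-validation, the average pig cough recognition rate reached 92.40%, the error recognition rate was 3.55%, and the total recognition rate reached 93.77%. In addition, an application test on 1 h of corpus outside the dataset gave a pig cough recognition rate of 94.23%, an error recognition rate of 9.09% and a total recognition rate of 93.24%, indicating that the BLSTM-CTC pig cough recognition model based on continuous speech recognition technology is stable and reliable. This study can provide a reference for the recognition of continuous pig coughs and for disease diagnosis in healthy pig production.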

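signal processing; acoustic signal; recognition; pig industry; continuous cough; bidirectional long short-term memory-connectionist temporal classification; acoustic model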

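0 Introduction

At present, pork accounts for the largest share of market demand among all animal meats [1]. However, as the pig industry scales up, respiratory diseases seriously threaten the quality and yield of pork, and monitoring pig coughs makes it possible to detect respiratory diseases in time [2-4]. The current practice on pig farms is for staff to monitor coughs on site, which is costly in labor and cannot guarantee a satisfactory recognition rate. This paper studies the automatic recognition of pig coughs in the pig-farm environment based on speech recognition technology, in order to promote healthy pig production [5].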

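In research on the time-domain characteristics of pig coughs, Mitchell et al. [3] studied the coughs of Belgian Landrace × Duroc crossbred pigs and found that the cough durations of sick and healthy pigs were 0.3 and 0.21 s, respectively. Sara et al. [4] studied the coughs of Landrace × Large White crossbred pigs and found durations of 0.67 and 0.43 s for sick and healthy pigs, respectively. Cough duration is therefore related to both the health status and the breed of the pig. In addition, Cordeiro et al. [1] stimulated pigs with different cold and hot environments and found that sounds made by pigs under stress lasted longer than 1.02 s; they used this threshold as the decision criterion of a decision tree algorithm to judge whether a pig was under stress.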

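In research on pig cough recognition, Exadaktylos et al. [6] used the fuzzy C-means clustering algorithm to recognize pig coughs, with a total recognition rate of 85%. Also using fuzzy C-means clustering, Hirtum et al. [7] achieved a recognition rate of 92% with an error rate of 21%, and Xu Yani et al. [8] achieved a recognition rate of 83.4%. Guarino et al. [9] used the dynamic time warping (DTW) algorithm and reached a recognition rate of 85.5%. Liu Zhenyu et al. [10] used a hidden Markov model (HMM) and reached a recognition rate of 80.0%. Li Xuan et al. [11] recognized pig coughs with deep belief nets (DBN), reaching a pig cough recognition rate of 95.80%, an error recognition rate of 6.83% and a total recognition rate of 94.29%.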

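All of this earlier work is based on isolated-word recognition of pig coughs and considers only a limited range of non-cough sounds, so the resulting models cannot classify other pig-farm sounds they have not learned, which limits their practicality. Moreover, a diseased pig typically coughs several times in succession in each coughing episode [12-13], so recognizing continuous coughs reflects the pig's disease status better.

To date, no studies on continuous pig sound recognition have been reported in China or abroad, but a growing number of researchers have applied continuous speech recognition technology to the sounds of other animals by building acoustic models. The acoustic model is a key component of a continuous speech recognition system; choosing suitable acoustic modeling units makes it convenient to describe the physical variation of the speech signal. Milone et al. [14] built an acoustic model of cattle ingestive sounds and recognized continuous ingestive sounds of cattle. Similarly, Reby et al. [15] recognized continuous deer sounds, Milone et al. [16] recognized continuous ingestive sounds of sheep, and Trifa et al. [17] recognized continuous antbird sounds.

This paper therefore studies the recognition of continuous pig coughs. A bidirectional long short-term memory (BLSTM) network [18-20] learns the features of continuous pig sounds, and connectionist temporal classification (CTC) [21] directly models the alignment distribution between the input continuous pig sound sequence and its labels. Together they form an end-to-end [22-23] continuous pig cough recognition system, with the aim of providing a methodological reference for the recognition of continuous pig coughs and for disease diagnosis in healthy pig production.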

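1 Pig sound acquisition and feature parameter extraction

1.1 Pig sound acquisition

Pig sounds were collected at the premium pig farm affiliated with Huazhong Agricultural University, using a 美博 M66 voice recorder (sampling frequency 48 kHz). Collection took place in March and April 2016, a period of marked temperature change in which pig diseases are prevalent. The subjects were 10 Landrace pigs with a body mass of about 75 kg, housed 5 per pen in two adjacent pens. A veterinarian diagnosed 5 of the 10 pigs as infected with respiratory disease, with obvious coughing. The recorder was fixed to the pig-house wall between the two pens, 1.5 m above the ground, and recorded the pig-farm environment continuously for 24 h a day. From the recordings, a total of 30 h of signal from periods of frequent coughing was retained for the experiments.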

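1.2 Pig sound denoising

Environmental noise on a pig farm is complex, and excessive noise harms both the subsequent endpoint detection and the recognition of pig sounds. This paper uses the multitaper-spectrum-based psychoacoustic speech enhancement algorithm [11] to denoise the continuous pig sounds. Fig. 1 compares the waveforms of an 8.50 s continuous pig cough segment before and after speech enhancement. Fig. 1b shows that the noise in the continuous pig sound signal is clearly reduced, and listening tests found almost no distortion in the pig sound samples.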

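1.3 Time-domain characteristics of pig coughs

The continuous sound collected on the pig farm contains many kinds of sounds. Pig sounds mainly include coughing, sneezing, eating, screaming, grunting and ear shaking; environmental noise mainly includes dog barking, metal clanging, ventilation fan noise and other sounds. These sounds differ clearly from pig coughs in time-domain characteristics such as duration and energy. Drawing on earlier studies [3-4] of the duration and energy of pig sounds, this paper examines the duration and energy of individual pig cough samples in this experiment.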

Fig. 1 Waveforms of continuous pig cough sound before and after speech enhancement

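The duration $dur$ of a pig cough sample $x(n)$ after framing is computed as

$$dur = \frac{(fn - 1) \cdot inc + wlen}{F} \quad (1)$$

where $fn$ is the total number of frames of the cough sample after framing; $wlen$ is the frame length, set to 25 ms according to the short-time stationarity of sound signals; $inc$ is the frame shift, set to 40% of the frame length; and $F$ is the sampling frequency, Hz. With $wlen$ and $inc$ counted in samples, $dur$ is in seconds.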

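Let the $i$-th frame of the pig cough sample $x(n)$ after framing be denoted $y_i(n)$. The energy $E$ of the pig cough sample $x(n)$ is then computed as

$$E = \sum_{i=1}^{fn} \sum_{n=1}^{wlen} y_i^2(n) \quad (2)$$

where $n$ is the index of the sampling point.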
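The two time-domain quantities can be illustrated with a minimal NumPy sketch. It assumes the frame length and frame shift are counted in samples, and it assumes the energy is the sum of squared amplitudes over all frames, since the printed formula did not survive in this copy. The helper names (`frame_signal`, `duration_s`, `sample_energy`) and the synthetic signal are hypothetical.

```python
import numpy as np

def frame_signal(x, wlen, inc):
    """Split a 1-D signal into frames of wlen samples shifted by inc samples."""
    fn = 1 + (len(x) - wlen) // inc                      # number of complete frames
    idx = np.arange(wlen)[None, :] + inc * np.arange(fn)[:, None]
    return x[idx]                                        # shape (fn, wlen)

def duration_s(fn, wlen, inc, fs):
    """Duration covered by fn frames, in seconds: ((fn - 1) * inc + wlen) / fs."""
    return ((fn - 1) * inc + wlen) / fs

def sample_energy(frames):
    """Assumed energy definition: sum of squared amplitudes over all frames."""
    return float(np.sum(frames ** 2))

fs = 48_000                       # sampling frequency stated in the text, Hz
wlen = round(0.025 * fs)          # 25 ms frame -> 1200 samples
inc = round(0.4 * wlen)           # frame shift of 40 % of the frame -> 480 samples
x = np.ones(wlen + 9 * inc)       # synthetic signal that yields exactly 10 frames
frames = frame_signal(x, wlen, inc)
dur = duration_s(frames.shape[0], wlen, inc, fs)
energy = sample_energy(frames)
```

Because consecutive frames overlap by 60%, this energy counts overlapping samples more than once; a non-overlapping definition would give a smaller value.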

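Using the Direct Splitter audio processing software, 597 pig cough samples were cut at random from the recordings. The duration and energy of each sample were computed with Eqs. (1) and (2), and the maximum and minimum values were obtained. The results are listed in Table 1.

Table 1 Time-domain analysis results of individual pig cough samples

Table 1 shows that the cough durations of the Landrace pigs in this experiment ranged from 0.24 to 0.74 s, similar to earlier results [3-4]. Both the intensity of the cough and the distance between the pig and the recorder affect the energy of a cough sample. A higher-energy sample indicates a more violent cough and a pig closer to the recorder, so only the lower limit is used for the energy threshold.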

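Endpoint detection of pig sound signals means locating the start and end points of every sound sample in a continuous signal that contains pig sounds; the signal between a start point and an end point is defined as a valid signal. The short-time-energy-based single-parameter double-threshold endpoint detection algorithm of [11] was used to detect the valid signals in the 30 h of continuous pig-farm sound. Each detected sound sample was checked against the threshold range set by the upper and lower duration limits and the lower energy limit in Table 1, and samples outside this range were discarded. This yielded 222 experimental utterances, the longest 9.14 s and the shortest 3.91 s. The 222 utterances contain 1 145 sound samples in total, of which 751 are pig cough samples and 394 are non-pig-cough samples.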
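The threshold check applied to each detected sample can be sketched as follows. This is only the filtering step, not the endpoint detection algorithm of [11]; the function name `keep_sample` and the candidate values are hypothetical, and the limits are the duration and energy bounds quoted in the text.

```python
DUR_MIN_S, DUR_MAX_S = 0.24, 0.74   # duration limits from the text, s
ENERGY_MIN = 40.15                  # lower energy limit from the text, V^2*s

def keep_sample(duration_s, energy):
    """Keep a detected sample only if it lies inside the cough threshold range."""
    return DUR_MIN_S <= duration_s <= DUR_MAX_S and energy >= ENERGY_MIN

# Hypothetical (duration, energy) pairs for four detected samples.
candidates = [(0.30, 120.0), (0.10, 300.0), (0.50, 12.0), (0.90, 500.0)]
kept = [c for c in candidates if keep_sample(*c)]
```

Only the first candidate passes: the second is too short, the third too weak and the fourth too long.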

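With the help of a veterinarian, the 222 utterances were labeled manually to obtain the corresponding label sequences. The two acoustic modeling units, pig cough and non-pig-cough, are denoted by the symbols 'k' and 'n', respectively.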

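1.4 Feature parameter extraction

Mel frequency cepstral coefficient (MFCC) analysis [24-25] is based on the mechanism of human hearing. The linear spectrum is mapped to the nonlinear Mel spectrum, and the spectral characteristics of a sound are analyzed according to the results of human hearing experiments. Each continuous pig sound utterance is framed and windowed, and its spectral energy is computed with the fast Fourier transform. The spectrum is then passed through a Mel filter bank, and the logarithm of the filter outputs gives the Mel filter energies. Applying the discrete cosine transform yields 13-dimensional MFCCs, which reflect the static characteristics of pig sounds. Finally, first-order difference coefficients, which reflect the dynamic characteristics of pig sounds, are appended to give 26-dimensional MFCCs. The steps of the MFCC extraction process are shown in Fig. 2.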
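The extraction steps can be sketched in NumPy as below. This is an illustrative pipeline rather than the authors' implementation: the Hamming window, the 26-filter Mel bank, the FFT size and the use of `np.gradient` for the first-order difference are all assumptions, and the input is random noise standing in for a recording.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters evenly spaced on the Mel scale, shape (n_filters, n_fft//2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                    # rising edge
            fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):                   # falling edge
            fb[m - 1, k] = (right - k) / (right - center)
    return fb

def mfcc_with_delta(x, fs, wlen, inc, n_filters=26, n_ceps=13):
    n_fft = 1 << (wlen - 1).bit_length()                 # next power of two >= wlen
    fn = 1 + (len(x) - wlen) // inc
    idx = np.arange(wlen)[None, :] + inc * np.arange(fn)[:, None]
    frames = x[idx] * np.hamming(wlen)                   # framing and windowing
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_filters))
    static = log_mel @ dct.T                             # 13 static coefficients
    delta = np.gradient(static, axis=0)                  # first-order difference over time
    return np.hstack([static, delta])                    # 26 coefficients per frame

fs = 48_000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)                              # 1 s of noise as stand-in audio
feats = mfcc_with_delta(x, fs, wlen=1200, inc=480)
```

Each row of `feats` is the 26-dimensional feature vector of one frame, which is the per-frame input the recognition network consumes.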

Fig. 2 MFCC feature parameter extraction steps

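2 Continuous pig cough recognition

2.1 BLSTM network model

Unlike feedforward neural networks [26-27], whose hidden-layer neurons are not connected to one another, a recurrent neural network (RNN) allows self-feedback paths among hidden-layer neurons. The input to an RNN hidden layer includes not only the pig sound features from the input layer but also the output of the hidden neurons at the previous time step. This structure lets the model remember earlier information and use it when computing the current output. Although an RNN is in theory well suited to modeling sequences such as speech, it suffers from exploding and vanishing gradients as the sequence grows longer [20,28]. LSTM is a special kind of RNN. By introducing memory cells and a gating mechanism it can learn historical information and control how fast information accumulates, which alleviates these problems of the RNN to some extent. The LSTM unit is shown in Fig. 3.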

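As Fig. 3 shows, an LSTM unit consists mainly of 4 parts: the memory cell, the input gate, the output gate and the forget gate. In an LSTM network the memory cells are connected to one another, and 3 nonlinear gating units regulate the information flowing into and out of the memory cell (dashed connections in Fig. 3). The input gate controls which information is written into the memory cell. It reads the memory cell output at the previous time step, $h_{t-1}$, and the current input, $x_t$, and outputs a value between 0 and 1. $i_t$ denotes the fraction of information to be input, where 0 means discard everything and 1 means input everything. $i_t$ is computed as

$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \quad (3)$$

where $\sigma$ is the sigmoid function, $W_i$ is the input gate weight and $b_i$ is the input gate bias.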

Note: $i_t$ is the percentage of the input information of the input gate; $f_t$ is the percentage of the forgotten information of the forget gate; $o_t$ is the state value of the output gate; the black circle indicates the multiplication operation.

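Similarly, the forget gate controls which information in the previous memory cell state $c_{t-1}$ is forgotten. $f_t$ denotes the fraction of information to be forgotten and is computed as

$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \quad (4)$$

where $W_f$ is the forget gate weight and $b_f$ is the forget gate bias.

The memory cell state at the current time step, $c_t$, is then

$$c_t = f_t \, c_{t-1} + i_t \tanh(W_c[h_{t-1}, x_t] + b_c) \quad (5)$$

where $\tanh$ is the hyperbolic tangent function, $W_c$ is the memory cell weight and $b_c$ is the memory cell bias.

The output gate value $o_t$ controls how much information the memory cell outputs at the current time step:

$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \quad (6)$$

$$h_t = o_t \tanh(c_t) \quad (7)$$

where $W_o$ is the output gate weight, $b_o$ is the output gate bias and $h_t$ is the memory cell output at the current time step.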

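A conventional LSTM unrolls in one direction and can use only past information, whereas continuous pig cough recognition operates on the whole sound sequence: the features of the current frame are related to both the preceding and the following frames. Two independent LSTMs therefore process the continuous pig sound sequence in the forward and backward directions [29-30] (Fig. 4), and their outputs are combined and passed to the next layer of the network, so that contextual temporal information is fully exploited for acoustic modeling of continuous pig sounds.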
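A minimal NumPy sketch of one LSTM step following Eqs. (3) to (7), and of a bidirectional pass that runs two independent LSTMs and concatenates their outputs frame by frame. The weight layout, the random initialization and the hidden size of 8 are assumptions made for illustration; the paper's network uses 300 hidden units per direction and is trained, which this forward-only sketch does not cover.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(n_in, n_hid, rng):
    """Random weights for the four blocks: input gate, forget gate, cell, output gate."""
    W = rng.standard_normal((4, n_hid, n_hid + n_in)) * 0.1
    b = np.zeros((4, n_hid))
    return W, b

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    i_t = sigmoid(W[0] @ z + b[0])                         # Eq. (3)
    f_t = sigmoid(W[1] @ z + b[1])                         # Eq. (4)
    c_t = f_t * c_prev + i_t * np.tanh(W[2] @ z + b[2])    # Eq. (5)
    o_t = sigmoid(W[3] @ z + b[3])                         # Eq. (6)
    h_t = o_t * np.tanh(c_t)                               # Eq. (7)
    return h_t, c_t

def run_lstm(X, W, b):
    """Run the cell over all frames of X (shape (T, n_in)); returns outputs (T, n_hid)."""
    n_hid = b.shape[1]
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    outputs = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)

def run_blstm(X, fwd, bwd):
    h_fwd = run_lstm(X, *fwd)
    h_bwd = run_lstm(X[::-1], *bwd)[::-1]   # process reversed input, then re-align in time
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 26))           # 50 frames of 26-dim MFCC-like features
H = run_blstm(X, init_lstm(26, 8, rng), init_lstm(26, 8, rng))
```

Row `t` of `H` holds the forward output, which summarizes frames up to `t`, next to the backward output, which summarizes frames from `t` onward.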

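2.2 Connectionist temporal classification (CTC)

In a continuous speech recognition system, the CTC layer exploits the strong sequence-learning ability of the BLSTM to model the input speech features and the output labels directly [21,31], without relying on an alignment between the speech feature sequence and the label sequence. This makes end-to-end training of the acoustic model possible.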

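Note: $x_{t-1}$, $x_t$ and $x_{t+1}$ are the input-layer inputs at times $t-1$, $t$ and $t+1$; $c_{t-1}$, $c_t$ and $c_{t+1}$ are the states of the hidden-layer memory cell at times $t-1$, $t$ and $t+1$; $h_{t-1}$, $h_t$ and $h_{t+1}$ are the memory cell outputs at times $t-1$, $t$ and $t+1$; the superscripts → and ← denote forward and backward propagation, respectively.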

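The BLSTM output is the input to the CTC layer. The number of output neurons equals the number of possible labels, that is, the number of acoustic modeling units plus one extra blank label used to model silence in the output. In this system there are 3 labels, 'k', 'n' and '_', where '_' is the blank label representing the silence model. The BLSTM output can therefore describe the label probability distribution for the input continuous speech. Given a continuous input utterance $x$ of length $T$, the probability that the BLSTM model outputs label index $k$ ($k \in \{1,2,3\}$) at time $t$ is

$$\Pr(k, t \mid x) = \frac{\exp(y_t^k)}{\sum_{k'=1}^{K} \exp(y_t^{k'})} \quad (8)$$

where $y_t^k$ is the value the BLSTM network outputs for label $k$ at time $t$, i.e. the output of the corresponding output-layer neuron, $k$ is the label index output by the BLSTM model at time $t$, and $K$ is the number of labels.

Let the CTC output sequence be $\pi$. Then $\pi$ is a sequence of length $T$ drawn from the $K$ labels, and multiplying the probabilities at the $T$ time steps gives the probability of $\pi$:

$$\Pr(\pi \mid x) = \prod_{t=1}^{T} \Pr(\pi_t, t \mid x) \quad (9)$$

In practice, each true label sequence $l$ corresponds to many CTC output sequences $\pi$. Define the mapping from $\pi$ to $l$ as $l = B(\pi)$, which converts $\pi$ into $l$ by merging repeated labels and removing blank labels [23]. For example, for a continuous pig sound signal with $T = 8$ whose true label sequence is (n, k, n, k), the CTC output sequence could be (n, _, k, k, k, n, k, k) or (n, n, _, k, n, k, _, _), among others. It follows that

$$\Pr(l \mid x) = \sum_{\pi \in B^{-1}(l)} \Pr(\pi \mid x) \quad (10)$$

This sum over all CTC output sequences that map to $l$ can be computed and differentiated by dynamic programming with the forward-backward algorithm [18]. If $l^*$ is the label sequence of the continuous input utterance, the goal of CTC training is to maximize the probability that the BLSTM network outputs $l^*$, which is the same as minimizing the negative logarithm of that probability. The loss function is set as

$$L_{CTC} = -\ln \Pr(l^* \mid x) \quad (11)$$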
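The mapping from CTC output sequences to label sequences, and the negative log-probability used as the loss, can be sketched as follows. The forward recursion is the standard CTC dynamic program, and a brute-force sum over every possible output sequence is included only as a cross-check of the definition. Function names, the label indexing (k = 0, n = 1, blank = 2) and the random probabilities are assumptions; the linear-domain arithmetic is adequate only for short sequences like this one.

```python
import itertools
import numpy as np

def collapse(path, blank):
    """Map a CTC output sequence to a label sequence: merge repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

def ctc_nll(probs, target, blank):
    """-ln Pr(target | x) via the CTC forward recursion; probs has shape (T, K)."""
    T = probs.shape[0]
    ext = [blank]
    for s in target:
        ext += [s, blank]                      # blank-extended target, length 2L + 1
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # a blank may be skipped only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    total = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -np.log(total)

def ctc_nll_brute_force(probs, target, blank):
    """Same quantity by summing the probability of every path that collapses to target."""
    T, K = probs.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == list(target):
            total += np.prod([probs[t, s] for t, s in enumerate(path)])
    return -np.log(total)

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 3))                            # T = 6 frames, K = 3 labels
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
target = [1, 0, 1, 0]                                           # (n, k, n, k)
nll_fast = ctc_nll(probs, target, blank=2)
nll_slow = ctc_nll_brute_force(probs, target, blank=2)
```

The brute-force version enumerates all $K^T$ sequences, so its cost grows exponentially with the number of frames, while the forward recursion needs only on the order of $T \cdot L$ operations for a target of length $L$.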

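2.3 Continuous pig cough recognition system

Fig. 5 shows the continuous pig cough recognition system based on the BLSTM-CTC acoustic model. The feature parameters of continuous pig sounds are first fed to the BLSTM, whose strong acoustic modeling ability is used to learn the features of the input speech. The network then outputs the label probability distribution corresponding to the features of the utterance. This distribution is the input to the CTC layer, which uses the label sequence of the original utterance to compute the model loss and thereby train the whole acoustic model.

Fig. 5 Block diagram of the continuous pig cough recognition system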

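The trained system can be applied to recognize continuous pig sound utterances. During testing it outputs a probability matrix with $T$ rows and $K$ columns, which gives the label probability distribution produced by the system for the input frame at every time step. The beam search algorithm [32] decodes this matrix into the most probable output sequence, which is the recognition result.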
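Decoding can be illustrated with a small CTC prefix beam search. This is a generic sketch without a language model, not the authors' decoder; the function name, the beam width and the two-frame probability matrix are hypothetical, and labels follow the text's symbols 'k', 'n' and '_'.

```python
from collections import defaultdict

def ctc_beam_search(probs, labels, blank, beam_width=10):
    """probs: T rows of K label probabilities. Returns (best label string, its probability)."""
    # For each prefix keep two masses: paths ending in blank, paths ending in a non-blank.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for k, p in enumerate(frame):
                if k == blank:
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == k:
                    nxt[prefix][1] += p_nb * p            # repeated label merges into prefix
                    nxt[prefix + (k,)][1] += p_b * p      # blank in between: a new label
                else:
                    nxt[prefix + (k,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {pre: tuple(mass) for pre, mass in ranked[:beam_width]}
    best, (p_b, p_nb) = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return "".join(labels[k] for k in best), p_b + p_nb

# Two frames; columns are 'k', 'n', '_' (blank index 2).
probs = [[0.4, 0.0, 0.6],
         [0.4, 0.0, 0.6]]
text, prob = ctc_beam_search(probs, labels="kn_", blank=2)
```

In this example the single most likely frame-by-frame path is two blanks (probability 0.36), yet the label sequence "k" is more probable overall (0.16 + 0.24 + 0.24 = 0.64) because three different paths collapse to it. Summing over paths per prefix is what distinguishes this search from picking the best label in each frame.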

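3 Experiments and analysis of results

3.1 Experimental design

The performance of the BLSTM-CTC continuous pig cough recognition model was evaluated with 5-fold cross-validation. The dataset of 222 utterances was divided into 5 mutually exclusive subsets of approximately equal size. Each time, the union of 4 subsets served as the training set and the remaining subset as the test set, giving 5 training/test pairs and therefore 5 rounds of training and testing.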
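The split can be sketched in plain Python. The shuffling seed and the strided assignment of utterances to folds are assumptions; the text states only that the 222 utterances form 5 disjoint subsets of nearly equal size.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint, near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(222))
fold_sizes = [len(test) for _, test in splits]
```

With 222 utterances the folds hold 45, 45, 44, 44 and 44 items, so each round trains on 177 or 178 utterances and tests on the rest.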

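3.2 Evaluation metrics

In continuous speech recognition systems that use recognition primitives as acoustic modeling units, the word error rate (WER) [33] is generally used as the evaluation metric. The recognition result is compared with the label sequence of the test utterance, the numbers of substitution errors ($S$), insertion errors ($I$) and deletion errors ($D$) are summed, and the sum is divided by the total number of samples $N$ in the test corpus:

$$WER = \frac{S + I + D}{N} \times 100\% \quad (12)$$

The 3 error types are illustrated by the following example: the label sequence is (n, , k, k, n, n) and the recognition result is (n, k, n, _). Comparing the two, the first pig cough in the recognition result is an insertion error, the second non-pig-cough is a substitution error, and the last non-pig-cough in the label sequence was not recognized, which is a deletion error. Because this paper focuses on recognizing continuous pig coughs, only the substitution errors of non-pig-coughs are considered, and the insertion and deletion errors of non-pig-coughs are ignored. A modified WER is therefore used to evaluate the system. The pig cough recognition rate $R_k$, the error recognition rate $R_n$ and the total recognition rate $R_{total}$ are computed as

$$R_k = \left(1 - \frac{S_k + I_k + D_k}{N_k}\right) \times 100\% \quad (13)$$

$$R_n = \frac{S_n}{N_n} \times 100\% \quad (14)$$

$$R_{total} = \left(1 - \frac{S_k + I_k + D_k + S_n}{N_k + N_n}\right) \times 100\% \quad (15)$$

where $S_k$, $I_k$, $D_k$, $N_k$, $S_n$ and $N_n$ are, respectively, the number of pig coughs recognized as non-pig-coughs, the number of inserted pig coughs, the number of deleted pig coughs, the number of pig coughs in the test set, the number of non-pig-coughs recognized as pig coughs, and the number of non-pig-coughs in the test set.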
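The three rates can be checked numerically with a short sketch. The printed formulas did not survive in this copy, so the expressions below are an assumption; they were chosen because they reproduce the figures reported for the application test in Section 3.4 (52 coughs, 22 non-coughs, one substitution, one insertion and one deletion for coughs, two substitutions for non-coughs). The function and argument names are hypothetical.

```python
def cough_metrics(s_k, i_k, d_k, n_k, s_n, n_n):
    """Return (cough recognition rate, error recognition rate, total recognition rate) in %."""
    r_cough = (1 - (s_k + i_k + d_k) / n_k) * 100            # coughs handled correctly
    r_err = s_n / n_n * 100                                  # non-coughs taken for coughs
    r_total = (1 - (s_k + i_k + d_k + s_n) / (n_k + n_n)) * 100
    return r_cough, r_err, r_total

# Counts reported for the 1 h application test in Section 3.4.
rates = cough_metrics(s_k=1, i_k=1, d_k=1, n_k=52, s_n=2, n_n=22)
print([round(r, 2) for r in rates])   # [94.23, 9.09, 93.24]
```

The output matches the 94.23%, 9.09% and 93.24% reported for the application test.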

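3.3 Parameter settings and analysis of results

After comparing multiple trials, the number of hidden-layer neurons in the BLSTM forward and backward passes and the number of neurons in the fully connected layer were all set to 300, the learning rate to 0.001, and the maximum number of training iterations to 200. The 5-fold cross-validation results are listed in Table 2.

Table 2 5-fold cross-validation results of continuous pig cough recognition

The 5 groups of results in Table 2 show that the pig cough recognition rate and the total recognition rate of every group reached at least 90.00%, and the error recognition rate stayed within 8.00%. The averages over the 5 folds were a pig cough recognition rate of 92.40%, an error recognition rate of 3.55% and a total recognition rate of 93.77%. The continuous pig cough recognition system based on the BLSTM-CTC acoustic model is therefore stable and effective.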

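3.4 Algorithm application test

To test the continuous pig cough recognition model in application, a further 1 h of pig-farm environment corpus was used. Speech enhancement was applied first, and the threshold-based endpoint detection algorithm then produced a test set of 14 utterances, the longest 8.51 s and the shortest 3.56 s. The 14 utterances contain 74 sound samples, of which 52 are pig cough samples and 22 are non-pig-cough samples. The 14 test utterances were labeled manually at the utterance level, feature parameters were extracted, and the model obtained from group 2 in Table 2 was used for the test. For pig coughs there was 1 substitution error, 1 insertion error and 1 deletion error; for non-pig-coughs there were 2 substitution errors. The resulting pig cough recognition rate was 94.23%, the error recognition rate 9.09% and the total recognition rate 93.24%. The test shows that the model also achieves good recognition on samples outside the training and test datasets, and that it is stable and reliable.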

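4 Conclusions

This paper proposes a method for recognizing continuous pig coughs in the pig-farm environment. Compared with isolated-word recognition, the method can recognize more kinds of pig-farm sounds, reflects the pig's disease status better, and simplifies corpus processing, feature extraction and recognition.

1) An acoustic model of pig sounds is proposed, built from a bidirectional long short-term memory network, which has strong temporal signal processing ability, and a connectionist temporal classification layer. With pig cough and non-pig-cough as the acoustic modeling units, the continuous corpus was labeled and an end-to-end continuous pig cough recognition system was realized.

2) In the 5-fold cross-validation experiment, the number of hidden-layer neurons in the BLSTM forward and backward passes and the number of neurons in the fully connected layer were all set to 300 and the learning rate to 0.001. The average pig cough recognition rate reached 92.40%, the error recognition rate was 3.55%, and the total recognition rate reached 93.77%. An application test on 1 h of corpus outside the dataset gave a pig cough recognition rate of 94.23%, an error recognition rate of 9.09% and a total recognition rate of 93.24%, indicating that the BLSTM-CTC continuous pig cough recognition model based on continuous speech recognition technology is stable and reliable.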

[1] Cordeiro A, Nääs I, Leitão F, et al. Use of vocalisation to identify sex, age, and distress in pig production[J]. Biosystems Engineering, 2018, 173: 57-63.

[2] Silva M, Ferrari S, Costa A, et al. Cough localization for the detection of respiratory diseases in pig houses[J]. Computers and Electronics in Agriculture, 2008, 64(2): 286-292.

[3] Mitchell S, Vasileios E, Sara F, et al. The influence of respiratory disease on the energy envelope dynamics of pig cough sounds[J]. Computers and Electronics in Agriculture, 2009, 69(1): 80-85.

[4] Sara F, Mitchell S, Marcella G, et al. Cough sound analysis to identify respiratory infection in pigs[J]. Computers and Electronics in Agriculture, 2009, 64(2): 318-325.

[5] He Dongjian, Liu Dong, Zhao Kaixuan. Review of perceiving animal information and behavior in precision livestock farming[J]. Transactions of the Chinese Society for Agricultural Machinery, 2016, 47(5): 231-244. (in Chinese with English abstract)

[6] Exadaktylos V, Silva M, Aerts J M, et al. Real-time recognition of sick pig cough sounds[J]. Computers and Electronics in Agriculture, 2008, 63(2): 207-214.

[7] Hirtum A V, Berckmans D. Fuzzy approach for improved recognition of citric acid induced piglet coughing from continuous registration[J]. Journal of Sound and Vibration, 2003, 266(3): 677-686.

[8] Xu Yani, Shen Mingxia, Yan Li, et al. Research of predelivery Meishan sow cough recognition algorithm[J]. Journal of Nanjing Agricultural University, 2016, 39(4): 681-687. (in Chinese with English abstract)

[9] Guarino M, Jans P, Costa A, et al. Field test of algorithm for automatic cough detection in pig house[J]. Computers and Electronics in Agriculture, 2008, 62(1): 22-28.

[10] Liu Zhenyu, He Xiaoyan, Sang Jing, et al. Research on pig cough sound recognition based on hidden Markov models[C]// Proceedings of the 10th Academic Symposium of the Information Technology Branch of the Chinese Association of Animal Science and Veterinary Medicine, 2015: 99-104. (in Chinese)

[11] Li Xuan, Zhao Jian, Gao Yun, et al. Recognition of pig cough sound based on deep belief nets[J]. Transactions of the Chinese Society for Agricultural Machinery, 2018, 49(3): 179-186. (in Chinese with English abstract)

[12] Chen Shengke. Analysis of pig cough and asthma and their treatment from the perspective of traditional Chinese veterinary medicine[J]. China Animal Health, 2015, 17(3): 22-23. (in Chinese)

[13] Chen Runsheng. Differential diagnosis of pig cough diseases[J]. Modern Agricultural Science and Technology, 2016(14): 269-270. (in Chinese)

[14] Milone D H, Galli J R, Cangiano C A, et al. Automatic recognition of ingestive sounds of cattle based on hidden Markov models[J]. Computers and Electronics in Agriculture, 2012, 87(3): 51-55.

[15] Reby D, Andreobrecht R, Galinier A, et al. Cepstral coefficients and hidden Markov models reveal idiosyncratic voice characteristics in red deer (Cervus elaphus) stags[J]. Journal of the Acoustical Society of America, 2006, 120(6): 4080-4089.

[16] Milone D H, Rufiner H L, Galli J R, et al. Computational method for segmentation and classification of ingestive sounds in sheep[J]. Computers and Electronics in Agriculture, 2009, 65(2): 228-237.

[17] Trifa V M, Kirschel A N, Taylor C E, et al. Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models[J]. Journal of the Acoustical Society of America, 2008, 123(4): 2424-2431.

[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.

[19] Chen Yingyi, Cheng Qianqian, Fang Xiaomin, et al. Principal component analysis and long short-term memory neural network for predicting dissolved oxygen in water for aquaculture[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2018, 34(17): 183-191. (in Chinese with English abstract)

[20] Bengio Y, Frasconi P, Simard P. The problem of learning long-term dependencies in recurrent networks[C]// IEEE International Conference on Neural Networks. IEEE, 1993: 1183-1188.

[21] Wang Zhichao, Zhang Pengyuan, Pan Jielin, et al. Optimization of acoustic modeling method with connectionist temporal classification criterion[J]. Acta Acustica, 2018, 43(6): 984-990. (in Chinese with English abstract)

[22] Bahdanau D, Chorowski J, Serdyuk D, et al. End-to-end attention-based large vocabulary speech recognition[C]// IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016: 4945-4949.

[23] Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks[C]// International Conference on Machine Learning, 2014: 1764-1772.

[24] Chia A O, Hariharan M, Yaacob S, et al. Classification of speech dysfluencies with MFCC and LPCC features[J]. Expert Systems with Applications, 2012, 39(2): 2157-2165.

[25] Li Zhizhong, Teng Guanghui. Feature extraction for poultry vocalization recognition based on improved MFCC[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2008, 24(11): 202-205. (in Chinese with English abstract)

[26] Hinton G E. Learning multiple layers of representation[J]. Trends in Cognitive Sciences, 2007, 11(10): 428-434.

[27] LeCun Y, Bengio Y, Hinton G E. Deep learning[J]. Nature, 2015, 521: 436-444.

[28] Zhao Ming, Du Huifang, Dong Cuicui, et al. Diet health text classification based on word2vec and LSTM[J]. Transactions of the Chinese Society for Agricultural Machinery, 2017, 48(10): 202-208. (in Chinese with English abstract)

[29] Schuster M, Paliwal K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.

[30] Chen K, Huo Q. Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(7): 1185-1193.

[31] Woellmer M, Eyben F, Schuller B, et al. Spoken term detection with connectionist temporal classification: A novel hybrid CTC-DBN decoder[C]// International Conference on Acoustics Speech and Signal Processing (ICASSP). IEEE, 2010: 5274-5277.

[32] Graves A, Gomez F. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks[C]// International Conference on Machine Learning. ACM, 2006: 369-376.

[33] Abu-Khzam F N, Fernau H, Langston M A, et al. A fixed-parameter algorithm for string-to-string correction[C]// Sixteenth Symposium on Computing: the Australasian Theory (CATS 2010). Australian Computer Society, 2010: 31-37.

Pig continuous cough sound recognition based on continuous speech recognition technology

Li Xuan1,2, Zhao Jian1,2, Gao Yun1,2, Liu Wanghong2,3, Lei Minggang2,3, Tan Hequn1,2

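(1. College of Engineering, Huazhong Agricultural University, Wuhan 430070, China; 2. Cooperative Innovation Center for Sustainable Pig Production, Wuhan 430070, China; 3. College of Animal Science and Technology, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070, China)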

Cough is one of the most frequent symptoms in the early stage of pig respiratory diseases, so it is possible to monitor and diagnose pig diseases by detecting coughs. Existing methods for pig cough recognition are based on isolated-word recognition technology, which cannot recognize kinds of samples that were not seen in training. Another drawback is that these methods target isolated coughs, whereas the coughs of sick pigs are usually continuous. This paper aims to recognize continuous pig cough sounds based on continuous speech recognition technology. Ten Landrace pigs with a body weight of about 75 kg were used as the sound collection subjects, and pig sounds were collected on a pig farm during late winter and early spring, when pig respiratory diseases were prevalent. The sound collection devices worked continuously all day. By selecting the phases of frequent coughing in the collected signal, a total of 30 h of pig-farm sound signals was obtained as the experimental corpus. First, the sound signals were denoised by a speech enhancement algorithm based on a psychoacoustical model. Then the time-domain characteristics of individual coughs, including duration and energy, were studied. The duration of pig coughs ranged from 0.24 to 0.74 s and the energy ranged from 40.15 to 822.87 V²·s, so the threshold for sound samples was set from the duration range and the lower energy value of individual coughs. Based on this threshold range, a speech endpoint detection algorithm based on short-time energy was applied to the 30 h of pig-farm sound signals preprocessed by the speech enhancement algorithm, and 222 experimental sentences were obtained. The longest was 9.14 s and the shortest was 3.91 s. The 222 sentences contained a total of 1 145 sound samples, including 751 pig coughs and 394 non-pig coughs.
Sounds in the pig-farm environment, including cough, sneeze, eating, scream, hum and ear-shaking sounds of pigs, as well as dog barking, metal clanging and other background noise, were divided into pig cough and non-pig cough, which were chosen as the acoustic modeling units. The labels of the experimental sentences were obtained with the help of experts. Then the 13-dimensional Mel frequency cepstral coefficients (MFCC) reflecting the static characteristics of pig sound were extracted, and the first-order differential coefficients reflecting the dynamic characteristics of pig sound were added to obtain the 26-dimensional MFCC, which was used as the feature parameter of the experimental sentences. Finally, the bidirectional long short-term memory-connectionist temporal classification (BLSTM-CTC) model was selected to recognize continuous pig sounds. Specifically, the BLSTM network has excellent feature learning ability for continuous pig sounds, and the CTC can directly model the alignment between the input continuous pig sound sequence and its labels. Through the 5-fold cross-validation experiment and analysis, the numbers of hidden layer neurons in the BLSTM forward propagation process, the backward propagation process and the fully connected layer were all set to 300, and the learning rate was set to 0.001. The average recognition rate, error recognition rate and total recognition rate over the 5 groups were 92.40%, 3.55% and 93.77%, respectively. Furthermore, an algorithm application test was carried out with another 1 h of data: the recognition rate reached 94.23%, the error recognition rate was 9.09% and the total recognition rate was 93.24%. This indicates that the pig cough sound recognition model based on continuous speech recognition technology is stable and reliable. This paper provides a reference for the recognition of continuous pig cough sounds and for disease judgment during the healthy breeding of pigs.

signal processing; acoustic signal; recognition; pig industry; continuous cough; bidirectional long short-term memory-connectionist temporal classification; acoustic model

2018-11-09

2019-01-13

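National Key Research and Development Program of China (2018YFD0500700); Huazhong Agricultural University Independent Science and Technology Innovation Fund; Huazhong Agricultural University Dabeinong Young Scholars Promotion Project (2017DBN005); China Agriculture Research System (CARS-36); National Undergraduate Innovation and Entrepreneurship Training Program (201810504074)

Li Xuan, associate professor, PhD, mainly engaged in research on intelligent perception of pig information and behavior recognition. Email: lx@mail.hzau.edu.cn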

10.11975/j.issn.1002-6819.2019.06.021

TN912.34

1002-6819(2019)-06-0174-07

Li Xuan, Zhao Jian, Gao Yun, Liu Wanghong, Lei Minggang, Tan Hequn. Pig continuous cough sound recognition based on continuous speech recognition technology[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2019, 35(6): 174－180. (in Chinese with English abstract) doi：10.11975/j.issn.1002-6819.2019.06.021 http://www.tcsae.org
