
General Steganalysis Method of Compressed Speech Under Different Standards

Computers, Materials & Continua, 2021, Issue 8

Peng Liu, Songbin Li*, Qiandong Yan, Jingang Wang and Cheng Zhang

1 Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China

2 The University of Melbourne, Melbourne, VIC 3010, Australia

Abstract: Analysis-by-synthesis linear predictive coding (AbS-LPC) is widely used in a variety of low-bit-rate speech codecs. Most of the current steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are designed for a specific coding standard or category of steganography methods, and thus lack generalization capability. In this paper, a general steganalysis method for detecting steganographies in low-bit-rate compressed speech under different standards is proposed. First, the code-element matrices corresponding to different coding standards are concatenated to obtain a synthetic code-element matrix, which is mapped into an intermediate feature representation by utilizing pre-trained dictionaries. Then, bidirectional long short-term memory is employed to capture long-term contextual correlations. Finally, a code-element affinity attention mechanism is used to capture the global inter-frame context, and a full connection structure is used to generate the prediction result. Experimental results show that the proposed method is effective and outperforms the comparison methods in detecting steganographies in cross-standard low-bit-rate compressed speech.

Keywords: Cross-standard; compressed speech; steganalysis; attention

1 Introduction

Data hiding is a technique of embedding secrets into digital media imperceptibly, and different types of media data are considered for steganography, including image [1,2], text [3,4], and video [5,6]. In recent years, with the continuous growth of network bandwidth and the enhancement of network convergence, network streaming media services for communication have undergone unprecedented development. Since Voice over Internet Protocol (VoIP) technology [7,8] has been widely used for real-time communication, it has become an excellent carrier for transmitting secret information over the Internet. VoIP steganography is a means of imperceptibly embedding secret information into VoIP-based cover speech. There are many VoIP speech codecs, including G.711, G.723.1, G.726, G.728, G.729, internet Low Bitrate Codec (iLBC), and the Adaptive Multi-Rate (AMR) codec. Most of them, including G.723.1, G.729, AMR, and iLBC, are low-bit-rate speech codecs that use analysis-by-synthesis linear predictive coding (AbS-LPC) [9]. At present, most speech steganography methods utilize AbS-LPC low-bit-rate speech codecs to embed secret information for covert communication. Therefore, it is essential to develop a powerful steganalysis method to analyze low-bit-rate speech streams.

Information-hiding methods based on low-bit-rate speech streams can be divided into three categories according to the embedding position: the first category uses a pitch synthesis filter for information hiding [10-16], the second uses an LPC synthesis filter to hide information [17-22], and the third embeds information by directly modifying the values of some code elements in the compressed speech stream [23-30].

Figure 1: Difference between different levels of general steganalysis methods: (a) non-general steganalysis method; (b) C1- and (c) C2-level general steganalysis methods

The existing steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are designed for a specific coding standard or category of steganography methods, and thus lack generalization capacity. When general steganalysis is required, it is complex and time-consuming to enumerate all the steganalysis methods that correspond to the steganographic methods, which makes it difficult to meet the requirements of practical applications. In this paper, the generality of steganalysis algorithms is divided into two levels: one is generality across different steganography algorithms under the same compression standard, and the other is generality across steganography algorithms under different standards. To interpret the idea of the proposed method, the first is referred to as C1 and the second as C2. A general steganalysis algorithm of the C1 level can effectively detect different information-hiding algorithms (e.g., quantization index modulation [31]) under the same standard, such as G.729. A general steganalysis method of the C2 level can detect different information-hiding algorithms under an arbitrary standard. For example, to achieve general steganalysis of different coding standards, if non-general steganography detection methods are used, it is necessary to jointly use multiple steganalysis methods for different coding standards and different steganography methods, as shown in Fig. 1a. As demonstrated in Fig. 1b, different methods must still be combined across coding standards when using steganalysis methods of the C1 level. As shown in Fig. 1c, only one detection method of the C2 level is needed. Obviously, the ideal steganalysis method achieves C2-level generality, which is also the research focus of this paper.

Since speech signals are encoded by different encoding standards, the number of code elements (CEs) and their connotations differ considerably. Therefore, it is unrealistic to perform C2-level general steganalysis directly on the original compressed speech stream. In this paper, the compressed speech streams of different coding standards are first converted into an intermediate feature representation. Then, a classification network based on a CE affinity attention mechanism is built to accomplish steganalysis.

2 Proposed Method

The architecture of the proposed steganalysis method is illustrated in Fig. 2. It can be divided into two parts: intermediate feature representation and the steganalysis network. The intermediate feature representation is mainly used to convert compressed speech data under different coding standards into a general intermediate feature representation, and the steganalysis network performs steganalysis based on the intermediate feature. The details are described below.

Figure 2: Architecture of the proposed method. It consists of two parts: intermediate feature representation and the steganalysis network. The code elements of a speech sample are first converted to an intermediate feature representation. Then, a steganalysis network based on a code-element affinity attention module is employed to detect whether the speech contains hidden information

2.1 Intermediate Feature Representation

Assuming that one must detect $m$ types of coding standards at the same time, the CE matrix $X_i$ corresponding to the $i$th coding standard can be expressed as

$$X_i=\begin{bmatrix} x_{1,1}^{(i)} & x_{1,2}^{(i)} & \cdots & x_{1,N_i}^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{T,1}^{(i)} & x_{T,2}^{(i)} & \cdots & x_{T,N_i}^{(i)} \end{bmatrix},$$

where $N_i$ is the number of CEs in a frame corresponding to the $i$th coding standard, and $x_{T,N_i}^{(i)}$ is the value of the $N_i$th CE in frame $T$. To detect different coding standards at the same time, the CE matrices corresponding to the $m$ coding standards are concatenated to obtain a synthetic CE matrix $X$:

$$X=\left[X_1, X_2, \ldots, X_m\right],$$

where $x_{T,N_m}^{(m)}$ is the value of the $N_m$th CE in frame $T$ corresponding to the $m$th coding standard.
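As a concrete illustration, the following NumPy sketch builds a synthetic CE matrix from two hypothetical coding standards; the frame count, per-frame CE counts, and value ranges are placeholders for illustration, not figures taken from the paper.

```python
import numpy as np

T = 100                      # number of frames in the sample (hypothetical)
N = [24, 18]                 # hypothetical per-frame CE counts for m = 2 standards

# X_i has shape (T, N_i); random coded values stand in for real CE streams
X_list = [np.random.randint(0, 256, size=(T, n)) for n in N]

# Synthetic CE matrix: frames stay aligned, CE columns of all standards are joined
X = np.concatenate(X_list, axis=1)
print(X.shape)               # (100, 42)
```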

To convert the values of CEs into a form that is easy to use by the neural network, one-hot coding is utilized to map each CE into a feature vector. For a CE that occupies $n$ bits, its coded value range is $0$ to $2^n-1$. In one-hot encoding, a vector with a length of $2^n$ is used to represent this CE. If the coded value of this CE is $u$, the one-hot representation can be denoted as

$$\mathbf{h}(u)=\left[h_0, h_1, \ldots, h_{2^n-1}\right],$$

where

$$h_k=\begin{cases}1, & k=u\\ 0, & \text{otherwise.}\end{cases}$$

After one-hot coding, a group of independent CE one-hot representations is obtained; these are then aggregated in the order of the original CEs to form a long feature vector, called a multi-hot vector. This process is called multi-hot coding. Taking the example in which there are $M$ code elements in a frame, the length $L$ of the corresponding multi-hot vector can be calculated as

$$L=\sum_{i=1}^{M} 2^{d_i},$$

where $d_i$ denotes the number of bits that the $i$th CE occupies.
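A minimal sketch of one-hot and multi-hot coding for a single frame is given below; the bit widths and coded values are hypothetical.

```python
import numpy as np

def one_hot(u, n_bits):
    """One-hot vector of length 2**n_bits with a single 1 at position u."""
    v = np.zeros(2 ** n_bits, dtype=np.float32)
    v[u] = 1.0
    return v

def multi_hot(values, d):
    """Concatenate the one-hot vectors of all CEs in one frame."""
    return np.concatenate([one_hot(u, n) for u, n in zip(values, d)])

d = [3, 5, 8]                # hypothetical bit widths d_i of M = 3 CEs
frame = [2, 17, 200]         # coded values of the CEs in one frame
r = multi_hot(frame, d)
print(len(r))                # 2**3 + 2**5 + 2**8 = 296, i.e., the sum of 2**d_i
```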

However, some CEs occupy many bits, which would greatly increase the computational cost of the model: when one-hot coding is applied to these code elements, the length of the one-hot vector becomes very large. This explosive growth in dimensionality is unaffordable, so dimensionality reduction is needed. Therefore, a frequency-count method is employed on these CEs. Experiments prove that this is a simple but very effective coding method. Specifically, for each CE that occupies more than 8 bits, the occurrence frequency of each of its coded values is counted. The coded values are then arranged in descending order of frequency, and the first 255 values are selected and encoded as 0-254. The remaining coded values are encoded as 255. In this way, the coded values of all CEs can be mapped into 0-255.
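The remapping can be sketched as follows; the sample values stand in for a hypothetical wide (e.g., 13-bit) CE observed in a training corpus.

```python
from collections import Counter

def build_freq_map(observed_values, keep=255):
    """Map the `keep` most frequent coded values to 0..keep-1."""
    counts = Counter(observed_values)
    ranked = [v for v, _ in counts.most_common(keep)]
    return {v: i for i, v in enumerate(ranked)}

def remap(value, freq_map, overflow=255):
    """Coded values outside the top-255 list all collapse to 255."""
    return freq_map.get(value, overflow)

# hypothetical values of a wide CE collected from a training corpus
samples = [4096, 12, 12, 777, 12, 4096, 5000]
fmap = build_freq_map(samples)
print([remap(v, fmap) for v in samples])   # every output lies in 0-255
```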

A sparse representation $R$ can be obtained by applying multi-hot encoding. However, the sparse representation brings an additional computational cost to the model, which is unfavorable for the real-time requirements of steganalysis. Inspired by natural language processing tasks, an embedding method for each CE is introduced. First, dictionaries are built for each CE to convert the multi-hot vectors into a more compact intermediate feature representation. The dictionary parameters are randomly initialized from a normal distribution. The goal is an embedding representation that is robust to different embedding rates. Then, a large dataset consisting of different stego data and cover data is built to pre-train the dictionaries. At the pre-training stage, a two-layer bidirectional long short-term memory (Bi-LSTM) [32] network and a full connection layer are used, followed by a sigmoid activation function. The dictionaries and the training network are trained together, and the dictionaries are fixed once training is done. Before the steganalysis network classifies an input sample, the matrix $R$ is converted into the embedding matrix $E$ based on the trained dictionaries.
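A minimal PyTorch sketch of per-CE embedding dictionaries is shown below. After the frequency remapping, every CE value lies in 0-255, so each dictionary fits in an embedding table of at most 256 entries; the cardinalities and embedding dimension here are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CEEmbedding(nn.Module):
    def __init__(self, ce_cardinalities, emb_dim=8):
        super().__init__()
        # one dictionary (embedding table) per code element
        self.tables = nn.ModuleList(
            [nn.Embedding(c, emb_dim) for c in ce_cardinalities]
        )
        for t in self.tables:
            nn.init.normal_(t.weight)   # random normal initialization, as described

    def forward(self, x):
        # x: (batch, T, M) integer CE values -> E: (batch, T, M * emb_dim)
        cols = [t(x[..., i]) for i, t in enumerate(self.tables)]
        return torch.cat(cols, dim=-1)

cards = [8, 32, 256]   # hypothetical per-CE cardinalities after remapping
emb = CEEmbedding(cards)
x = torch.stack([torch.randint(0, c, (4, 100)) for c in cards], dim=-1)
E = emb(x)             # (4, 100, 24)
```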

2.2 Steganalysis Network

Since the front and back frames of a speech sample can influence each other, a two-layer Bi-LSTM is first employed to capture the long-term contextual correlations of $E$ and generate a better representation of each frame vector. However, Bi-LSTM can only capture long-range dependencies and lacks local CE information. Inferring inter-frame context from local CE information can capture both intra- and inter-frame relationships simultaneously, which is very important for low-bit-rate compressed speech steganalysis tasks. Global context information is useful for extracting a wide range of inter-frame dependencies and providing a comprehensive understanding of the entire input speech sequence, while local CE information plays a key role in understanding the secret information embedded at different CE positions. Based on this reasoning, the CE affinity attention module is proposed, which adaptively infers the global context information between frames under the guidance of the codeword affinity representation.
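The Bi-LSTM front end can be sketched in PyTorch as follows; the input and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# two-layer bidirectional LSTM over the embedded frame sequence E
bilstm = nn.LSTM(input_size=24, hidden_size=64, num_layers=2,
                 batch_first=True, bidirectional=True)

E = torch.randn(4, 100, 24)   # (batch, frames T, embedded CE features)
O, _ = bilstm(E)              # (4, 100, 128): context-aware frame representations
```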

The architecture of the CE affinity attention module is illustrated in Fig. 2. It consists of two branches: the first branch is used to calculate the local affinity attention vector, and the second deals with the feature representation $y$ at a single scale. Moreover, the second branch determines the amount of information contained in the local affinity vectors. Both branches are described in detail below.

In this paper, the output features calculated by the Bi-LSTM are defined as $O \in \mathbb{R}^{T\times S}$, where $T$ indicates the number of frames of the input data and $S$ the feature dimension. In the first branch, a global average pooling operation is first applied to the features $O$ to obtain the global information representation $g(O)$, which expresses the global inter-frame information. The process can be defined as

$$g(O_i)=\frac{1}{S}\sum_{j=1}^{S} o_{i,j},$$

where $o_{i,j}$ denotes the feature value at the $j$th position of the $i$th frame. Then, a frame-wise multiplication between the global information $g(O_i)$ and the input features $O$ is employed to obtain a new global-guided feature representation, which can be calculated by

$$A=\left[a_1, a_2, \ldots, a_T\right], \quad a_j=g(O_j)\cdot O_j,$$

where $a_j \in A$ indicates the affinity factor.
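A minimal sketch of this affinity step, as read from the two equations above (one reading of the description, not the authors' reference implementation):

```python
import torch

def ce_affinity(O):
    """Global average pooling per frame, then frame-wise multiplication."""
    # O: (batch, T, S) Bi-LSTM outputs
    g = O.mean(dim=2, keepdim=True)   # g(O_i): one scalar per frame, (batch, T, 1)
    A = g * O                         # affinity factors a_j = g(O_j) * O_j
    return A

O = torch.randn(4, 100, 128)
A = ce_affinity(O)                    # same shape as O
```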

To endow the features used for classification with both long-range dependencies and global inter-frame context, the features output from the Bi-LSTM and the codeword affinity module are integrated to form a more powerful feature representation. The integrated features are then fed into a classifier that consists of two fully connected layers and a sigmoid activation function. A prediction probability value $p$, which determines whether a hidden message exists in the input speech sequence, is then obtained:

$$p=\mathrm{sigmoid}\big(\mathrm{FC}_2(\mathrm{FC}_1(F))\big),$$

where $F$ denotes the integrated feature representation.
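The classification head can be sketched as follows; how the two feature streams are integrated is not specified in the surviving text, so an element-wise sum followed by frame pooling is assumed here, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """Two fully connected layers followed by a sigmoid."""
    def __init__(self, s=128, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(s, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, O, A):
        # integrate Bi-LSTM and affinity features (element-wise sum assumed),
        # then pool over frames before classification
        z = (O + A).mean(dim=1)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))

p = Head()(torch.randn(4, 100, 128), torch.randn(4, 100, 128))  # (4, 1) in (0, 1)
```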

3 Experiments and Results

Seven thousand speech segments were collected from the Internet, including samples from seven human voice categories, to form the speech database. Each category contains 1,000 speech segments. The seven categories are Chinese man, Chinese woman, English man, English woman, French, German, and Japanese. Each human voice category contains samples from more than five individuals. The duration of each speech segment is 10 s, and each segment is formatted as a mono PCM file with an 8,000-Hz sampling rate and 16-bit quantization. The speech segments in each category are divided into a training dataset and a testing dataset at a 4:1 ratio. The training dataset is used to adjust the parameters of the model, and the testing dataset is used to evaluate the model performance. The G.723.1 (6.3 kbit/s) and G.729 codecs are used to evaluate the performance of the proposed method.

Both the training and testing stages were executed on a GeForce GTX 2080 graphics processing unit with 11 GB of graphics memory. PyTorch was used to implement the model and algorithm. In the training process, Adam was used as the optimizer with a learning rate of $1\times 10^{-4}$, and cross-entropy was chosen as the loss function. The maximal number of training epochs was 200, and the batch size was 16.
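A minimal sketch of this training setup (Adam, learning rate 1e-4, cross-entropy, 200 epochs, batch size 16); the network and dataset are placeholder stand-ins, not the paper's model or data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())   # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss()                                  # binary cross-entropy

# dummy dataset standing in for the compressed-speech training features
data = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)).float())
loader = DataLoader(data, batch_size=16, shuffle=True)

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(1), y)
        loss.backward()
        optimizer.step()
```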

As mentioned above, three main categories of steganography methods exist for AbS-LPC low-bit-rate compressed speech. To comprehensively test the performance of the proposed model, a representative method was chosen for each steganography category [15,17,24]. For simplicity, the chosen methods are denoted "ACL" [15], "CNV" [17], and "HYF" [24]. It should be noted that the ACL and HYF methods are designed for the G.723.1 standard, while the CNV method is designed for the G.729 standard; nevertheless, all three methods were used for steganography under the G.723.1 standard.

To the best of our knowledge, no general method has been designed for the detection of steganographies in cross-standard AbS-LPC low-bit-rate compressed speech. The MFCC-based steganalysis method [33] can, in theory, detect any type of steganography based on the decoded audio/speech data; in this sense, it can be regarded as general as well. Besides, Hu et al. [34] proposed an SFFN-based general steganalysis method for specific coding standards. In the present paper, these methods are used as comparison algorithms with which to evaluate the proposed method.

The embedding rate is defined as the ratio of the number of embedded bits to the total embedding capacity. Experiments on the three steganography methods for the G.723.1 standard were conducted under five different embedding rates (20%-100%). The experimental results are shown in Tab. 1. For ACL, the detection accuracy of the MFCC method is only 51.58% when the embedding rate is 20%, slightly better than a random guess. As a comparison, the detection accuracy of the proposed method is 98.96%, far exceeding that of the MFCC method. However, SFFN achieves a detection accuracy of 99.54%, 0.58% higher than that of the proposed method. When the embedding rate is 40% or above, both SFFN and the proposed method have a detection accuracy of 100%. For HYF and CNV, when the embedding rate is 20%, the detection accuracies of the proposed method are 35.73% and 37.26% higher, respectively, than that of MFCC. By contrast, the detection accuracies of SFFN are only 8.48% and 12% higher than that of MFCC, respectively. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 95%, while the proposed method achieves the same accuracy when the embedding rate is only 20%.

Table 1: Detection accuracies of 10 s of speech with different embedding rates for the G.723.1 standard. Results in bold are for the proposed method

Since the ACL and HYF methods are designed for the G.723.1 standard, only the CNV method is used for steganography under the G.729 standard. Experiments on the CNV method were conducted under five different embedding rates (20%-100%). The experimental results are shown in Tab. 2, from which it can be seen that the proposed method performs better than MFCC and SFFN at all embedding rates. When the embedding rate is 20%, the detection accuracy of the proposed method is 32.73% higher than that of MFCC and 6.74% higher than that of SFFN. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 99%, while the proposed method achieves the same accuracy when the embedding rate is only 40%.

Table 2: Detection accuracies of 10 s of speech with different embedding rates for the G.729 standard. Results in bold are for the proposed method

In summary, the proposed method achieves the best results at all embedding rates under the G.723.1 and G.729 standards, except for ACL steganography at a 20% embedding rate under the G.723.1 standard, where its accuracy is 0.58% lower than that of SFFN. The experimental results indicate that the proposed steganalysis method is effective for detecting steganographies in cross-standard low-bit-rate compressed speech.

4 Conclusions

In this paper, a general method for detecting steganographies in cross-standard low-bit-rate compressed speech based on an intermediate feature representation is proposed. To detect multiple coding standards at the same time, the code element (CE) matrices corresponding to $m$ coding standards are first concatenated to obtain a synthetic CE matrix. Then, one-hot coding is utilized to convert this matrix into a form that is easy to use by a neural network. Inspired by ideas from natural language processing, dictionaries are built for each CE to transform the sparse vectors into more compact intermediate feature representations. These features are fed into the steganalysis network to obtain the final classification result. Experimental results indicate the superiority of the proposed method in accuracy and performance.

Funding Statement: This work is supported partly by the Hainan Provincial Natural Science Foundation of China under Grant No. 618QN309, partly by the Important Science & Technology Project of Hainan Province under Grant Nos. ZDKJ201807 and ZDKJ2020010, partly by the Scientific Research Foundation Project of Haikou Laboratory, Institute of Acoustics, Chinese Academy of Sciences, and partly by the IACAS Young Elite Researcher Project (QNYC201829 and QNYC201747).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
