
General Steganalysis Method of Compressed Speech Under Different Standards

Computers, Materials & Continua, 2021, Issue 8

Peng Liu, Songbin Li*, Qiandong Yan, Jingang Wang and Cheng Zhang

1 Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China

2 The University of Melbourne, Melbourne, VIC 3010, Australia

Abstract: Analysis-by-synthesis linear predictive coding (AbS-LPC) is widely used in a variety of low-bit-rate speech codecs. Most of the current steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are designed for a specific coding standard or category of steganography methods, and thus lack generalization capability. In this paper, a general steganalysis method for detecting steganographies in low-bit-rate compressed speech under different standards is proposed. First, the code-element matrices corresponding to different coding standards are concatenated to obtain a synthetic code-element matrix, which is mapped into an intermediate feature representation by utilizing pre-trained dictionaries. Then, bidirectional long short-term memory is employed to capture long-term contextual correlations. Finally, a code-element affinity attention mechanism is used to capture the global inter-frame context, and a full connection structure is used to generate the prediction result. Experimental results show that the proposed method is effective and outperforms the comparison methods in detecting steganographies in cross-standard low-bit-rate compressed speech.

Keywords: Cross-standard; compressed speech; steganalysis; attention

1 Introduction

Data hiding is a technique of embedding secrets into digital media imperceptibly, and different types of media data are considered for steganography, including image [1,2], text [3,4], and video [5,6]. In recent years, with the continuous growth of network bandwidth and the enhancement of network convergence, network streaming media services for communication have undergone unprecedented development. Since Voice over Internet Protocol (VoIP) technology [7,8] has been widely used for real-time communication, it has become an excellent carrier for transmitting secret information over the Internet. VoIP steganography is a means of imperceptibly embedding secret information into VoIP-based cover speech. There are many VoIP speech codecs, including G.711, G.723.1, G.726, G.728, G.729, internet Low Bitrate Codec (iLBC), and the Adaptive Multi-Rate (AMR) codec. Most of them, including G.723.1, G.729, AMR, and iLBC, are low-bit-rate speech codecs that use analysis-by-synthesis linear predictive coding (AbS-LPC) [9]. At present, most speech steganography methods utilize AbS-LPC low-bit-rate speech codecs to embed secret information for covert communication. Therefore, it is essential to develop a powerful steganalysis method to analyze low-bit-rate speech streams.

Information-hiding methods based on low-bit-rate speech streams can be divided into three categories according to the embedding position: the first category uses a pitch synthesis filter for information hiding [10-16], the second uses an LPC synthesis filter to hide information [17-22], and the third embeds information by directly modifying the values of some code elements in the compressed speech stream [23-30].

Figure 1: Difference between different levels of general steganalysis methods: (a) non-general steganalysis method; (b) C1- and (c) C2-level general steganalysis methods

The existing steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are designed for a specific coding standard or category of steganography methods, and thus lack generalization capacity. When general steganalysis is required, it is complex and time-consuming to enumerate all the steganalysis methods that correspond to the steganographic methods, which makes it difficult to meet the requirements of practical applications. In this paper, the generality of steganalysis algorithms is divided into two levels: one is generality across different steganography algorithms under the same compression standard, and the other is generality across steganography algorithms under different standards. To interpret the idea of the proposed method, the first is referred to as C1 and the second as C2. A general steganalysis algorithm of the C1 level can effectively detect different information-hiding algorithms (e.g., quantization index modulation [31]) under the same standard, such as G.729. A general steganalysis method of the C2 level can detect different information-hiding algorithms under an arbitrary standard. For example, to achieve general steganalysis of different coding standards, if non-general steganography detection methods are used, it is necessary to jointly use multiple steganalysis methods for different coding standards and different steganography methods, as shown in Fig. 1a. As demonstrated in Fig. 1b, different methods must still be combined across coding standards when using steganalysis methods of the C1 level. As shown in Fig. 1c, only one detection method of the C2 level is needed. Obviously, the ideal steganalysis method achieves C2-level generality, which is also the research focus of this paper.

Since speech signals are encoded by different encoding standards, the number of code elements (CEs) and their connotations differ considerably. Therefore, it is unrealistic to perform C2-level general steganalysis directly on the original compressed speech stream. In this paper, the compressed speech streams of different coding standards are first converted into an intermediate feature representation. Then, a classification network based on a CE affinity attention mechanism is built to accomplish steganalysis.

2 Proposed Method

The architecture of the proposed steganalysis method is illustrated in Fig. 2. It can be divided into two parts: intermediate feature representation and the steganalysis network. The intermediate feature representation is mainly used to convert compressed speech data under different coding standards into a general intermediate feature representation, and the steganalysis network performs steganalysis based on the intermediate feature. The details are described below.

Figure 2: Architecture of the proposed method. It consists of two parts: intermediate feature representation and the steganalysis network. The code elements of a speech sample are first converted to an intermediate feature representation. Then, a steganalysis network based on a code-element affinity attention module is employed to detect whether the speech contains hidden information

2.1 Intermediate Feature Representation

Assuming that one must detect $m$ types of coding standards at the same time, the CE matrix $X_i$ corresponding to the $i$th coding standard can be expressed as

$$X_i=\begin{bmatrix} x_{1,1}^{(i)} & x_{1,2}^{(i)} & \cdots & x_{1,N_i}^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{T,1}^{(i)} & x_{T,2}^{(i)} & \cdots & x_{T,N_i}^{(i)} \end{bmatrix},$$

where $N_i$ is the number of CEs in a frame corresponding to the $i$th coding standard, and $x_{T,N_i}^{(i)}$ is the value of the $N_i$th CE in frame $T$. To detect different coding standards at the same time, the CE matrices corresponding to the $m$ coding standards are concatenated to obtain a synthetic CE matrix $X$:

$$X=\left[X_1, X_2, \ldots, X_m\right],$$

where $x_{T,N_m}^{(m)}$ is the value of the $N_m$th CE in frame $T$ corresponding to the $m$th coding standard.
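As a concrete illustration, the following NumPy sketch builds a synthetic CE matrix from two hypothetical coding standards; the frame count, per-frame CE counts, and value ranges are placeholders for illustration, not figures taken from the paper.

```python
import numpy as np

T = 100                      # number of frames in the sample (hypothetical)
N = [24, 18]                 # hypothetical per-frame CE counts for m = 2 standards

# X_i has shape (T, N_i); random coded values stand in for real CE streams
X_list = [np.random.randint(0, 256, size=(T, n)) for n in N]

# Synthetic CE matrix: frames stay aligned, CE columns of all standards are joined
X = np.concatenate(X_list, axis=1)
print(X.shape)               # (100, 42)
```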

To convert the values of CEs into a form that is easy to use by the neural network, one-hot coding is utilized to map each CE into a feature vector. For a CE that occupies $n$ bits, its coded value range is $0$ to $2^n-1$. In one-hot encoding, a vector with a length of $2^n$ is used to represent this CE. If the coded value of this CE is $u$, the one-hot representation can be denoted as

$$\mathbf{h}(u)=\left[h_0, h_1, \ldots, h_{2^n-1}\right],$$

where

$$h_k=\begin{cases}1, & k=u\\ 0, & \text{otherwise.}\end{cases}$$

After one-hot coding, a group of independent CE one-hot representations is obtained; these are then aggregated in the order of the original CEs to form a long feature vector, called a multi-hot vector. This process is called multi-hot coding. Taking the example in which there are $M$ code elements in a frame, the length $L$ of the corresponding multi-hot vector can be calculated as

$$L=\sum_{i=1}^{M} 2^{d_i},$$

where $d_i$ denotes the number of bits that the $i$th CE occupies.
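A minimal sketch of one-hot and multi-hot coding for a single frame is given below; the bit widths and coded values are hypothetical.

```python
import numpy as np

def one_hot(u, n_bits):
    """One-hot vector of length 2**n_bits with a single 1 at position u."""
    v = np.zeros(2 ** n_bits, dtype=np.float32)
    v[u] = 1.0
    return v

def multi_hot(values, d):
    """Concatenate the one-hot vectors of all CEs in one frame."""
    return np.concatenate([one_hot(u, n) for u, n in zip(values, d)])

d = [3, 5, 8]                # hypothetical bit widths d_i of M = 3 CEs
frame = [2, 17, 200]         # coded values of the CEs in one frame
r = multi_hot(frame, d)
print(len(r))                # 2**3 + 2**5 + 2**8 = 296, i.e., the sum of 2**d_i
```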

However, some CEs occupy many bits, which would greatly increase the computational cost of the model: when one-hot coding is applied to these code elements, the length of the one-hot vector becomes very large. This explosive growth in dimensionality is unaffordable, so dimensionality reduction is needed. Therefore, a frequency-count method is employed on these CEs. Experiments prove that this is a simple but very effective coding method. Specifically, for each CE that occupies more than 8 bits, the occurrence frequency of each of its coded values is counted. The coded values are then arranged in descending order of frequency, and the first 255 values are selected and encoded as 0-254. The remaining coded values are encoded as 255. In this way, the coded values of all CEs can be mapped into 0-255.
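The remapping can be sketched as follows; the sample values stand in for a hypothetical wide (e.g., 13-bit) CE observed in a training corpus.

```python
from collections import Counter

def build_freq_map(observed_values, keep=255):
    """Map the `keep` most frequent coded values to 0..keep-1."""
    counts = Counter(observed_values)
    ranked = [v for v, _ in counts.most_common(keep)]
    return {v: i for i, v in enumerate(ranked)}

def remap(value, freq_map, overflow=255):
    """Coded values outside the top-255 list all collapse to 255."""
    return freq_map.get(value, overflow)

# hypothetical values of a wide CE collected from a training corpus
samples = [4096, 12, 12, 777, 12, 4096, 5000]
fmap = build_freq_map(samples)
print([remap(v, fmap) for v in samples])   # every output lies in 0-255
```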

A sparse representation $R$ can be obtained by applying multi-hot encoding. However, the sparse representation brings an additional computational cost to the model, which is unfavorable for the real-time requirements of steganalysis. Inspired by natural language processing tasks, an embedding method for each CE is introduced. First, dictionaries are built for each CE to convert the multi-hot vectors into a more compact intermediate feature representation. The dictionary parameters are randomly initialized from a normal distribution. The goal is an embedding representation that is robust to different embedding rates. Then, a large dataset consisting of different stego data and cover data is built to pre-train the dictionaries. At the pre-training stage, a two-layer bidirectional long short-term memory (Bi-LSTM) [32] network and a full connection layer are used, followed by a sigmoid activation function. The dictionaries and the training network are trained together, and the dictionaries are fixed once training is done. Before the steganalysis network classifies an input sample, the matrix $R$ is converted into the embedding matrix $E$ based on the trained dictionaries.
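A minimal PyTorch sketch of per-CE embedding dictionaries is shown below. After the frequency remapping, every CE value lies in 0-255, so each dictionary fits in an embedding table of at most 256 entries; the cardinalities and embedding dimension here are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CEEmbedding(nn.Module):
    def __init__(self, ce_cardinalities, emb_dim=8):
        super().__init__()
        # one dictionary (embedding table) per code element
        self.tables = nn.ModuleList(
            [nn.Embedding(c, emb_dim) for c in ce_cardinalities]
        )
        for t in self.tables:
            nn.init.normal_(t.weight)   # random normal initialization, as described

    def forward(self, x):
        # x: (batch, T, M) integer CE values -> E: (batch, T, M * emb_dim)
        cols = [t(x[..., i]) for i, t in enumerate(self.tables)]
        return torch.cat(cols, dim=-1)

cards = [8, 32, 256]   # hypothetical per-CE cardinalities after remapping
emb = CEEmbedding(cards)
x = torch.stack([torch.randint(0, c, (4, 100)) for c in cards], dim=-1)
E = emb(x)             # (4, 100, 24)
```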

2.2 Steganalysis Network

Since the front and back frames of a speech sample can influence each other, a two-layer Bi-LSTM is first employed to capture the long-term contextual correlations of $E$ and generate a better representation of each frame vector. However, Bi-LSTM can only capture long-range dependencies and lacks local CE information. Inferring inter-frame context from local CE information can capture both intra- and inter-frame relationships simultaneously, which is very important for low-bit-rate compressed speech steganalysis tasks. Global context information is useful for extracting a wide range of inter-frame dependencies and providing a comprehensive understanding of the entire input speech sequence, while local CE information plays a key role in understanding the secret information embedded at different CE positions. Based on this reasoning, the CE affinity attention module is proposed, which adaptively infers the global context information between frames under the guidance of the codeword affinity representation.
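The Bi-LSTM front end can be sketched in PyTorch as follows; the input and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# two-layer bidirectional LSTM over the embedded frame sequence E
bilstm = nn.LSTM(input_size=24, hidden_size=64, num_layers=2,
                 batch_first=True, bidirectional=True)

E = torch.randn(4, 100, 24)   # (batch, frames T, embedded CE features)
O, _ = bilstm(E)              # (4, 100, 128): context-aware frame representations
```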

The architecture of the CE affinity attention module is illustrated in Fig. 2. It consists of two branches: the first branch is used to calculate the local affinity attention vector, and the second deals with the feature representation $y$ at a single scale. Moreover, the second branch determines the amount of information contained in the local affinity vectors. Both branches are described in detail below.

In this paper, the output features calculated by the Bi-LSTM are defined as $O \in \mathbb{R}^{T\times S}$, where $T$ indicates the number of frames of the input data and $S$ the feature dimension. In the first branch, a global average pooling operation is first applied to the features $O$ to obtain the global information representation $g(O)$, which expresses the global inter-frame information. The process can be defined as

$$g(O_i)=\frac{1}{S}\sum_{j=1}^{S} o_{i,j},$$

where $o_{i,j}$ denotes the feature value at the $j$th position of the $i$th frame. Then, a frame-wise multiplication between the global information $g(O_i)$ and the input features $O$ is employed to obtain a new global-guided feature representation, which can be calculated by

$$A=\left[a_1, a_2, \ldots, a_T\right], \quad a_j=g(O_j)\cdot O_j,$$

where $a_j \in A$ indicates the affinity factor.
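A minimal sketch of this affinity step, as read from the two equations above (one reading of the description, not the authors' reference implementation):

```python
import torch

def ce_affinity(O):
    """Global average pooling per frame, then frame-wise multiplication."""
    # O: (batch, T, S) Bi-LSTM outputs
    g = O.mean(dim=2, keepdim=True)   # g(O_i): one scalar per frame, (batch, T, 1)
    A = g * O                         # affinity factors a_j = g(O_j) * O_j
    return A

O = torch.randn(4, 100, 128)
A = ce_affinity(O)                    # same shape as O
```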

To endow the features used for classification with both long-range dependencies and global inter-frame context, the features output from the Bi-LSTM and the codeword affinity module are integrated to form a more powerful feature representation. The integrated features are then fed into a classifier that consists of two fully connected layers and a sigmoid activation function. A prediction probability value $p$, which determines whether a hidden message exists in the input speech sequence, is then obtained:

$$p=\mathrm{sigmoid}\big(\mathrm{FC}_2(\mathrm{FC}_1(F))\big),$$

where $F$ denotes the integrated feature representation.
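The classification head can be sketched as follows; how the two feature streams are integrated is not specified in the surviving text, so an element-wise sum followed by frame pooling is assumed here, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """Two fully connected layers followed by a sigmoid."""
    def __init__(self, s=128, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(s, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, O, A):
        # integrate Bi-LSTM and affinity features (element-wise sum assumed),
        # then pool over frames before classification
        z = (O + A).mean(dim=1)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))

p = Head()(torch.randn(4, 100, 128), torch.randn(4, 100, 128))  # (4, 1) in (0, 1)
```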

3 Experiments and Results

Seven thousand speech segments were collected from the Internet, including samples from seven human voice categories, to form the speech database. Each category contains 1,000 speech segments. The seven categories are Chinese man, Chinese woman, English man, English woman, French, German, and Japanese. Each human voice category contains samples from more than five individuals. The duration of each speech segment is 10 s, and each segment is formatted as a mono PCM file with an 8,000-Hz sampling rate and 16-bit quantization. The speech segments in each category are divided into a training dataset and a testing dataset at a 4:1 ratio. The training dataset is used to adjust the parameters of the model, and the testing dataset is used to evaluate the model performance. The G.723.1 (6.3 kbit/s) and G.729 codecs are used to evaluate the performance of the proposed method.

Both the training and testing stages were executed on a GeForce GTX 2080 graphics processing unit with 11 GB of graphics memory. PyTorch was used to implement the model and algorithm. In the training process, Adam was used as the optimizer with a learning rate of $1\times 10^{-4}$, and cross-entropy was chosen as the loss function. The maximal number of training epochs was 200, and the batch size was 16.
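A minimal sketch of this training setup (Adam, learning rate 1e-4, cross-entropy, 200 epochs, batch size 16); the network and dataset are placeholder stand-ins, not the paper's model or data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())   # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss()                                  # binary cross-entropy

# dummy dataset standing in for the compressed-speech training features
data = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)).float())
loader = DataLoader(data, batch_size=16, shuffle=True)

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(1), y)
        loss.backward()
        optimizer.step()
```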

As mentioned above, three main categories of steganography methods exist for AbS-LPC low-bit-rate compressed speech. To comprehensively test the performance of the proposed model, a representative method was chosen for each steganography category [15,17,24]. For simplicity, the chosen methods are denoted "ACL" [15], "CNV" [17], and "HYF" [24]. It should be noted that the ACL and HYF methods are designed for the G.723.1 standard, while the CNV method is designed for the G.729 standard; nevertheless, all three methods were used for steganography under the G.723.1 standard.

To the best of our knowledge, no general method has been designed for the detection of steganographies in cross-standard AbS-LPC low-bit-rate compressed speech. The MFCC-based steganalysis method [33] can, in theory, detect any type of steganography based on the decoded audio/speech data; in this sense, it can be regarded as general as well. Besides, Hu et al. [34] proposed an SFFN-based general steganalysis method for specific coding standards. In the present paper, these methods are used as comparison algorithms with which to evaluate the proposed method.

The embedding rate is defined as the ratio of the number of embedded bits to the total embedding capacity. Experiments on the three steganography methods for the G.723.1 standard were conducted under five different embedding rates (20%-100%). The experimental results are shown in Tab. 1. For ACL, the detection accuracy of the MFCC method is only 51.58% when the embedding rate is 20%, slightly better than a random guess. As a comparison, the detection accuracy of the proposed method is 98.96%, far exceeding that of the MFCC method. However, SFFN achieves a detection accuracy of 99.54%, 0.58% higher than that of the proposed method. When the embedding rate is 40% or above, both SFFN and the proposed method have a detection accuracy of 100%. For HYF and CNV, when the embedding rate is 20%, the detection accuracies of the proposed method are 35.73% and 37.26% higher, respectively, than that of MFCC. By contrast, the detection accuracies of SFFN are only 8.48% and 12% higher than that of MFCC, respectively. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 95%, while the proposed method achieves the same accuracy when the embedding rate is only 20%.

Table 1: Detection accuracies of 10 s of speech with different embedding rates for the G.723.1 standard. Results in bold are for the proposed method

Since the ACL and HYF methods are designed for the G.723.1 standard, only the CNV method is used for steganography under the G.729 standard. Experiments on the CNV method were conducted under five different embedding rates (20%-100%). The experimental results are shown in Tab. 2, from which it can be seen that the proposed method performs better than MFCC and SFFN at all embedding rates. When the embedding rate is 20%, the detection accuracy of the proposed method is 32.73% higher than that of MFCC and 6.74% higher than that of SFFN. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 99%, while the proposed method achieves the same accuracy when the embedding rate is only 40%.

Table 2: Detection accuracies of 10 s of speech with different embedding rates for the G.729 standard. Results in bold are for the proposed method

In summary, the proposed method achieves the best results at all embedding rates under the G.723.1 and G.729 standards, except for ACL steganography at a 20% embedding rate under the G.723.1 standard, where its accuracy is 0.58% lower than that of SFFN. The experimental results indicate that the proposed steganalysis method is effective for detecting steganographies in cross-standard low-bit-rate compressed speech.

4 Conclusions

In this paper, a general method for detecting steganographies in cross-standard low-bit-rate compressed speech based on an intermediate feature representation is proposed. To detect multiple coding standards at the same time, the code element (CE) matrices corresponding to $m$ coding standards are first concatenated to obtain a synthetic CE matrix. Then, one-hot coding is utilized to convert this matrix into a form that is easy to use by a neural network. Inspired by ideas from natural language processing, dictionaries are built for each CE to transform the sparse vectors into more compact intermediate feature representations. These features are fed into the steganalysis network to obtain the final classification result. Experimental results indicate the superiority of the proposed method in accuracy and performance.

Funding Statement: This work is supported partly by the Hainan Provincial Natural Science Foundation of China under Grant No. 618QN309, partly by the Important Science & Technology Project of Hainan Province under Grant Nos. ZDKJ201807 and ZDKJ2020010, partly by the Scientific Research Foundation Project of Haikou Laboratory, Institute of Acoustics, Chinese Academy of Sciences, and partly by the IACAS Young Elite Researcher Project (QNYC201829 and QNYC201747).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
