Limin Pan, Xiaonan Qin and Senlin Luo
(Information System and Security & Countermeasures Experimental Center, Beijing Institute of Technology, Beijing 100081, China)
Abstract: To implement robust cluster analysis and solve the problem that outliers in the data seriously disturb the estimation of the probability density parameters and therefore reduce the accuracy of clustering, a robust cluster analysis method is proposed based on the diversity self-paced t-mixture model. The model first adopts the t-distribution, whose tail behavior is easily controllable, as the sub-model. On this basis, it utilizes the entropy penalty expectation conditional maximization (EPECM) algorithm as a pre-clustering step to estimate the initial parameters. The model then introduces the l2,1-norm as a self-paced regularization term and develops a new ECM optimization algorithm in order to select high-confidence samples from each component during training. Finally, experimental results on several real-world datasets in different noise environments show that the diversity self-paced t-mixture model outperforms state-of-the-art clustering methods. It provides significant guidance for the construction of robust mixture distribution models.
Key words: cluster analysis; Gaussian mixture model; t-distribution mixture model; self-paced learning; initialization
Clustering is a method of dividing data into clusters of similar components according to specified similarity criteria in order to explore hidden characteristics in a data set[1]. It has been widely used in pattern recognition, speech recognition, image analysis and biomedical fields[2-4]. From the statistical point of view, clustering methods can be divided into two types of approaches: probabilistic model-based approaches and nonparametric approaches. Probabilistic model-based approaches statistically model the target task by assuming that the dataset follows a certain mixture of probability distributions. Such models can provide the probability that a sample belongs to each component and achieve relatively high scalability by changing the distribution or the number of clusters.
The most representative mixture distribution model is the Gaussian mixture model (GMM)[5]. The GMM can smoothly approximate any continuous density function to an expected accuracy by taking a sufficient linear combination of Gaussian distributions[6]. As a flexible probabilistic model, the GMM can provide the probability that a sample belongs to each class and performs better than other clustering algorithms on data with overlapping clusters. The GMM is often used to describe the distribution shape of a cluster category by its mean and variance. The parameters of a GMM are usually estimated with the expectation maximization (EM) algorithm[7].
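As an illustration, this soft-membership behavior of the GMM can be reproduced with a few lines of scikit-learn (a generic sketch on toy data, not the implementation used in this paper):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters as toy data.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(3.0, 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
resp = gmm.predict_proba(X)   # probability that each sample belongs to each component
labels = resp.argmax(axis=1)  # hard assignment derived from the soft memberships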
Although it is well known that the GMM describes many statistical problems with good results, a more robust mixture model is sometimes necessary to handle the presence of atypical values or outliers, which affect the component means and covariance matrices[8]. Compared with the GMM, the t-distribution mixture model (TMM)[9] changes the component distribution to improve robustness. The t-distribution has a heavier tail and can be written as a combination of a Gaussian distribution and a Gamma distribution. The TMM can reduce the impact of outliers on the model. However, the TMM still cannot eliminate the influence of outliers[10], and as the feature dimension increases, outliers become harder to find and eliminate.
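The scale-mixture construction behind this heavier tail can be sketched as follows (a minimal illustration of the standard Gaussian-Gamma representation; the function name is ours): if z ~ N(0, Σ) and u = χ²_ν/ν is a Gamma-distributed scaling variable, then μ + z/√u follows a multivariate t-distribution with ν degrees of freedom.

import numpy as np

def sample_multivariate_t(mu, sigma, nu, n, rng):
    # z ~ N(0, Sigma); u = chi2(nu)/nu is the Gamma-distributed scaling variable
    z = rng.multivariate_normal(np.zeros(len(mu)), sigma, size=n)
    u = rng.chisquare(nu, size=n) / nu
    return mu + z / np.sqrt(u)[:, None]   # smaller nu gives heavier tails

rng = np.random.default_rng(0)
outlier_prone = sample_multivariate_t(np.zeros(2), np.eye(2), nu=3.0, n=1000, rng=rng)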
In this paper, a robust cluster analysis method is proposed based on the diversity self-paced t-distribution mixture model (DSP-TMM). To improve robustness, the model combines the TMM with self-paced learning and develops a new optimization algorithm based on the ECM algorithm, which can pick out high-confidence samples and eliminate outliers. However, since the optimization algorithm starts from a small dataset, it increases the sensitivity to the initial parameters. To address this problem, DSP-TMM utilizes the entropy penalty expectation conditional maximization (EPECM) algorithm to set up the initial parameters and make the model robust. In summary, the contribution of this paper is threefold.
① A two-phase self-paced robust cluster model which can eliminate the effects of outliers is proposed;
② To improve the stability of this framework, the EPECM algorithm is proposed to adaptively estimate parameters and component quantities in pre-clustering;
③ The performance of DSP-TMM is evaluated experimentally and the results are compared with other advanced clustering algorithms.
There is a wide variety of methods focused on constructing a robust mixture distribution model. The main solution is to adopt a thick-tailed distribution to eliminate the effects of outliers. Bhowmick et al. proposed the Laplace distribution mixture model (LMM)[11], a linear combination of Laplacian distributions, which are heavy-tailed. Peel et al. proposed the t-distribution mixture model[12]. France et al. proposed the skew-t distribution mixture model[13]. Moreover, Lin et al. proposed robust mixtures of factor analysis models using the restricted multivariate skew-t distribution[14]. Wei et al. combined mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for robust clustering with incomplete data[15]. The most commonly used robust model is the TMM, which exploits the heavy-tailed property of the t-distribution for robustness.


However, the above methods still cannot eliminate the influence of outliers. To further improve the robustness of clustering, Zhang et al. first introduced self-paced learning to the TMM and proposed the SPTMM algorithm[10]. Self-paced learning (SPL) is a recent algorithm developed from curriculum learning (CL)[16], which is inspired by human learning principles. The basic principle of SPTMM is to start with easy samples and then gradually bring more complex examples into the training, as sketched after this paragraph. SPTMM therefore effectively eliminates outliers and proves that the combination of SPL and the TMM can improve the robustness of the model. However, SPTMM does not consider the differences in sample loss between clusters in the iterative training process. Moreover, because the introduction of SPL intensifies the sensitivity to initial value settings, a pre-clustering method that estimates the initial parameters is needed before clustering.
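The core SPL mechanism can be written in a few lines (a generic sketch of the classical hard-threshold regularizer, not the exact SPTMM update): minimizing Σ_i v_i L_i − λ Σ_i v_i over weights v_i ∈ [0, 1] gives v_i = 1 exactly when the sample loss L_i is below the age parameter λ, which is increased every iteration so that harder samples gradually enter training.

import numpy as np

def spl_weights(losses, lam):
    # closed-form solution of the hard self-paced regularizer
    return (losses < lam).astype(float)

losses = np.array([0.2, 1.5, 0.4, 3.0])
for lam in (0.5, 2.0, 4.0):        # growing "pace" admits more samples over time
    print(lam, spl_weights(losses, lam))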
Considering that variations in the clustering results are often random and uncertain due to the uncertainty of the initial parameters, a pre-clustering method should be applied before running a clustering model that uses the EM algorithm. The simplest random initialization schemes are random selection and the emEM method. Random selection picks random feature vectors from the dataset as component means and uses mixing proportions equal to 1/G. However, this method can lead to cluster instability due to its randomness. The idea of emEM is to perform several short EM runs using different random starting points and a lax convergence criterion[17] (see the sketch after this paragraph). The most widely used approach is to introduce other clustering algorithms to construct a robust mixture model, including K-Means[18], hierarchical clustering, density peak clustering[19], etc.
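The emEM idea can be sketched with scikit-learn as follows (an illustrative reading of Ref. [17]; run counts and tolerances are assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

def em_em(X, n_components, n_short=10, seed=0):
    best, best_ll = None, -np.inf
    for s in range(n_short):
        # short EM run: few iterations, loose convergence criterion, random start
        gm = GaussianMixture(n_components=n_components, max_iter=5, tol=1e-2,
                             init_params="random", random_state=seed + s).fit(X)
        ll = gm.score(X)                 # mean log-likelihood of this short run
        if ll > best_ll:
            best, best_ll = gm, ll
    # long EM run, warm-started from the best short run's parameters
    final = GaussianMixture(n_components=n_components, max_iter=500, tol=1e-6,
                            means_init=best.means_, weights_init=best.weights_,
                            precisions_init=best.precisions_, random_state=seed)
    return final.fit(X)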
However, the previously mentioned methods are only suitable for well-separated clusters[20]. Some methods estimate the number of clusters while estimating the parameters. Such algorithms are mainly classified into the greedy method[21], the local splitting method and the entropy penalty EM algorithm[22]. Pre-clustering is completed by maximizing the target likelihood function to estimate both the parameters and the number of components of the model. The entropy penalty EM algorithm adds a penalty term to the objective function and discards the components that do not meet the requirements during the iteration, so that the algorithm completes parameter estimation and component-number estimation simultaneously. Compared with the previous initialization algorithms, this algorithm runs faster and performs better, and it is suitable for data with overlapping clusters.
Therefore, in order to make DSP-TMM adaptively complete the estimation of the initial parameters and the number of components, the EPECM algorithm is constructed based on the ECM algorithm as a pre-clustering step. EPECM also makes the clustering model robust to the initial parameters and improves the performance of self-paced learning. To further reduce the effect of outliers and maintain sample selection diversity, our model introduces the l2,1-norm as a self-paced regularization term to make sure that a balanced number of high-confidence samples is selected from each component; a simple sketch of such balanced selection is given below. At the same time, an optimization algorithm is proposed to solve the corresponding optimization problem. The details of DSP-TMM are introduced in the next section.
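One simple way to realize this balanced selection (our sketch; the paper's actual update is derived from the l2,1-norm regularizer and its ECM solution) is to take an equal budget of lowest-loss samples from every component:

import numpy as np

def diverse_select(losses, assignments, n_components, n_select):
    per_cluster = max(1, n_select // n_components)   # balanced budget per component
    chosen = []
    for j in range(n_components):
        idx = np.where(assignments == j)[0]
        order = idx[np.argsort(losses[idx])]         # most confident (lowest loss) first
        chosen.extend(order[:per_cluster])
    return np.array(sorted(chosen))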
In this section, the implementation details of DSP-TMM are described. The training process of the DSP-TMM algorithm is demonstrated in Fig. 1. The whole framework is divided into two phases. The first phase is pre-clustering, in which the EPECM algorithm based on the t-distribution is used to obtain the specified number of components and the initialization parameters. The number of components and the parameters constitute the input to the second phase. In the second phase, a new optimization algorithm developed for DSP-TMM eliminates outliers and finally obtains the model parameters and the identification of each sample.

Fig. 1 Implementation process of the DSP-TMM algorithm
To address the problem that the ECM optimization adopted by DSP-TMM increases the sensitivity to the initial parameters, pre-clustering is carried out in the first phase. The EPECM algorithm adaptively fits the data to estimate the parameters and determine the number of components. It adds a penalty term on the mixing coefficients to the objective function E_0, so that competition between the components is encouraged in the iterative process. Components whose mixing coefficients fall below a set threshold are discarded during the iteration, and the final parameters are obtained when the number of components is stable. The objective function of EPECM is



The details of the EPECM are summarized in Algorithm 1.

Algorithm 1: Entropy penalty expectation conditional maximization algorithm
Input: training samples {x1, x2, ..., xn}, initial component number G_0, the threshold of mixing coefficient ε/n.
Initialization: π_j = 1/G_0, μ_j^(0) drawn from X, Σ_j^(0) = I_{d×d}, β^(0) = 1, ν_j^(0) = 100 (j = 1, 2, ..., G_0)
1. while G is not stable do
2.   Update τ_ij^(k) to τ_ij^(k+1) and u_ij^(k) to u_ij^(k+1) in the E step (Eqs. (5)(6));
3.   Update ω_j^(k) to ω_j^(k+1) (Eq. (12));
4.   If ω_j^(k+1) < ε/n, then discard the component (Eq. (13));
5.   Update β^(k) to β^(k+1), normalize the ω_j^(k+1) (Eq. (14)) and adjust τ_ij^(k+1), u_ij^(k+1);
6.   Update G_k to G_{k+1} (Eq. (15));
7.   Update μ_j^(k) to μ_j^(k+1), Σ_j^(k) to Σ_j^(k+1), ν_j^(k) to ν_j^(k+1) (Eqs. (8)-(10));
8. end
Output: the final parameters Θ_j (j = 1, 2, ..., G)
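The following sketch illustrates only the pruning mechanic of Algorithm 1 (an assumed simplification: the t-distribution E and CM updates of Eqs. (5)-(15) are omitted), namely discarding components whose mixing coefficient falls below ε/n and renormalizing the survivors:

import numpy as np

def prune_components(omega, eps, n):
    # discard components whose mixing coefficient falls below eps/n,
    # then renormalize the surviving mixture
    keep = omega >= eps / n
    omega = omega[keep]
    return omega / omega.sum(), keep   # "keep" says which tau/u columns to drop

omega = np.array([0.30, 0.25, 0.002, 0.448])
omega_new, keep = prune_components(omega, eps=1.0, n=400)   # threshold eps/n = 0.0025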



In summary, the pseudo code of this optimization algorithm for DSP-TMM can be described as Algorithm 2.

Algorithm 2: Optimization algorithm for DSP-TMM
Input: training samples {x1, x2, ..., xn}, component number G, initial parameters Θ from EPECM, final sample selection number n_end, and self-paced parameter α
Initialization: the selected sample number n_select
1. while n_select ≤ n_end do
2.   Update the responsibilities τ_ij and weights u_ij under the current parameters in the E step;
3.   Select the n_select highest-confidence samples, balanced over the G components, according to the l2,1-norm self-paced regularization term;
4.   Update Θ in the CM steps using only the selected samples;
5.   n_select ← min(α·n_select, n_end);
6. end
Output: the final parameters Θ_j (j = 1, 2, ..., G) and the identification of each sample
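Under our reading of Algorithm 2 (the pacing rule n_select ← min(α·n_select, n_end) and the starting budget are assumptions; the E step, the diverse selection and the CM step are passed in as callables), the outer loop can be sketched as:

import numpy as np

def dsp_tmm_loop(X, theta, n_end, alpha, e_step, select_diverse, cm_step,
                 n_select0=None):
    n = len(X)
    n_select = n_select0 or max(1, int(0.2 * n))      # assumed starting budget
    while True:
        losses, resp = e_step(X, theta)               # responsibilities and sample losses
        idx = select_diverse(losses, resp, n_select)  # balanced high-confidence subset
        theta = cm_step(X[idx], resp[idx], theta)     # CM updates on selected samples only
        if n_select >= n_end:
            break
        n_select = min(int(np.ceil(alpha * n_select)), n_end)
    return theta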
3 Experiments

In this section, two experiments are conducted to prove the effectiveness and efficiency of the proposed method. In the first set of experiments, DSP-TMM is evaluated for its density estimation performance on synthetic datasets and tested for its anti-noise ability in different noisy environments. To further verify its utility, the algorithm is applied to two real-world datasets: CHNS for the diagnosis of diabetes and MNIST for handwriting recognition. The following experiments are performed on a Windows 7 operating system, powered by an Intel(R) Core(TM) i5-3470 CPU @ 3.20 GHz and 12 GB of RAM.

3.1 Density estimation on synthetic datasets

3.1.1 Datasets

The dataset consists of 4 clusters of Gaussian mixture model samples with added outliers; the parameters of each component are as follows:

3.1.2 Results

In the pre-clustering, the threshold ε of the mixing coefficient is set to 1. The initial cluster number is 50 and the number of iterations is 50. The process is shown in Figs. 2b-2d, which illustrate how the number of clusters is determined by the EPECM algorithm. Fig. 2b shows that the number of components drops from 50 to 18 after 10 iterations, indicating that the algorithm can quickly remove redundant components at the beginning. Fig. 2d shows that the number of clusters remains at 4 after 50 iterations, in agreement with the settings of the synthetic data, indicating that the algorithm can effectively determine the number of components and the parameters of the TMM. EPECM thus fits the structure of the data, determines the number of clusters, and obtains the parameters of the TMM. Then, DSP-TMM is employed to obtain the final parameters and the category identification of each sample.

Fig. 2 Processes of EPECM

In the second phase, the ECM algorithm for DSP-TMM is used to eliminate the outliers and construct a robust clustering model. Let n_end be 95% of the total data, and let the self-paced parameter α = 1.2. The experimental results of KMeans, GMM, TMM, and DSP-TMM on the synthetic dataset are illustrated in Fig. 3. DSP-TMM has the best performance among these algorithms. The ellipse lines in the figure represent 99% confidence ellipses. Due to the existence of outliers, the cluster centers of the other algorithms show serious offsets, which affects the clustering accuracy. The results show that our optimization method can eliminate the influence of outliers and construct a robust t-distribution mixture model by selecting high-confidence samples from each component.

Fig. 3 Cluster results of KMeans, GMM, TMM, DSP-TMM

3.2 Robust verification in different noise environments

3.2.1 Datasets

To evaluate the robustness of the DSP-TMM algorithm, experiments are conducted on 4 datasets from the UCI machine learning repository: Iris, Seeds, Thyroid and Wine. The relevant information of the data is shown in Tab. 1. To evaluate the anti-noise ability of DSP-TMM, 5% outliers are added into each UCI dataset.

Tab. 1 UCI dataset description

3.2.2 Evaluation method

The experiment uses the mean square error (MSE), the adjusted rand index (ARI)[27] and the Fowlkes-Mallows index (FMI)[28] as evaluation indicators. The MSE is calculated as

MSE = (1/n) Σ_{i=1}^{n} ‖x_i − μ_ij‖²

where x_i is the i-th sample and μ_ij is the center of the cluster j to which x_i belongs; that is, the mean square error is the average of the squared differences between all data points and the center of the cluster to which they belong. ARI calculates the similarity between the predicted labels of the samples and the true labels. FMI is the geometric mean of recall and precision.
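For reference, the three indicators can be computed as follows (standard library implementations; the paper's own equations are not reproduced):

import numpy as np
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

def mse(X, labels, centers):
    # average squared distance between each sample and its assigned cluster center
    return float(np.mean(np.sum((X - centers[labels]) ** 2, axis=1)))

# ari = adjusted_rand_score(y_true, y_pred)     # similarity to the true partition
# fmi = fowlkes_mallows_score(y_true, y_pred)   # geometric mean of precision and recall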
Note that for MSE, smaller values indicate better performance; conversely, for ARI and FMI, larger values indicate better performance.

3.2.3 Results

The experimental results are compared with KMeans[18], GMM[26], TMM[9] and SPTMM[10] to fully evaluate the robustness of DSP-TMM. The cluster analysis is divided into four steps. ① Data preprocessing: feature selection for high-dimensional feature data or dimensionality reduction by PCA, followed by normalization or standardization of the data. ② Parameter initialization: the EPECM algorithm is used, with an adjusted threshold of the mixing coefficient, to obtain the same number of components as the dataset labels together with the initial parameters. ③ Sample identification excluding outliers: the parameters are fed into the second phase of DSP-TMM, and the hyperparameters are set so that the model eliminates the outliers, constructs the TMM and identifies the samples. ④ Performance evaluation: after identifying each sample, the three evaluation indicators MSE, ARI and FMI are used to evaluate the performance of each algorithm.

The experimental results on the clean UCI datasets are shown in Tab. 2. DSP-TMM performs best in terms of ARI and FMI on three of the datasets. DSP-TMM is not sensitive to the initial parameter settings and can find the optimal parameters by itself; its performance is especially good on data with overlapping clusters.

Tab. 2 Cluster results of MSE, ARI, FMI on clean UCI datasets

The experimental results on the noisy UCI datasets are shown in Tab. 3. The results show that DSP-TMM achieves the best MSE, ARI and FMI values on all datasets. Due to the outliers in the data, the other algorithms cannot effectively implement the cluster analysis, and their estimated parameters deviate from the actual values. The performance of EPECM is better than that of the other models (KMeans, GMM, TMM, SPTMM), which indicates that EPECM can address the issue of sensitivity to initial parameters in a noisy environment and find the optimal parameters. Moreover, DSP-TMM can effectively eliminate the influence of outliers and construct the TMM by introducing self-paced learning.

Tab. 3 Cluster results of MSE, ARI, FMI on noisy UCI datasets

3.3 Practical verification on real-world datasets

3.3.1 Datasets

To further verify the utility and effectiveness of the algorithm, two datasets collected from actual tasks are used for cluster analysis. The first dataset is the China Health and Nutrition Survey (CHNS) in the healthcare field, which is used to analyze the association between the extracted features and diabetes through clustering. It has 7 913 samples containing 55-dimensional features that can be divided into non-invasive and invasive features. The second dataset (MNIST) contains handwritten digits. It has 42 000 samples; each image is represented by 28×28 pixels, each holding a grayscale value from 0 to 255.

3.3.2 Evaluation method

The experiment uses the adjusted rand index (ARI)[27] and the Fowlkes-Mallows index (FMI)[28] as evaluation indicators. ARI calculates the similarity between the predicted labels of the samples and the true labels; FMI is the geometric mean of recall and precision. For both indicators, larger values indicate better performance.

3.3.3 Results

In the data preprocessing stage, feature extraction is performed on the CHNS data, and 24-dimensional features are extracted from the 55-dimensional features according to the importance of the features.
For the MNIST data, the PCA algorithm is used to reduce the features to 20 dimensions. Next, we normalize the data and use the clustering models to perform ten cluster analyses. Finally, we analyze the results statistically by calculating the average values and comparing them with those of the other clustering methods. The experimental results on CHNS and MNIST are shown in Tab. 4.

Tab. 4 Cluster results of ARI, FMI on real-world datasets

Three observations can be made from Tab. 4. ① DSP-TMM significantly outperforms the others on both ARI (0.109 2) and FMI (0.612 7) on the CHNS dataset. Because the process of collecting diabetes data is error-prone, it inherently brings noise into the dataset; our method introduces self-paced learning to solve this problem. ② DSP-TMM can effectively improve the clustering quality by introducing robust initialization and discarding outliers. ③ The DSP-TMM algorithm can be applied to actual datasets and has strong utility.
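The evaluation protocol of this subsection can be summarized in the following sketch (cluster_fn stands in for any of the compared models; the scaler choice is an assumption):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

def evaluate(X, y_true, cluster_fn, n_runs=10, n_dims=20, seed=0):
    Z = PCA(n_components=n_dims, random_state=seed).fit_transform(X)
    Z = StandardScaler().fit_transform(Z)          # reduce first, then normalize
    ari, fmi = [], []
    for r in range(n_runs):                        # ten independent clustering runs
        y_pred = cluster_fn(Z, random_state=seed + r)
        ari.append(adjusted_rand_score(y_true, y_pred))
        fmi.append(fowlkes_mallows_score(y_true, y_pred))
    return float(np.mean(ari)), float(np.mean(fmi))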










4 Discussion

In this section, we mainly discuss the advantages and potential limitations of DSP-TMM. Compared with the KMeans, GMM and TMM algorithms, our model achieves outstanding performance when clustering data in a noisy environment by introducing self-paced learning and a robust initialization method. Outliers usually have large loss values, which is the basis on which the combination of self-paced learning and mixture distribution models effectively eliminates outliers. Compared with SPTMM, our model uses the ACS method to develop a new optimization algorithm that focuses on selecting high-confidence samples from each component during the iteration process. This optimization method not only effectively eliminates outliers but also maintains sample diversity in the selection. Moreover, DSP-TMM uses the EPECM initialization algorithm to address the sensitivity to initial parameters.

However, since the number of parameters of a mixture distribution model grows with the sample dimension, our model cannot be applied directly to high-dimensional data. Therefore, it is necessary to perform dimensionality reduction or feature extraction in the preprocessing stage before clustering. In future research, we will focus on extending this algorithm to high-dimensional data. Moreover, we hope to contribute to more clustering tasks.
5 Conclusion

In this paper, a robust cluster analysis method called the diversity self-paced t-distribution mixture model is proposed. The model takes the sample loss and the learning pace as constraints to select a subset from each component for learning in every iteration, which removes noisy values. In addition, DSP-TMM utilizes the entropy penalty expectation conditional maximization algorithm as a pre-clustering step to set the initial parameters. This algorithm adaptively determines the number of clusters by introducing a competition mechanism and fits the data by itself, improving the robustness and stability of the model. Experimental evaluations on a synthetic dataset show that DSP-TMM effectively filters out noisy values to achieve density estimation. In experiments in different noisy environments, our model effectively performs the cluster analysis and performs best among the compared methods. Moreover, the experimental results verify its utility on actual real-world datasets.