999精品在线视频,手机成人午夜在线视频,久久不卡国产精品无码,中日无码在线观看,成人av手机在线观看,日韩精品亚洲一区中文字幕,亚洲av无码人妻,四虎国产在线观看 ?

Gaussian mixture model clustering with completed likelihood minimum message length criterion

2013-01-08 08:39:04ZengHongLuWeiSongAiguo

Zeng HongLu WeiSong Aiguo

(1 School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China)(2 College of Engineering, Nanjing Agricultural University, Nanjing 210031, China)

The Gaussian mixture model (GMM) is commonly used as a basis for cluster analysis[1-2]. In general, the GMM-based clustering involves two problems. One is the estimation of parameters for the mixture models. The other is the model order selection for determining the number of components. The expectation-maximization (EM) algorithm is often used to estimate the parameters of the mixture model which fits the observed data. Popular model selection criteria in the literature include the Bayesian information criterion (BIC), Akaike’s information criterion (AIC), the integrated likelihood criterion (ILC), etc.

However, most previous studies generally assume the Gaussian components for the observed data in the mixture model. If the true model is not in the family of the assumed ones, the BIC criterion tends to overestimate the correct model size regardless of the separation of the components. In the meantime, because the EM algorithm is a local method, it is prone to falling into poor local optima in such a case, leading to meaningless estimation. In order to approximate such a distribution more accurately, the feature weighted GMM, which explicitly takes the non-Gaussian distribution into account, is adopted in Refs.[3-8]. Nevertheless, the approaches in Refs.[3-8] assume that the data features are independent, which is often not the case for real applications. Based on the minimum message length (MML) criterion, Ref.[9] proposed an improved EM algorithm that can effectively avoid poor local optima. But we find that it still tends to select much more Gaussian components than necessary for fitting the data with uniform distribution, giving obscure evidence for the clustering structure of data.

We propose a novel method to address the model selection and parameter estimation problems in the GMM-based clustering method when the true data distribution is against the assumed one. In particular, we derive an improved model selection criterion for mixture models with an explicit objective of clustering. Furthermore, with the proposed criterion as the cost function, an improved EM algorithm is developed for estimating parameters. Ultimately, the proposed method is not only able to rectify the over-fitting tendency of some representative model selection criteria, but also able to avoid poor local optima of the EM algorithm.

1 Completed Likelihood of the Gaussian Mixture Model

Suppose that aD-dimensional sample follows aK-component mixture distribution, then the probability density function ofycan be written as

(1)

(2)

logp(Y|Θ)+logp(Z|Y,Θ)=

(3)

wherepnkis the conditional probability ofynbelonging to thek-th component and can be computed as

(4)

(5)

where

(6)

2 Clustering with Completed Likelihood Minimum Message Length(CL-MML) Criterion

2.1 Completed likelihood minimum message length cri-terion

(7)

(8)

(9)

2.2 Estimation of GMM parameters

For the GMM, each component follows the Gaussian distribution, i.e.,p(y|θk)=G(y|μk,Σk), whereμkandΣkare the mean and the covariance matrix of thek-th Gaussian components. For a fixed model orderK, we estimate the GMM parametersΘby an improved EM algorithm, with CL-MML in Eq.(9) as the cost function. The proposed EM algorithm alternatively applies the following two steps in thet-th iteration until convergence:

E-step: Compute the conditional expectation:

(10)

M-step: Update the parameters of the GMM by

(11)

(12)

(13)

3 Experiments

We present experimental results to illustrate the effectiveness of CL-MML for GMM-based clustering (denoted as GMM+CL-MML), compared to that of BIC (denoted as GMM+BIC), MML (denoted as GMM+MML), as well as the method utilizing the feature-weighted GMM and the integrated likelihood criterion (FWGMM+ILC) for clustering[3].

3.1 Synthetic data

We consider a synthetic 2D data set where data from each cluster follow the uniform random distribution:

ur(y1,y2)=

wherer={r1,r2,r3,r4} are the parameters of the distribution. 1000 data points are generated using a 5-component uniform mixture model. Its parameters are as follows:

w1=0.1,w2=w4=w5=0.2,w3=0.3

r1={-1.89, 4.07, 4.89, 7.94}

r2={1.11, 5.11, 2.47, 3.53}

r3={5.17, 6.53, 2.77, 5.77}

r4={4.31, 6.49, 6.29, 6.71}

r5={5.58, 8.42,-0.77, 2.23}

The Gaussian components are adopted to fit such a uniform mixture data set, for which the true distribution models are very different from the assumed ones. The models with the number of componentsKvarying from 1 toKmax, a number that is considered to be safely larger than the true number (i.e., 5), are evaluated.Kmaxis set to be 30 in this case. We evaluate these methods by the accuracy in estimating the model order and structure. Tab.1 illustrates the number of times that each order is selected over 50 trials. Fig.1 shows typical clustering results by these four methods.

It can be observed that for such a data set, the GMM+BIC approach not only fails to yield a good estimation of model order (see Tab.1), but also leads to a meaningless mixture model by the standard EM (see Fig.1(a)). Although the MML criterion generates a GMM which fits the data well, it suffers from severe over-fitting as shown in Fig.1(b) and Tab.1. Since the features are assumed to be independent in FWGMM, it also tends to select more components in order to approximate the distribution of data accurately (see Fig.1(c) and Tab.1).

Tab.1 Number of times for selected model orders over 50trials on synthetic data

In contrast, due to the introduction of an extra penalty to the MML criterion, the proposed CL-MML criterion-based GMM clustering favors much fewer but more “powerful” components which successfully detect the clusters. The clustering result in a typical successful trial of CL-MML is shown in Fig.1(d).

Fig.1 Typical clustering results of different methods on the synthetic data.(a) GMM+BIC;(b) GMM+MML;(c) FWGMM+ILC;(d) GMM+CL-MML

3.2 Real data

We also measure performance on four real-world data sets from the UCI repository. The number of classes, the number of samples and the dimensionality of each data set are summarized in Tab.2. For each data set, we randomly split the data 50 times into training and test sets. Training sets are created from 50% of the overall data points. We do not use any label in the training stage.Kmaxis still set to be 30. After model learning, we label each component by majority vote using the class labels provided for the test data, and we measure the test set classification accuracy as the matching degree between such obtained labels and the original true labels. The means and the standard deviations of the classification accuracy, as well as the number of components for each data set, over 50 trials are summarized in Tab.2. The best results are marked in bold.

Tab.2 Comparison of different clustering approaches on real data sets

Several trends are apparent. First, the numbers of components determined by the proposed method are generally less than those by the compared counterparts. This may be due to the reason that the distribution of a real data set often does not strictly follow the Gaussian mixture model, and most GMM-based clustering approaches tend to generate more components than necessary in order to better fit the data. However, it is found that the CL-MML can rectify the over-fitting tendency of the compared methods under such circumstances. This can be explained by the reason that it takes the separation among components into account. Secondly, the proposed method yields the most accurate results among all the approaches on these four data sets. This justifies that the proposed approach can estimate the GMM parameters more properly than the compared ones.

4 Conclusion

In this paper, by taking the capability of the candidate GMM to provide a relevant partition to the data into account, an improved GMM-based clustering approach is developed for the difficult scenario where the true distribution of data is against the assumed GMM. The experimental results show that the proposed method is not only able to rectify the over-fitting tendency of the compared methods for performing the model selection, but also able to obtain higher clustering accuracy compared to the existing methods.

[1]Zeng H, Cheung Y M. Feature selection and kernel learning for local learning based clustering [J].IEEETransactionsonPatternAnalysisandMachineIntelligence, 2011,33(8):1532-1547.

[2]Jain A K. Data clustering: 50 years beyondK-means [J].PatternRecognitionLetters, 2010,31(8):651-666.

[3]Bouguila N, Almakadmeh K, Boutemedjet S. A finite mixture model for simultaneous high-dimensional clustering, localized feature selection and outlier rejection [J].ExpertSystemswithApplications, 2012,39(7): 6641-6656.

[4]Law M H C, Figueiredo M A T, Jain A K. Simultaneous feature selection and clustering using mixture models [J].IEEETransactionsonPatternAnalysisandMachineIntelligence, 2004,26(9):1154-1166.

[5]Markley S C, Miller D J. Joint parsimonious modeling and model order selection for multivariate Gaussian mixtures [J].IEEEJournalofSelectedTopicsinSignalProcessing, 2010,4(3):548-559.

[6]Li Y, Dong M, Hua J. Localized feature selection for clustering [J].PatternRecognitionLetters, 2008,29(1):10-18.

[7]Allili M S, Ziou D, Bouguila N, et al. Image and video segmentation by combining unsupervised generalized Gaussian mixture modeling and feature selection [J].IEEETransactionsonCircuitsandSystemsforVideoTechnology, 2010,20(10):1373-1377.

[8]Fan W, Bouguila N, Ziou D, Unsupervised hybrid feature extraction selection for high-dimensional non-Gaussian data clustering with variational inference [J].IEEETransactionsonKnowledgeandDataEngineering,2012, in press.

[9]Figueiredo M A F, Jain A K. Unsupervised learning of finite mixture models [J].IEEETransactionsonPatternAnalysisandMachineIntelligence, 2002,24(3): 381-396.

[10]Wallace C S, Dowe D L. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions [J].StatisticsandComputing, 2000,10(1): 73-83.


登錄APP查看全文

主站蜘蛛池模板: 久久这里只精品国产99热8| 国产精品偷伦视频免费观看国产| 欧美日韩激情| 日本人妻丰满熟妇区| 欧洲日本亚洲中文字幕| 中文天堂在线视频| av一区二区无码在线| 欧美日韩午夜| 久操线在视频在线观看| 国产高清毛片| 色噜噜在线观看| 国产理论一区| 亚洲天堂啪啪| 免费毛片a| 四虎国产精品永久一区| 免费国产小视频在线观看| 少妇精品网站| 98超碰在线观看| 国产超薄肉色丝袜网站| 看看一级毛片| 久久99精品久久久大学生| 米奇精品一区二区三区| 国产精品免费福利久久播放| 亚洲侵犯无码网址在线观看| 亚洲国产理论片在线播放| 国产va在线观看免费| h网站在线播放| 国产精品手机在线播放| 在线综合亚洲欧美网站| 久久精品66| 国内精品手机在线观看视频| 欧美精品H在线播放| 91口爆吞精国产对白第三集| 91成人在线免费视频| 成人午夜视频在线| 亚洲人成网站观看在线观看| 免费国产小视频在线观看| av性天堂网| 亚洲色欲色欲www在线观看| 亚洲日韩精品欧美中文字幕| 91偷拍一区| 思思99热精品在线| 69视频国产| 小说 亚洲 无码 精品| 国产性生大片免费观看性欧美| 日韩精品视频久久| 成年看免费观看视频拍拍| 久久这里只有精品2| 亚洲精品天堂自在久久77| 亚洲无码精彩视频在线观看| 亚洲一区毛片| 激情无码字幕综合| 91在线精品免费免费播放| 男女性色大片免费网站| 久久成人免费| 亚洲AV成人一区二区三区AV| 欧美人在线一区二区三区| 青草视频久久| 国产农村1级毛片| 亚洲一区二区三区麻豆| 91无码网站| 国产微拍精品| 欧美成人区| 小说 亚洲 无码 精品| 极品国产在线| 免费无码又爽又黄又刺激网站 | 国产玖玖视频| 波多野结衣的av一区二区三区| 国产精品 欧美激情 在线播放| 97视频在线精品国自产拍| 国产久操视频| 91国内在线观看| 狠狠v日韩v欧美v| 欧美色香蕉| 精品色综合| 精品剧情v国产在线观看| 精品三级在线| 亚洲日本中文综合在线| 久久影院一区二区h| 日韩国产亚洲一区二区在线观看| 中文成人在线视频| 国产一级毛片yw|