Qiwei Zhong, Yunlong Lin, Junyang Zou, Kuangyan Zhu, Qiao Wang, and Lei Hu
(1. School of Information Science and Engineering, Southeast University, Nanjing 210096, China;
2. ZTE Corporation, Nanjing 210012, China)
Abstract: Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, however, clustering algorithms face growing demands on computational time and memory. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and the data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two hundred million edges crawled from Sina Weibo. The results show that the proposed implementation scales well, speeds up clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.
Keywords: spectral clustering; parallel implementation; massive dataset; Hadoop MapReduce; data mining
Clustering is a method of unsupervised learning that is widely used in exploratory data analysis [1]. Recently, the spectral clustering algorithm was shown to be more effective than many traditional algorithms in detecting clusters. Extracting valuable information from an ocean of data is a hot research topic, and applications of spectral clustering range from computer vision and information retrieval to social science and biology [1], [2]. However, spectral clustering cannot be adequately scaled, in terms of computational time and memory use, to deal with massive datasets. For example, a quadratic resource bottleneck occurs when computing pairwise similarities and the number of data instances n (Table 1) is large, and a considerable amount of time is needed to compute the first k (Table 1) eigenvectors of a Laplacian matrix. It is therefore necessary to develop dataset-oriented parallel implementations [2], [3]. In [4], the message passing interface (MPI) is used to parallelize the spectral clustering algorithm. MapReduce has advantages over MPI here: it automatically splits massive data with the Hadoop Distributed File System (HDFS), and its better load balancing and data parallelism improve the scalability and efficiency of machine clusters when a dataset is large. Therefore, a MapReduce framework is more suitable for handling this problem.

▼ Table 1. Notation used in this paper
MapReduce is a programming model and associated implementation for processing large datasets [5]. Users specify the computational rules in the form of a Map function that processes a <key, value> pair to generate a set of intermediate <key, value> pairs, and a Reduce function that merges all intermediate values associated with the same key. Programs written in this way can be executed in parallel on large commodity machine clusters. The underlying runtime system automatically partitions the input data, schedules the program's execution, handles machine failures, and manages intermachine communication. This allows programmers who are inexperienced with parallel computing to easily use the resources of a large machine cluster. The overall flow of a common MapReduce operation is shown in Fig. 1. Both Google and Hadoop provide MapReduce runtimes that are flexible and fault-tolerant.
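As an illustration of the model, the following is a minimal word count in Hadoop Streaming style, written in Python (a sketch of our own, not part of the paper's implementation; Hadoop Streaming lets any executable that reads stdin and writes stdout serve as the Map or Reduce function):

```python
#!/usr/bin/env python
# mapper.py -- emits one tab-separated <word, 1> pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop sorts mapper output by key, so identical
# words arrive in contiguous runs; we sum the counts of each run.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```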
In this paper, we propose a parallel spectral clustering implementation (PSCI) that is based on Hadoop MapReduce and that can be applied to massive datasets. By constructing proper <key, value> pairs, the proposed implementation can be efficiently executed in parallel. Tests on benchmark networks show the effectiveness of PSCI in detecting communities. We also analyze a real social network dataset of about two million vertices and two hundred million edges crawled from Sina Weibo (a popular Twitter-like microblog in China) to show the efficiency and practicability of PSCI on massive datasets.
In section 2, we give a brief overview of the spectral clustering algorithm in order to identify its particular bottlenecks and to analyze the parallelizable parts of the algorithm. In section 3, we propose PSCI based on the Hadoop MapReduce framework. In section 4, we show the results of tests on PSCI and evaluate its clustering quality and scalability (in terms of runtime speedup and scaleup). Concluding remarks are made in section 5.

▲ Figure 1. MapReduce execution overview.
The most common tool in spectral clustering is the graph Laplacian matrix [1], [6]. We assume that $G = (V, E)$ is a weighted undirected graph with vertex set $V = \{v_1, v_2, \dots, v_n\}$ and edge set $E = \{e_1, e_2, \dots, e_m\}$. Each edge between vertices $v_i$ and $v_j$ has a non-negative weight $w_{ij} \geq 0$, and $w_{ij} = w_{ji}$. The adjacency matrix of the graph is

$$W = (w_{ij})_{i,j=1,\dots,n} \qquad (1)$$
where $w_{ij} = 0$ means that $v_i$ and $v_j$ are not connected by an edge. The degree of a vertex $v_i$ can then be defined as

$$d_i = \sum_{j=1}^{n} w_{ij} \qquad (2)$$
where the sum in (2) effectively runs only over the vertices adjacent to $v_i$, because the weight $w_{ij} = 0$ for all other vertices $v_j$. The degree matrix of the graph is

$$D = \operatorname{diag}(d_1, d_2, \dots, d_n) \qquad (3)$$
where the degrees $d_1, d_2, \dots, d_n$ are on the diagonal. Based on the adjacency matrix and the degree matrix, the un-normalized graph Laplacian matrix is given by $L = D - W$, and the normalized graph Laplacian matrix is given by

$$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2} \qquad (4)$$
We now assume that a dataset comprises $n$ instances $x_1, x_2, \dots, x_n$, which can be arbitrary objects. The pairwise similarity between $x_i$ and $x_j$ is given by $s_{ij} = s(x_i, x_j)$ and can be measured by a similarity function that is non-negative and symmetric. The corresponding similarity matrix is given by

$$S = (s_{ij})_{i,j=1,\dots,n} \qquad (5)$$
An example of a similarity function is the Gaussian function

$$s_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \qquad (6)$$
where the parameter $\sigma$ controls the width of the neighborhood. In a network, however, $s_{ij}$ is usually defined as

$$s_{ij} = \begin{cases} 1, & \text{if } (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
If we replace $W$ in (4) with $S$, we obtain the formula used in the spectral clustering algorithm:

$$L_{\mathrm{sym}} = I - D^{-1/2} S D^{-1/2} \qquad (8)$$
Furthermore, the matrix $S$ is often reduced to a sparse matrix by keeping only the significant relationships between instances, which saves computational time. A summary of the spectral clustering algorithm is given below [1].

Algorithm 1. Spectral Clustering Algorithm
Input: the similarity matrix $S \in \mathbb{R}^{n \times n}$, and the number of desired clusters $k$.
Output: $k$ clusters.
Procedure:
1. Construct a similarity graph and let $W$ be its weighted adjacency matrix.
2. Compute the normalized Laplacian matrix $L_{\mathrm{sym}}$.
3. Compute the first $k$ eigenvectors $u_1, u_2, \dots, u_k$ of $L_{\mathrm{sym}}$.
4. Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, u_2, \dots, u_k$ as columns.
5. Form the matrix $T \in \mathbb{R}^{n \times k}$ from $U$ by normalizing the rows to norm 1.
6. For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $T$.
7. Cluster the points $(y_i)_{i=1,\dots,n}$ with the k-means algorithm into clusters $C_1, C_2, \dots, C_k$.
8. Output clusters $A_1, A_2, \dots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.
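For concreteness, the following single-machine sketch of Algorithm 1 uses NumPy and SciPy (our own illustration, not the paper's distributed code; the `kmeans` argument stands for any k-means routine, such as the one sketched after Algorithm 2):

```python
# Single-machine sketch of Algorithm 1 (illustrative only; PSCI
# distributes every step of this pipeline across machines).
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

def spectral_clustering(S, k, kmeans):
    """S: sparse symmetric similarity matrix; k: number of clusters;
    kmeans: a routine mapping (points, k) to cluster labels."""
    d = np.asarray(S.sum(axis=1)).ravel()              # degrees (Eq. 2)
    D_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    W_norm = D_inv_sqrt @ S @ D_inv_sqrt               # D^{-1/2} S D^{-1/2}
    # The first k eigenvectors of L_sym = I - W_norm (Eq. 8) are the
    # eigenvectors of the k largest ("LA") eigenvalues of W_norm.
    _, U = eigsh(W_norm, k=k, which="LA")
    # Step 5: normalize the rows of U to unit norm, then cluster.
    T = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(T, k)
```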
Generally, it is useful to change the representation of the abstract data points $x_i$ to points $y_i \in \mathbb{R}^k$ because of the properties of the graph Laplacian matrix. This change of representation enhances cluster separation, so that clusters can be trivially detected in the new representation. In particular, the simple k-means clustering algorithm detects the clusters without any difficulty.

Algorithm 2. k-Means Algorithm
Input: the dataset comprising the instances, and the number of desired clusters $k$.
Output: $k$ clusters.
Procedure:
1. $k$ initial group centroids (the so-called "means") are randomly selected from the dataset.
2. $k$ clusters are created by associating every object with the nearest centroid.
3. The positions of the $k$ centroids are recalculated once all objects have been assigned.
4. Steps 2 and 3 are repeated until the centroids no longer move (i.e., convergence is reached).
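A direct NumPy transcription of these four steps might look as follows (a sketch under our own conventions; the seed, tolerance, and iteration cap are arbitrary choices, and an empty cluster simply keeps its previous centroid):

```python
# NumPy sketch of Algorithm 2 (Lloyd's iterations).
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids at random from the dataset.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels
```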
There are three intensive computing processes in the spectral clustering algorithm: construction of the Laplacian matrix, computation of the first k eigenvectors, and calculation of distances in k-means. Therefore, after preprocessing the original dataset, we partition the similarity computation and the sparsification by data-point index in order to construct the Laplacian matrix. When calculating the eigenvectors, we store the Laplacian matrix on HDFS and launch distributed Lanczos operations to obtain the first k eigenvectors of the Laplacian matrix. Finally, we run the parallel k-means clustering algorithm on the transposed eigenvector matrix to obtain the final clustering results. The entire process of our parallel implementation is shown in Fig. 2.
The Laplacian matrix is constructed in two steps: computation of the similarity matrix and sparsification of the similarity matrix. Fortunately, neither the pairwise similarities of one object nor its t-nearest neighbors depend on the computations for other objects. Therefore, the computations for different objects can be done in parallel via dataset partitioning (Fig. 3).
3.1.1 Mapper: Calculation of Similarity
Input: <key, value> pair, where key is the unique index and features of one object, and value is the index and corresponding features of each other object whose index is greater.
Output: <key′, value′> pair, where key′ is the index of one object, and value′ is the index of each other object and its similarity with the object in key′.
3.1.2 Reducer: Construction of the Sparse Matrix
Input: <key, value> pair, where key is the index of one object, and value is the index of each other object and its similarity with the object in key.
Output: <key′, value′> pair, where key′ is the index of one object, and value′ is the index and corresponding similarity of each other object among the t-nearest neighbors of the object in key′.
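In Python, the two contracts above might be sketched as follows (our own illustration; the record layout, the Gaussian width SIGMA, and the neighbor count T are assumptions rather than values from the paper):

```python
# Sketch of the similarity Mapper and sparsification Reducer.
import numpy as np

SIGMA = 1.0   # Gaussian width of Eq. (6); an assumed value
T = 20        # keep the t nearest neighbors; an assumed value

def similarity_mapper(pairs):
    """pairs: iterable of ((i, x_i), [(j, x_j), ...]) with j > i.
    Emits (i, (j, s_ij)) and (j, (i, s_ij)), since S is symmetric."""
    for (i, xi), others in pairs:
        for j, xj in others:
            s = np.exp(-np.sum((xi - xj) ** 2) / (2 * SIGMA ** 2))
            yield i, (j, s)
            yield j, (i, s)

def sparsify_reducer(key, values):
    """key: object index; values: all (j, s_ij) pairs for that object.
    Keeps only the t most similar neighbors (sparsification)."""
    top = sorted(values, key=lambda v: v[1], reverse=True)[:T]
    return key, top
```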
After we have calculated and stored the Laplacian matrix, we must parallelize the eigensolver. The Lanczos algorithm is an iterative algorithm. It is an adaptation of the power method and is used to find eigenvalues and eigenvectors of a square matrix, or the singular value decomposition of a rectangular matrix. It is particularly useful for finding decompositions of very large sparse matrices.

Algorithm 3. Lanczos Algorithm
Input: the Laplacian matrix $L_{\mathrm{sym}}$.
Output: a tridiagonal matrix $T$ and an orthonormal matrix $Q$.
Procedure:
1. Initialization: $q_0 \leftarrow 0$, $\beta_1 \leftarrow 0$, $q_1 \leftarrow$ a random vector normalized to norm 1.
2. Iteration: for $k = 1, 2, \dots, m$:
   $w_k = L_{\mathrm{sym}} q_k - \beta_k q_{k-1}$
   $\alpha_k = (w_k, q_k)$
   $w_k = w_k - \alpha_k q_k$
   $\beta_{k+1} = \lVert w_k \rVert$
   $q_{k+1} = w_k / \beta_{k+1}$
3. Return $Q_{n \times m} = (q_1, q_2, \dots, q_m)$ and

$$T_{m \times m} = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \beta_3 & \alpha_3 & \ddots & \\ & & \ddots & \ddots & \beta_m \\ & & & \beta_m & \alpha_m \end{pmatrix}$$
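A serial NumPy sketch of this iteration follows (our own illustration; it omits the reorthogonalization that production Lanczos solvers add, and in PSCI the product `L @ q` is the distributed MapReduce step described in the next subsection):

```python
# Serial sketch of Algorithm 3 (basic Lanczos, no reorthogonalization).
import numpy as np

def lanczos(L, m, seed=0):
    n = L.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)                 # q_1: random unit vector
    q_prev, beta = np.zeros(n), 0.0        # q_0 = 0, beta_1 = 0
    Q, alphas, betas = [q], [], []
    for _ in range(m):
        w = L @ q - beta * q_prev          # the expensive matvec
        alpha = w @ q
        w -= alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        if beta < 1e-12:                   # invariant subspace found
            break
        betas.append(beta)
        q_prev, q = q, w / beta
        Q.append(q)
    T = (np.diag(alphas)
         + np.diag(betas[:len(alphas) - 1], 1)
         + np.diag(betas[:len(alphas) - 1], -1))
    return T, np.column_stack(Q[:len(alphas)])
```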

▲ Figure 2. Parallel spectral clustering implementation process.

▲ Figure 3. Laplacian matrix construction.
The Lanczos iteration converts the Laplacian matrix $L_{n \times n}$ into a real symmetric tridiagonal matrix $T_{m \times m}$ ($m < n$), whose eigenvalues and eigenvectors can easily be found by methods such as QR iteration. If $(\lambda_k, v_k)$ is an <eigenvalue, eigenvector> pair of $T$, then $(\lambda_k, Q_{n \times m} v_k)$ is the corresponding <eigenvalue, eigenvector> pair of $L$. In each iteration of the Lanczos algorithm, the matrix-vector multiplication is the most intensive calculation, and it can be done in parallel via matrix partitioning (Fig. 4).
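Before turning to the distributed matrix-vector step, here is how the approximate eigenpairs can be recovered from $T$ and $Q$ (a sketch continuing the `lanczos` function above; `eigh_tridiagonal` is SciPy's tridiagonal eigensolver, and `L_sym`, `k`, and the choice `m=64` are assumed):

```python
# Recover approximate eigenpairs of L_sym from the Lanczos output.
import numpy as np
from scipy.linalg import eigh_tridiagonal

T, Q = lanczos(L_sym, m=64)                  # m chosen arbitrarily
evals, V = eigh_tridiagonal(np.diag(T), np.diag(T, k=1))
ritz_vectors = Q @ V                         # approximate eigenvectors of L_sym
U = ritz_vectors[:, :k]                      # eigh_tridiagonal sorts ascending,
                                             # so these are the first k eigenvectors
```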
3.2.1 Mapper: Matrix-Vector Multiplication
Input: the global vector V, and a <key, value> pair, where key is the line index of the matrix and value is the matrix content corresponding to the index in key.
Output: <key′, value′> pair, where key′ is NULL, and value′ is the line index and the multiplication result between value and the vector V.
3.2.2 Reducer: Construction of the Multiplication Result
Input: <key, value> pair, where key is NULL and value is all the partial results.
Output: <key′, value′> pair, where key′ is NULL, and value′ is the final result vector of the matrix-vector multiplication.
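A compact sketch of these two steps (our own illustration; row-wise partitioning is assumed, with the vector v broadcast to every mapper):

```python
# Sketch of the distributed matrix-vector product: each Map call
# handles one sparse row; the Reduce call assembles the full vector.
import numpy as np

def matvec_mapper(row_index, row_entries, v):
    """row_entries: list of (col, value) pairs for one sparse row;
    v: the global vector, broadcast to every mapper."""
    partial = sum(value * v[col] for col, value in row_entries)
    yield None, (row_index, partial)      # key' is NULL, as above

def matvec_reducer(_, partials, n):
    """Collects all (row_index, partial) pairs into one result vector
    of length n (the matrix dimension)."""
    result = np.zeros(n)
    for row_index, partial in partials:
        result[row_index] = partial
    return result
```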
Computations of the distances between one object and the centers are not related to computations of the distances between other objects and the corresponding centers [7]. Therefore, the computation of distances between different objects and the centers can be done in parallel. The new centers, which will be used in the next iteration, are updated in each iteration. Therefore, the iterative procedures must be executed serially (Fig. 5).
3.3.1 Mapper: Associate Every Instance with the Nearest Center
Input: the global variable centers, and a <key, value> pair, where key is the index of one instance and value is the feature (i.e., the dimension values) corresponding to the instance in key.
Output: <key′, value′> pair, where key′ is the index of the nearest center of the instance, and value′ is the index and feature of the instance.
3.3.2 Combiner: Calculation of the Sum and Quadratic Sum of the Values of Each Dimension of the Instances, and Their Number, Assigned to the Same Center on the Local Disk
Input: <key, value> pair, where key is the index of the center, and value is the index and feature of each instance assigned to the same center.
Output: <key′, value′> pair, where key′ is the index of the center, and value′ is the sum and quadratic sum of the values of each dimension of the instances with the same center, together with their number.
3.3.3 Reducer: Calculation of the New Centers and the Iteration Condition
Input: <key, value> pair, where key is the index of the center, and value comprises the sum and quadratic sum of the values in each dimension of the instances with the same center from all hosts, and the number of those instances.
Output: <key′, value′> pair, where key′ is the index of the new center, and value′ is the features representing the new center.
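One k-means iteration under these contracts might be sketched as follows (our own illustration; the quadratic sums the Combiner carries would let the Reducer also evaluate a convergence criterion, which is omitted here):

```python
# Sketch of the Map/Combine/Reduce steps of one k-means iteration.
import numpy as np

def km_mapper(index, x, centers):
    """Assign instance x to the nearest of the global centers."""
    nearest = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
    yield nearest, (index, x)

def km_combiner(center_index, assigned):
    """Local pre-aggregation: per-dimension sum, quadratic sum, and
    the count of instances assigned to this center on this host."""
    xs = np.array([x for _, x in assigned])
    yield center_index, (xs.sum(axis=0), (xs ** 2).sum(axis=0), len(xs))

def km_reducer(center_index, partials):
    """Merge the partial sums from all hosts into the new center."""
    total = sum(s for s, _, _ in partials)
    count = sum(n for _, _, n in partials)
    return center_index, total / count
```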
We designed our experiments to validate the scalability and clustering quality of PSCI. Our experiments are based on both computer-generated benchmark networks and a real social network dataset of 1,628,853 vertices and 176,620,423 edges crawled beforehand from Sina Weibo. We ran the experiments on an IBM Blade cluster with 10 compute nodes. All the nodes were identical, and each was configured with a CPU faster than 2 GHz and more than 8 GB of memory. Hadoop version 0.20.0 and Java 1.6.0 were used as the MapReduce system for all experiments.

▲ Figure 4. Matrix-vector multiplication.

▲ Figure 5. Parallel k-means algorithm.
To test the performance of the parallel implementation, we use Modularity as an evaluation function. Modularity is a property of a network and a specific proposed division of that network into communities [8], [9]. It measures whether the division is good in the sense that there are many more edges within communities and only a few between them. The higher the Modularity score, the better the clustering quality. Generally, this score falls between 0.3 and 0.7 in social networks [9].
If $G = (V, E)$ is a graph with adjacency matrix $A$ defined by

$$A_{uv} = \begin{cases} 1, & \text{if vertices } u \text{ and } v \text{ are connected} \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$
then we assume that the vertices are divided into communities such that vertex $u$ belongs to community $c_u$. The Modularity $Q$ is

$$Q = \frac{1}{2m} \sum_{u,v} \left( A_{uv} - \frac{d_u d_v}{2m} \right) \delta(c_u, c_v) \qquad (10)$$
Also,

$$Q = \sum_{i} \left( e_{ii} - a_i^2 \right), \qquad a_i = \sum_{j} e_{ij} \qquad (11)$$
where $e_{ij}$ ($i \neq j$) is the fraction of edges that join vertices in community $i$ to vertices in community $j$, and $e_{ii}$ is the fraction of edges within community $i$. The $\delta$ function $\delta(i, j)$ is 1 if $i = j$ and 0 otherwise; $d_u$ is the degree of vertex $u$; $a_i$ is the fraction of ends of edges that are attached to vertices in community $i$; and $m$ is the number of edges in the graph.
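As a worked example, Modularity can be computed from an edge list and a community assignment as follows (our own NumPy sketch of Eq. (11); the function and argument names are illustrative):

```python
# Sketch of Eq. (11): build the community mixing matrix e from the
# edge list, then Q = sum_i (e_ii - a_i^2).
import numpy as np

def modularity(edges, community, k):
    """edges: list of (u, v) pairs; community: dict vertex -> [0, k)."""
    e = np.zeros((k, k))
    for u, v in edges:
        cu, cv = community[u], community[v]
        # each edge contributes one end in each direction,
        # so e stays symmetric and sums to 1 after normalization
        e[cu, cv] += 1
        e[cv, cu] += 1
    e /= 2 * len(edges)
    a = e.sum(axis=1)            # a_i = sum_j e_ij (fraction of edge ends)
    return np.trace(e) - np.sum(a ** 2)
```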
Benchmark networks are standard measures for community-detection algorithms [10]. Here, we generate the benchmark networks with parameters <k> = 120, β = 1, γ = 2, μ = 0.2, k_max = 1000, s_min = 500, and s_max = 2000. First, we test the performance of PSCI with network sizes ranging from 20,000 to 500,000. The results are shown in Fig. 6 and Fig. 7. Modularity is high for the different network sizes and falls between 0.4 and 0.7, which supports the results in [9]. Fig. 7 shows that PSCI has very good scalability: the computational time grows almost linearly once the network size exceeds 100,000. This means that PSCI scales well and handles massive datasets efficiently.
When determining the speedup with different numbers of machines on a benchmark network of 500,000 instances (Fig. 8), we see that, in the beginning, the implementation achieves nearly linear speedup as the number of machines increases. Then there is a slowdown caused by the overhead of framework startup and the required intermachine communication.

▲ Figure 6. Modularity of clustering results for different network sizes.

▲ Figure 7. Runtimes of PSCI for different network sizes.

▲ Figure 8. Runtimes of PSCI with an increasing number of machines.
Here, we analyze the quality and speedup on a real social network of 1,628,853 vertices and 176,620,423 edges crawled from Sina Weibo. Each edge represents the following relationship between one vertex and another. Table 2 shows the Modularity and runtimes for processing the real social network on 10 machines. The visual clustering result using the adjacency matrix plot is shown in Fig. 9. The results indicate that our parallel implementation speeds up the clustering process without sacrificing quality.

▼ Table 2. Modularity and runtimes on the real network

▲ Figure 9. A visual clustering result using the adjacency matrix plot of the real network. The horizontal and vertical coordinates represent the vertices, and the clearly visible diagonal indicates the good quality of the clustering.
In addition, the runtime reported in [4] for processing Picasa (a dataset of 637,137 images) on 16 machines is nearly 15,000 seconds, using a spectral clustering algorithm parallelized with MPI. By comparison, our implementation processes data faster in parallel because both the dataset and the spectral clustering computation are distributed.
The spectral clustering algorithm is one of the most important modern clustering algorithms and has been shown to be more effective in community detection than many traditional algorithms. However, the growing amount of data in applications makes spectral clustering of massive datasets challenging. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. In our PSCI, both the computation and the data storage are distributed, and this solves the problems of most existing algorithms mentioned at the outset of this paper. By empirically analyzing both benchmark networks and a real massive social network dataset, we show that the proposed implementation achieves high Modularity across different network sizes, that its computational time grows almost linearly, and that its speedup is nearly linear as the number of machines increases within a certain range. The implementation scales well, speeds up clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.