
A Hadoop Performance Prediction Model Based on Random Forest

ZTE Communications, 2013, No. 2

Zhendong Bei, Zhibin Yu, Huiling Zhang, Chengzhong Xu 2, Shenzhong Feng, Zhenjiang Dong, and Hengsheng Zhang

(1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;

2. Wayne State University, Detroit, Michigan 48202, USA;

3. Cloud Computing and IT Institute of ZTE Corporation, Nanjing 210012, China)

Abstract MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tuned. This is not only time-consuming but also error-prone. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, is used to predict the performance of a Hadoop system according to the system's configuration parameters. RFMS is created from 2000 distinct fine-grained performance observations with different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that the prediction accuracy of RFMS reaches 95% on average and up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.

Keywords big data; cloud computing; MapReduce; Hadoop; random forest; micro-benchmark

1 Introduction

The MapReduce programming model is widely used in big-data applications because it is simple to program and can handle large data sets. A popular open-source implementation of MapReduce is Apache Hadoop, which has been used for web indexing [1], machine learning [2], log file analysis [3], financial analysis [4], and bioinformatics research [5].

With Hadoop, a programmer needs to manually tune up to 190 parameters to achieve high system performance. Without in-depth knowledge of the Hadoop system, the programmer may find this task tedious and error-prone, and poor settings may seriously degrade system performance. This issue has been confirmed by many researchers [6]-[9].

It is therefore desirable to automatically tune the configuration parameters. To this end, a performance prediction model based on historical observation is required. The key to improving performance is to use a highly accurate model with low runtime overhead. Many researchers have tried to construct such a model. In [10], a set of cost-based mathematical functions is used to estimate the fine-grained run time of phases within map and reduce tasks when a job is executed. In [6], a coarse-grained SVM regression model is used to estimate the completion time of jobs that belong to a cluster. This estimation depends on the allocation of resources, parameter settings, and the size of input data.

However, these models are not accurate enough because they assume cluster-node homogeneity and oversimplify the workflow. For example, the local and remote CPU and I/O costs during each phase of executing a MapReduce job differ according to the Hadoop parameter settings, even on homogeneous nodes in a cluster. If we define operational cost as microseconds per megabyte, variations in parameter settings give rise to variations in operational cost (Fig. 1). Also, Hadoop involves complicated multiphase processing. As well as containing map and reduce phases, a MapReduce workflow also contains fine-grained phases such as read, collect, spill, merge, shuffle, sort, and write. Each phase performs finer-grained operations with different costs. For example, a spill phase involves combining, compression, and writing. All the above factors may be intertwined and contend for underlying computing resources, such as CPU and I/O. In such a context, an overly simplistic model is bound to be inaccurate.

We propose a model to predict the performance of a Hadoop system. This model is based on fine-grained operations and uses a random forest algorithm. Random forest is a recently developed non-parametric regression model. Unlike traditional regression models, random forest combines tree predictions so that each tree depends on the values of a random vector sampled independently. This random vector has the same distribution for all trees in the forest [11]. We use random forest for two reasons: 1) it does not assume linear relationships between parameters (in our case, the configuration parameters), and 2) it can deal with a large number of configuration parameters, even when there are complex interactions.
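To make this concrete, the following sketch (our illustration, not the paper's code; it assumes scikit-learn, and the CSV file and column names are hypothetical placeholders) shows how a random-forest regressor can be fitted to map configuration parameters to an observed phase time:

# Illustrative sketch only: fit a random-forest regressor that maps Hadoop
# configuration parameters to a measured phase time. The CSV file and the
# column names are hypothetical placeholders, not artifacts from the paper.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

obs = pd.read_csv("hadoop_observations.csv")       # one row per profiled run
features = ["io_sort_mb", "io_sort_factor", "reduce_tasks",
            "shuffle_input_buffer_percent", "compress_map_output"]
X, y = obs[features], obs["shuffle_phase_time_ms"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)                            # no linearity assumption
print("R^2 on held-out runs:", rf.score(X_test, y_test))

Because each tree is grown on a bootstrap sample and splits on random feature subsets, the ensemble can capture non-linear interactions among parameters without manual feature engineering.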

▲Figure 1. Cost of representative operations with 4325 random parameter settings.

In our proposal, a lightweight performance profiler captures the time taken to execute tasks and the size of the data processed in each phase. We construct a fine-grained model for predicting the performance of a Hadoop system. This model is used to accurately estimate phase-level performance under various workloads, without assuming that cluster nodes are homogeneous. We evaluate RFMS under different workloads generated by the Hadoop Micro-benchmark suite. The results show that RFMS provides prediction accuracy of 95% on average and up to 99%.

In section 2, we describe related work. In section 3, we introduce our approach based on random forest. In section 4, we describe our experimental methodology. In section 5, we provide experimental results and describe some applications of our approach. Section 6 concludes the paper.

2 Related Work

2.1 Analyzing MapReduce Performance

Analyzing the performance of a MapReduce distributed system is challenging because a system potentially comprises several thousand programs running on thousands of machines. Low-level performance details are hidden from users by a high-level dataflow model [9]. Several approaches have been taken to understand user-defined workload behavior in a MapReduce system [12]-[14]. With these approaches, information collected from previous job execution logs is used to identify performance bottlenecks. For example, Hadoop job history files are often used to analyze performance bottlenecks in a Hadoop system.

Log files are often not exposed to developers, and this means that system performance is often not optimal. In [9], logs are leveraged through automatic log analysis to improve performance. Such an approach can be used to show the dataflow breakdown of the map/reduce phases for various jobs, but human involvement is still needed for hotspot detection.

Performance analysis based on dataflow appears to be suitable for Hadoop; however, the focus is on monitoring the dataflow process, and the user needs to identify possible bottlenecks in a Hadoop cluster. Furthermore, a dataflow approach cannot be used to evaluate performance when configuration parameters are randomly set. Our model is based on workflow analysis and uses a random forest to predict the running time of each phase. This is a precise model that can be used to guide the setting of configuration parameters.

2.2 Automatic Configuration

Various approaches have been taken to automatically configure distributed data processing systems [15]-[18]. The pay-per-use utility model of cloud computing creates new opportunities to deploy the MapReduce framework. However, choosing which configuration parameter settings will result in high performance is complex [19]. Automatic approaches to setting configuration parameters in Hadoop systems have therefore been the focus of attention recently.

The first model for predicting the performance of a Hadoop system with automatically set parameters is described in [10]. The model describes the fine-grained dataflow and cost of phases within map and reduce tasks [10]. However, it assumes that the CPU and I/O costs in each phase of executing a MapReduce job are the same for all nodes in a cluster. In practice, this assumption is not true (Fig. 1).

AROMA [6] uses a performance model based on a support vector machine (SVM) to integrate aspects of resource provisioning and auto-configuration for Hadoop jobs. Based on allocated resources, configuration parameters, and the size of input data, AROMA can estimate the completion time of jobs belonging to a cluster. However, unlike fine-grained cost estimation models, it cannot quantitatively analyze the data processing of each phase of a distributed system dataflow.

3 Performance Model Based on Random Forest

In this section, we describe how to capture the execution features of a job. We then use these features to construct a Hadoop performance model based on a random forest.

3.1 Job Characterization

RFMS uses dynamic tools to collect run-time monitoring information without modifying MapReduce workloads on Hadoop. One such tool is BTrace, a safe, dynamic tracing tool that runs on the Java platform and captures the execution features of a MapReduce job [20].

The execution of a MapReduce job can be broken into map and reduce stages. The map stage can be further divided into reading, map processing, buffer data collecting, spilling, and merging phases. Similarly, the reduce stage can be divided into shuffling, sorting, reduce processing, and writing phases. Each phase is part of the overall execution of the job in Hadoop.

When Hadoop runs a MapReduce job, BTrace traces specified Java classes to generate a task feature file. A feature file is a detailed representation of the task execution that captures information at the phase level. The feature file generally logs execution time, input data size, and output data size. However, the shuffle phase of the reduce stage requires special attention because it contains multiple operations, such as network transferring and merging. To simplify the operations in this phase and better analyze the result, BTrace only records the timing of network transferring, and the timing of merging is added to the next phase, called sort, which only has merging operations. When a job is finished, RFMS collects the feature files of all tasks and produces a statistical summary of the three characteristics of each phase of a job.
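As a rough illustration of how such per-task records could be aggregated into per-phase statistics (the record layout below is our assumption for the sketch, not the actual BTrace output format):

# Hypothetical sketch: aggregate per-task feature-file records into the three
# per-phase characteristics RFMS uses: execution time, input size, output size.
from collections import defaultdict
from statistics import mean

# (task_id, phase, exec_time_ms, input_mb, output_mb), one tuple per task phase
records = [
    ("map_000001",  "spill",   412, 128.0, 61.5),
    ("map_000002",  "spill",   398, 128.0, 60.9),
    ("reduce_0001", "shuffle", 955, 240.3, 240.3),
]

per_phase = defaultdict(list)
for _task, phase, t_ms, d_in, d_out in records:
    per_phase[phase].append((t_ms, d_in, d_out))

for phase, rows in per_phase.items():
    t_ms, d_in, d_out = zip(*rows)
    print(phase, "avg_time_ms=%.0f" % mean(t_ms),
          "avg_in_mb=%.1f" % mean(d_in), "avg_out_mb=%.1f" % mean(d_out))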

3.2 Building a Performance Model

As described in [10], the performance model for a MapReduce job can be given as

Perf = F(p, d, r, c)    (1)

where F is the performance estimation model for a MapReduce job that runs program p on input data d and uses cluster resources r and configuration parameter settings c.

Because a MapReduce workflow comprises nine phases, each of which is denoted Phase_s, the performance model for a whole MapReduce job can be given by

Perf = Σ_{s=1..9} F_Phase_s(d_s, c_s)    (2)

where F_Phase_s is the performance model for each phase, and d_s is the size of the data processed in Phase_s. The size of the initial input data in the map and reduce stages determines the size of d_s. The parameter settings related to Phase_s are given by c_s. The performance model for each phase can be estimated by

F_Phase_s(d_s, c_s) = FPerTaskPhase_s(d_s, c_s) × numTotalWaves    (3)

where

numTotalWaves = ⌈numTasks / totalTaskSlots⌉    (4)

and

totalTaskSlots = numTaskSlotPerNode × numNodes    (5)

In (3), FPerTaskPhase_s is the performance of Phase_s when executing a single map or reduce task, and numTotalWaves is the total number of task execution waves.

In (4), totalTaskSlots is the sum over all cluster nodes of the task slots numTaskSlotPerNode allocated for map and reduce tasks in each node, and numTasks is the total number of map or reduce tasks. In the reduce stage, numTasks is a configurable parameter. In the map stage, numTasks is determined by the size of the input data d. The number of map tasks is given by

numMapTasks = ⌈totalDataSize / SplitDataSize⌉    (6)

where totalDataSize is the size of input data d, and SplitDataSize is the size of the input split of each map task (the default is 64 MB if uncompressed).

In (5), numNodes is the total number of nodes in a cluster.

Equation (3) requires an estimate of FPerTaskPhase_s; thus, it is necessary to learn nine per-phase performance models for a workload. These models are used to analyze the workflow so that we can use (2) to evaluate overall performance.

More importantly, by building a performance model for FPerTaskPhase_s, we can estimate Perf when the size of the input data is small. In turn, we can predict the workload performance for larger input data by recalculating numTotalWaves in the map and reduce stages.
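The following numerical sketch (our illustration; the per-task phase times and cluster sizes are made-up values, not measurements from the paper) shows how (2)-(6) compose into a stage-level estimate:

import math

# Illustrative cluster and job parameters (not the paper's measurements)
num_nodes = 9                  # numNodes: slave nodes in the cluster
slots_per_node = 4             # numTaskSlotPerNode for the map stage
split_mb = 64                  # SplitDataSize (uncompressed default)
input_mb = 10 * 1024           # totalDataSize: 10 GB of input data

num_map_tasks = math.ceil(input_mb / split_mb)            # eq. (6)
total_task_slots = slots_per_node * num_nodes             # eq. (5)
num_waves = math.ceil(num_map_tasks / total_task_slots)   # eq. (4)

# Hypothetical per-task phase times (ms) from the learned per-phase models
per_task_phase_ms = {"read": 900, "map": 2100, "collect": 300,
                     "spill": 650, "merge": 400}
map_stage_ms = sum(per_task_phase_ms.values()) * num_waves  # eqs. (2)-(3)
print(num_map_tasks, num_waves, map_stage_ms)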

In [21], FPerTaskPhase_s is calculated using a set of functions based on a constant-cost assumption (a cost-based model). It is assumed that the local and remote CPU and I/O costs per phase of executing a MapReduce job are equal across all the nodes in a cluster with the same hardware resources. However, varying the configuration parameters may vary these costs and prove this assumption false.

RFMS addresses this problem by using a machine learning technique to construct FPerTaskPhase_s models for different phases. RFMS uses the random forest (RF) regression model to estimate FPerTaskPhase_s of a workload with varying configuration parameters. RF methodology is precise for regression problems and performs consistently well [11]. Applications that use RFs demonstrate that RF is one of the best available methodologies for modeling the performance of complex Hadoop workloads in the cloud environment [22]. RFMS can produce importance measures for each variable [23]. These measures indicate which variables have the strongest effect on the dependent variables being investigated. RFMS can capture the time taken to execute the phases of Hadoop workloads with varying configuration parameters. The ten configuration parameters that are important to overall performance in a Hadoop system are listed in Table 1. These parameters are used as feature candidates to train RFMS. For given input data, the size of the uncompressed split data for each map task, given by USDFMT, does not change in the map stage. Thus, the size of the split data is not used to train RFMS. However, the input data (shuffle bytes) for each reduce task changes according to the mapred.reduce.tasks parameter. Furthermore, the size of the shuffle bytes significantly affects the execution time of a reduce task. Therefore, the shuffle bytes for each reduce task, ShuffleByteEachTask, should be used as a feature candidate for the reduce stage. ShuffleByteEachTask is given by

ShuffleByteEachTask = (totalDataSize × Selectivity_Mstage) / numReduceTasks    (7)

where

Selectivity_Mstage = Selectivity_Map × Selectivity_Combine × Selectivity_Compress    (8)

and

Selectivity_Op = (1/n) Σ_{i=1..n} OperationSelectivityOfTask_i    (9)

Selectivity can be defined as the statistical ratio of output size to input size for a stage or operation in a workload. In (7), Selectivity_Mstage is the map stage selectivity, and numReduceTasks is the value of the mapred.reduce.tasks parameter. Because the map, combine, and compress operations reduce the input data size, Selectivity_Mstage can be calculated using (8). In (8), Selectivity_Map, Selectivity_Combine, and Selectivity_Compress are operation selectivities. These three selectivities are determined by the related operations, and each can be calculated using (9), where Op denotes the map, combine, or compress operation. In (9), OperationSelectivityOfTask_i is the ratio of output size to input size of the operation in the i-th task, and n is the number of map tasks in a given workload. OperationSelectivityOfTask can be calculated using the data captured by our profiler in a feature file. The combine and compress operations are optional, and Selectivity_Combine and Selectivity_Compress are set to 1 if the user does not specify these operations.
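A small sketch of (7)-(9) follows; the selectivity values and sizes here are invented for illustration, not measured from any workload:

# Estimate the shuffle bytes each reduce task receives, following (7)-(9).
total_data_mb = 10 * 1024        # totalDataSize: 10 GB of map input
num_reduce_tasks = 36            # value of mapred.reduce.tasks (assumed)

# Per-map-task output/input ratios captured by the profiler (assumed values)
map_ratios      = [0.92, 0.95, 0.90]
combine_ratios  = [0.60, 0.58, 0.62]
compress_ratios = [1.0, 1.0, 1.0]   # 1.0 when compression is not enabled

def selectivity(per_task_ratios):   # eq. (9): average over the n map tasks
    return sum(per_task_ratios) / len(per_task_ratios)

sel_mstage = (selectivity(map_ratios) * selectivity(combine_ratios) *
              selectivity(compress_ratios))                            # eq. (8)
shuffle_mb_per_reduce = total_data_mb * sel_mstage / num_reduce_tasks  # eq. (7)
print(round(sel_mstage, 3), round(shuffle_mb_per_reduce, 1))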

An RF can be defined as a collection of tree-structured predictors [11]:

{h_k(x, θ_k), k = 1, ..., N_trees}    (10)

where h_k is the k-th individual tree, h_k(·) is the tree's prediction, and N_trees is the number of trees. The samples of the total training set are given by X = {Fc_k, Pt_k}, where k = 1, ..., N_samples. The predictor features are given by Fc_k, and the phase time is given by Pt_k. The total training set is divided into two independent subsets: one to train the predictor h_k and the other to test the predictor's accuracy. In (10), the θ_k are random variables. The nature and dimensionality of θ_k depend on the randomness used in constructing the N_trees trees. This randomness may come from the random selection, with replacement, of the training records drawn from X, or from mtry, the number of randomly chosen features tried at each split of a tree.

▼Table 1. Configuration parameters selected for testing

In RF regression, it is difficult to define the predictor features Fc_k that allow the base predictors to be trained to accurately predict the target on the out-of-bag data. When selecting important features, importance is assessed by replacing each feature with random noise and observing the increase in the mean squared error (MSE) for the out-of-bag validation. The features are then sorted by relative importance, and an important subset of features can be extracted. The RF is iteratively retrained, and each time, the least important features are removed. In practice, N_trees and mtry are used to tune the RF model and minimize the MSE. A lower MSE makes the model fit the training data better. N_trees and mtry are tuned using scripts that iteratively change each parameter one by one and regenerate the regression model. The optimum value of mtry ranges from 6 to 10, and the optimal value of N_trees ranges from 500 to 1000.
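The tuning loop described above could look roughly like the following sketch (ours, assuming scikit-learn and a hypothetical load_phase_training_set() helper that returns a feature DataFrame and a phase-time Series; scikit-learn's impurity-based importances stand in for the permutation-based measure described in the text):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X, y = load_phase_training_set()   # hypothetical helper, not part of RFMS

best = None
for n_trees in (500, 750, 1000):               # candidate N_trees values
    for mtry in range(6, 11):                  # candidate mtry values
        rf = RandomForestRegressor(n_estimators=n_trees, max_features=mtry,
                                   oob_score=True, bootstrap=True,
                                   random_state=0)
        rf.fit(X, y)
        oob_mse = np.mean((rf.oob_prediction_ - y) ** 2)   # out-of-bag MSE
        if best is None or oob_mse < best[0]:
            best = (oob_mse, n_trees, mtry, rf)

oob_mse, n_trees, mtry, rf = best
ranked = sorted(zip(X.columns, rf.feature_importances_), key=lambda p: -p[1])
print(n_trees, mtry, ranked[:5])   # most influential configuration parameters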

We use a stepwise regression model for the data sets collected from our test bed comprising Hadoop nodes. In section 5, we discuss the accuracy of the prediction against the training data and provide out-of-bag accuracy statistics.

4 Experimental Methodology

The experiments are performed on a test bed comprising ten Sugon servers and a gigabit Ethernet network. Each server has a quad-core Intel(R) Xeon(R) CPU E5-2407 0 at 2.20 GHz and 32 GB of PC3 memory. The cluster is virtualized by Xen 3.0. We create a pool of virtual machines (VMs) from the virtualized cluster, each with eight virtual CPUs and 8 GB of memory. We then run the VMs as Hadoop nodes. Each VM uses SUSE Linux Enterprise Server 11 and Hadoop 1.0.4.

We designate one server to host the master VM node and use the remaining servers to host the nine slave VM nodes. The master node runs the JobTracker and the NameNode. Each slave node runs both the TaskTracker and the DataNode. Each VM is initially configured with four map slots, four reduce slots, and 300 MB of memory per task. The data block size is set at 64 MB. The RFMS profiler component needs to run on each slave VM node so that BTrace can capture the task execution times. Other components of RFMS can run on a separate VM or standalone machine because RFMS processes the gathered features offline.

We run representative Hadoop workloads, such as TeraSort, to test the precision of RFMS with 10 different configuration parameters (Table 1). We use the Hadoop benchmark suite to produce data of various sizes.

5 Results and Analysis

5.1 The Constant-Cost Assumption

We test the constant-cost assumption by comparing six cost features of TeraSort under six different configurations (Table 2). First, we randomly varied the configuration parameter values up to 4325 times when running TeraSort. The parameters generated within the test range are listed in Table 1. To obtain credible results, we use the profiler in [10] to capture every cost feature. We use the plots of the TeraSort benchmark to show the results. Then, we select six cost features from a total of 12 cost features. These six cost features comprise CPU features (REDUCE_CPU, PARTITION_CPU, and MERGE_CPU) and I/O features (READ_LOCAL_IO, WRITE_LOCAL_IO, and NETWORK). From the 4325 configurations, we select the six that have typical cost distributions.

Fig. 2 shows six different values for each cost feature when the configuration settings are varied (note the log scale on the y-axis). For example, REDUCE_CPU ranges from 6610 to 9318 ms/MB, and MERGE_CPU ranges from 0.81 to 427 ms/MB. These values vary significantly, which indicates that the cost features are affected by variations in the configuration settings.

Although we do not show the cost features of other workloads from the Hadoop benchmark suite, from our observations, these features have similar properties. Therefore, we believe the constant-cost assumption is false for the 12 operations in a MapReduce workflow.

Furthermore, we determine whether the inconstant cost values affect the accuracy of performance prediction. Fig. 3 shows the NetworkTransferTime distribution against the NETWORK cost when transferring 2040 MB of data. This distribution is derived from actual measurement. NetworkTransferTime is the time taken to transfer data over the network in the shuffle phase and is given by

▼Table 2. The six types of configuration parameter settings

▲Figure 2. Costs for six types of parameter settings using the TeraSort benchmark.

NetworkTransferTime = dShuffleSize × csNetworkCost    (11)

where dShuffleSize is the total shuffle size [21], and csNetworkCost is the cost of the network transfer. In Fig. 3, NetworkTransferTime is significantly affected by the inconstant NETWORK cost. We set dShuffleSize at a constant 2040 MB; this total shuffle size is generated by a map phase with 10 GB of input data.

Using (11), we can predict NetworkTransferTime when the configuration parameters are varied. We set csNetworkCost at a constant 8.826963634 ms/MB, the value observed with the default configuration parameters. In Fig. 4, the predicted NetworkTransferTime is plotted against the real measurement of NetworkTransferTime. The predictions shown in Fig. 4 are poor, which suggests that the constant-cost assumption is false.
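For reference, the constant-cost prediction applied here is just a one-line use of (11); the shuffle sizes in this sketch are placeholders:

# Constant-cost prediction of NetworkTransferTime using eq. (11).
CS_NETWORK_COST_MS_PER_MB = 8.826963634   # cost observed with default settings

def predict_transfer_ms(shuffle_mb):
    return shuffle_mb * CS_NETWORK_COST_MS_PER_MB

for shuffle_mb in (512, 1024, 2040):      # illustrative shuffle sizes (MB)
    print(shuffle_mb, "MB ->", round(predict_transfer_ms(shuffle_mb)), "ms")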

5.2 Accuracy of RFMS

Fig. 5 shows the phase timing predicted using RFMS against the measured phase timing for a TeraSort workload with 10 GB of input data. The phase timing is measured in 100 groups of experiments.

In Fig. 5, the phase timing predicted using RFMS is similar to the measured phase timing for a MapReduce workflow. Our prediction model is therefore reasonably accurate. Although the predictions for read timing are not particularly accurate, the affected performance range is only between 2000 and 2500, which is low. To quantitatively evaluate the accuracy of our RFMS model, we use the relative error Er, given by

▲Figure 3. Network transfer time for 2040 MB of data with inconstant network cost.

Er = (1/n) Σ_{i=1..n} |Pre_i − Real_i| / Real_i × 100%    (12)

where Pre_i is the i-th value predicted by RFMS, Real_i is the i-th value actually measured, and n is the total number of tests. More precisely, we use the average relative error of 100 predictions, selected from different ranges of the phase timing distribution, to represent the prediction error rate between each real value and the corresponding predicted value.
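A direct transcription of (12) for checking a batch of predictions (the numbers below are placeholders, not the paper's measurements):

# Average relative error, as in eq. (12), expressed as a percentage.
def relative_error_percent(predicted, measured):
    n = len(predicted)
    return sum(abs(p - r) / r for p, r in zip(predicted, measured)) / n * 100.0

pred_ms = [1020, 980, 1110]    # hypothetical RFMS phase-time predictions (ms)
real_ms = [1000, 1005, 1090]   # hypothetical measured phase timings (ms)
print("Er = %.2f%%" % relative_error_percent(pred_ms, real_ms))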

The error rates for the nine phases that we experimented with were calculated using Er and are shown in Table 3. The error rate varies from 0.566% to 7.169%. All error rates are less than 8%, and the average is 5%. This indicates that our RFMS predictions are close to the measured phase timing.

6 Conclusion

▲Figure 4. Predicted network transfer time vs. measured network transfer time.

In this paper, we have confirmed that varying the configuration parameter values significantly affects the precision of a cost-based model. Therefore, we have designed RFMS, a phase-based workflow performance analysis model for precisely tuning Hadoop parameters. We focused on non-transparent performance tuning for various Hadoop workflow processes.

RFMS prediction is significantly better than that provided by a cost-based model. The average accuracy reaches 95%, which is reasonably good. Besides being accurate, our approach has several other practical advantages. It has an improved, lightweight, in-depth workload profiler that collects key task execution information from an unmodified MapReduce/Hadoop workload. It also has a novel dataflow analysis mechanism for Hadoop.

▲Figure 5. The best RF model for predicting phase timing.

However, several aspects of RFMS need to be improved. We would like to further investigate the potential for self-configuration. We also need to reduce the overhead of RFMS for online Hadoop self-tuning.

▼Table 3. Relative error of RFMS

Acknowledgment

This work is partially supported by the cooperation project Research on Green Cloud IDC Resource Scheduling with ZTE Corporation.
