A Case Study on Intelligent Operation System for Wireless Networks

2019-06-16 04:01:58

ZTE Communications, 2019, Issue 4

(ZTE Corporation, Shenzhen, Guangdong 518057, China)

Abstract: The emerging fifth generation (5G) network has the potential to satisfy the rapidly growing traffic demand and promote the transformation of smartphone-centric networks into an Internet of Things (IoT) ecosystem. Due to the introduction of new communication technologies and the increased density of 5G cells, operational complexity and operational expenditure (OPEX) will become very challenging in 5G. Self-organizing networks (SON) have been researched extensively since 2G to cope with similar challenges, but they rely on predefined policies rather than intelligent analysis. The requirement for better quality of experience and the complexity of 5G networks call for an approach that is different from SON. In several recent studies, the combination of machine learning (ML) technology with SON has been investigated. In this paper, we focus on the intelligent operation of wireless networks through ML algorithms. A comprehensive and flexible framework is proposed to achieve an intelligent operation system. Two use cases are also studied, in which ML algorithms automate the anomaly detection and fault diagnosis of key performance indicators (KPIs) in wireless networks. The effectiveness of the proposed ML algorithms is demonstrated by experiments on real data, thus encouraging further research on intelligent wireless network operation.

Keywords: 5G; self-organizing network; machine learning; anomaly detection; fault diagnosis

1 Introduction

Wireless communication technologies have experienced significant advancement over the past three decades, from the first generation (1G) system to fourth generation (4G) networks. Cellular networks have successfully transformed from pure telephony systems into versatile networks that can transport rich multimedia content and have a profound impact on our daily life. The rapid development of the mobile Internet generates a tremendous amount of traffic and consequently requires more bandwidth and better quality of experience. The next-generation wireless networks, i.e., the fifth generation (5G) cellular networks, which are expected to be commercially deployed in 2020, have the potential to satisfy such a rapidly growing demand for data traffic [1].

The 5G networks mainly target three types of scenarios [2]. First, enhanced mobile broadband (eMBB) aims to provide broadband multimedia for human-centric use cases. Second, ultra-reliable low-latency communications (URLLC), with strict requirements in terms of latency (ms level) and reliability (five nines and beyond), is used for remote control of robots or tactile Internet applications. Third, massive machine type communications (mMTC) is mainly used to connect a very large number of devices that transmit a low load of non-delay-sensitive information. It is believed that 5G will significantly promote the transformation of smartphone-centric networks into an Internet of Things (IoT) ecosystem [3] that integrates a heterogeneous mix of wireless-enabled devices ranging from smartphones to connected vehicles, drones, wearables, sensors, and virtual reality devices. Aggregate throughput will increase 1,000-fold from 2015 to 2020, and the number of devices will grow to 500 billion [4]. To achieve this capacity growth, 5G cells have to be densely deployed, about 40 to 50 times as many as in 4G networks. Moreover, a typical 5G node is expected to have 2,000 parameters to be configured and optimized, significantly more than a typical 2G node (500 parameters), 3G node (1,000 parameters) or 4G node (1,500 parameters) [5]. It is foreseen that network operation in 5G will become an enormous challenge. As estimated in [5], operational complexity in 5G will be 53 to 67 times that of 4G.

Operational expenditure (OPEX) is always an important issue for wireless networks. The idea of the self-organizing network (SON) evolved through 2G, 3G and 4G. However, its automation is realized by predefined policies rather than by intelligent analysis and smart decisions. It is time-consuming and expensive for 5G operators to operate and configure the network entirely by hand. In order to reduce OPEX and improve the efficiency of next generation networks, several studies have investigated the benefits of applying machine learning (ML) and big data technology in SON, showing promising results [5]–[8]. The ML engine has the potential to automate many SON scenarios, for example, node deployment planning, advanced load balancing, resource allocation strategy, quality-of-experience (QoE)/quality-of-service (QoS) analysis, and network monitoring, paving the way to proactive, self-aware, self-adaptive and highly efficient networking. In this paper, we focus on the intelligent operation of wireless networks through applying ML technology.

This paper is organized as follows. In Section 2, the ML preliminaries are introduced, and a layered framework for an intelligent operation system is proposed. Section 3 then illustrates two use cases that use ML algorithms to automate the anomaly detection and fault diagnosis of key performance indicators (KPIs) in wireless networks, and shows promising results from on-site data analyses. Finally, we draw the conclusions in Section 4.

2 Framework of the Intelligent Operation System

2.1 Machine Learning Preliminaries

ML technology has attracted wide attention for several decades, especially with the third wave of artificial intelligence (AI) facilitated by rapid developments in deep neural networks, big data analysis and cloud computing. ML is being applied to more and more areas, for example, image processing, face recognition, speech recognition, natural language processing, computational advertising, recommendation systems, and automated driving. Depending on the type of data input and output, and the type of task or problem to be solved, there are three main categories of learning algorithms, as follows:

1) Supervised Learning.

A supervised learning algorithm is fed with a set of data that contains both the inputs and the desired outputs. This data is known as the training data and consists of a set of training examples. Through iterative optimization of an objective function, a supervised learning algorithm aims to determine a general rule that maps inputs to outputs well. A number of popular supervised learning algorithms have been developed and applied successfully, for example, the regression model (RM), support vector machine (SVM), hidden Markov model (HMM), random forest (RF), and time series forecasting. In wireless networks, these models have the potential to solve a number of problems. For example, in massive multi-input multi-output (MIMO) systems with hundreds of antennas, both detection and channel estimation lead to high-dimensional search problems, which can be addressed by these models to estimate or predict radio parameters that are associated with specific users [9]. Forecasting the trend of user equipment (UE) mobility or the traffic volume of different services is another possible application.

2) Unsupervised Learning.

Unlike supervised learning, the input to unsupervised learning does not contain prior labels. Therefore, an unsupervised learning algorithm has to rely on its own capability to find the embedded structure or pattern in its input, for example by grouping or clustering data points. Typical unsupervised learning algorithms include K-means clustering, principal component analysis (PCA), independent component analysis (ICA), one-class SVM, etc. K-means clustering was studied in [10] to partition the mesh access points (MAPs) into several groups in a hybrid optical/wireless network scenario, in order to optimize both the gateway partitioning and the virtual-channel allocation. K-means clustering can also be used to detect network anomalies. PCA and ICA are two common algorithms used for signal processing and feature dimension reduction. They can be applied to physical layer signal dimension reduction in massive MIMO systems to reduce the computational complexity, or to anomaly detection and fault detection problems in wireless networks where multiple performance indicators are monitored.

3) Reinforcement Learning.

Inspired by both control theory and behaviorist psychology, reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Many reinforcement learning algorithms use dynamic programming techniques and do not assume explicit knowledge of whether the agent has come close to its goal. They are used when exact models are infeasible. Due to its generality, the field is studied in many other disciplines, such as control theory, operations research, information theory, multi-agent systems, and swarm intelligence. Typical reinforcement learning models and algorithms include Markov decision processes (MDP), partially observable Markov decision processes (POMDP), Q-learning, and the multi-armed bandit (MAB). In conjunction with MDP models, Q-learning has been extensively applied in heterogeneous networks. In [11], for example, the authors presented a heterogeneous fully distributed multi-objective strategy for the self-configuration/optimization of femto cells. Reinforcement learning methods can also be applied to problems such as spectrum sharing for device-to-device networks and energy modeling in energy harvesting.

The three categories of machine learning algorithms and the typical methods in each category are summarized in Fig. 1.

2.2 Intelligent Operation System Design

Although ML can be applied to a number of aspects of SON, this paper focuses on the application of ML technology to the intelligent operation and maintenance (O&M) of wireless networks. Fig. 2 shows a possible implementation of the framework of an intelligent operation system. The system is designed in a layered manner to maximize flexibility, scalability, and manageability. It consists of four layers: the data governance layer, engine layer, model & semantic layer, and application layer. Each layer is described in detail below.

1) Data Governance Layer.

The original data are collected, screened and transformed in this layer. Data is the fundamental ingredient for a successful implementation of ML. In a wireless network system, diverse kinds and large amounts of data are produced by individual modules, and these data contain valuable information for network maintenance. Examples include KPI data, key quality indicator (KQI) data, alarm data, configuration data, log data, etc. The data can be collected in three ways. Historical data are collected over a long period in the past and are mainly used for model training. Online data are collected automatically in real time and are used for online application of the trained model, such as anomaly detection of KPIs. Label data are collected with labeling tools and are used to train supervised machine learning algorithms or to improve algorithm performance. For example, an operation expert can label each data point of a KPI as anomalous or not, and these labels can then be used to train an anomaly detection model. The collected original data are managed with an extract-transform-load (ETL) process, producing dimensional data, merged data, topic data or training data. Dimensional data are produced from the original data according to different perspectives; for example, KPIs can be classified into accessibility indicators, retainability indicators, mobility indicators, etc. The original data can be merged spatially or temporally; for example, cell-level KPIs are merged to the sub-network level. Original data can be organized into topic data according to the application scenario; for example, traffic flow data are used for network traffic monitoring. Training data are the final processed data that can be used to calibrate the ML algorithms.

▲Figure 2. A general overview of the intelligent operation system.


3) Model & Semantic Layer.

The model & semantic layer provides several abstract models and basic libraries from which an end ML application is built. The network element (NE) model defines several explicit mathematical models of individual wireless network modules, for example, the communication model in the physical layer, the device parameters of some physical components, the exact relationship between some KPIs, and the network topology of different elements. The metadata model is adopted to define general concepts when a set of objects share the same attributes, operations, relations, and semantics. For example, a time series metadata model is formulated to represent all data (KPIs/KQIs) of a time series nature. This metadata model should define several common attributes: sampling frequency, time range, period, sampling value, time stamps, etc. The expert rule library collects a number of rules defined by O&M experts. These rules can be used as input to ML algorithms or to improve the performance of an algorithm. For example, the experts can define the correlation between some alarms, or state that one KPI is the root cause of another KPI. The algorithm library collects the ML algorithm modules used for developing ML applications. As mentioned above, these ML algorithms include SVM, HMM, RF, ICA, PCA, K-means clustering, and so on.

4) Application Layer.

The application layer includes a number of ML applications developed to facilitate the intelligent O&M of wireless networks. These applications are built from the components of the lower layers. They are usually developed case by case to solve practical O&M problems, and should be easy for operation personnel to use. The TopN analysis application automatically shows the top-n cells whose QoS is poor, for example those with a high call drop rate, low connection rate, or low paging success rate. TopN analysis is one of the most common functions in network maintenance, and its automation can significantly reduce the load of an O&M engineer. The anomaly detection application is used for automating the process of fault detection in the network. For example, whether each point of a KPI is abnormal can be detected with dynamic threshold technology. Compared with a manually configured static threshold, an ML-directed dynamic threshold has the potential to improve detection accuracy and efficiency. Root cause analysis can be used for automatic association or correlation analysis between different events, to find the root cause of an event such as an alarm or a detected KPI anomaly. Root cause analysis is critical for fault diagnosis and fault recovery. Prediction analysis is useful for predicting QoS/QoE or other variables from the historical and current state of the network. It is a critical step toward proactive operation of the system, with possible applications such as fault prediction, load balancing, and capacity planning, consequently reducing the fault rate and increasing resource utilization. It is worth noting that only a few examples are enumerated here, and many other applications can be developed according to different requirements.
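As a rough illustration of the TopN analysis described above, the following Python sketch ranks cells by a single QoS indicator. The record layout, the field names (`cell_id`, `drop_call_rate`), the sample values and the function name `top_n_cells` are all hypothetical and are not taken from the paper; a real application would read these values from the data governance layer.

```python
from typing import Dict, List


def top_n_cells(records: List[Dict], kpi: str, n: int = 3,
                higher_is_worse: bool = True) -> List[str]:
    """Return the IDs of the n cells with the worst value of one KPI."""
    # Sort so that the worst cells come first, then keep the first n.
    ranked = sorted(records, key=lambda r: r[kpi], reverse=higher_is_worse)
    return [r["cell_id"] for r in ranked[:n]]


# Hypothetical per-cell call drop rates in percent.
cells = [
    {"cell_id": "A", "drop_call_rate": 0.8},
    {"cell_id": "B", "drop_call_rate": 2.5},
    {"cell_id": "C", "drop_call_rate": 1.2},
    {"cell_id": "D", "drop_call_rate": 0.3},
]
worst = top_n_cells(cells, "drop_call_rate", n=2)
print(worst)
```

For indicators where a low value is bad, such as the paging success rate, the same function can be called with `higher_is_worse=False`.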

3 Use Cases

The aforementioned framework illustrates a unified solution for implementing an O&M operation system. In this section, two use cases are described in detail, demonstrating the ML algorithms developed for anomaly detection and anomaly diagnosis with KPIs. They are example functions of the anomaly detection application and the root cause analysis application in Fig. 2.

3.1 Anomaly Detection with KPIs

KPI anomaly detection is quite important for network maintenance. Due to the complexity of a 5G network, which contains numerous radio nodes and other components, there is a huge amount of KPI data to be monitored, and doing so manually may be time-consuming, error-prone and even impossible. An ML-based anomaly detection method is proposed in this paper, as shown in Fig. 3. It is essentially composed of three modules: anomaly detection, anomaly scoring, and feedback. The anomaly detection model and the scoring model are trained with off-line data, using the batch computing engine and training engine in Fig. 2. The KPI data are then checked online on the streaming computing engine. Any KPI data point whose anomaly score is higher than a predefined threshold is reported to the O&M engineer, who can label it as abnormal or not, providing feedback to the training module to improve the algorithm performance.

KPIs exhibit varied characteristics because of the diverse characteristics of network modules. For example, some KPIs show periodicity while others do not; some KPIs have a trend, while others are stable. A two-stage modeling method is proposed in this paper to deal with the huge challenge of comprehensively modeling all kinds of KPIs. As shown in Fig. 4, the first stage is the classification stage, where a time series clustering algorithm is formulated to classify the KPIs based on their structural characteristics. In the second stage, the module selects an appropriate time series model for each KPI category, predicting the normal baseline of a KPI at each time point. During online detection, a value is flagged as an anomaly if it falls outside this baseline.

The time series clustering method based on structural features was introduced in [12], which proposed a hierarchical scheme to reduce the complexity of clustering. First, the time series are classified into two main categories, significant periodicity and non-significant periodicity, based on the Fourier transform. Second, the k-means algorithm is used to cluster the time series in each main category based on seven features extracted from the KPI series. In the first stage, the frequency amplitude spectrum of a KPI is calculated by the discrete Fourier transform (DFT) as follows:
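For reference, the standard DFT of a KPI series $x_0, \dots, x_{N-1}$ can be written as below. The symbols $x_n$, $N$ and $F(k)$ are assumed notation and may differ from the authors' own.

```latex
F(k) = \sum_{n=0}^{N-1} x_n \, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1
```

The amplitude spectrum is then the magnitude $|F(k)|$ of each coefficient.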

We denote the maximum, mean and standard deviation of the amplitude spectrum as $|F|_{\max}$, $|F|_{\mathrm{mean}}$, and $|F|_{\mathrm{std}}$. If $|F|_{\max} > |F|_{\mathrm{mean}} + c\,|F|_{\mathrm{std}}$, where $c$ is a predefined coefficient larger than 3, the KPI is classified as having significant periodicity; otherwise it is classified as having non-significant periodicity. Please refer to [12] for a more detailed description of the clustering process.
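The periodicity test above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `c=3.5` is an arbitrary choice satisfying "larger than 3", and the sketch removes the series mean and keeps only bins $k = 1, \dots, N/2$ so that the DC component does not dominate the spectrum, a detail the text does not specify.

```python
import cmath
import math
import statistics


def amplitude_spectrum(x):
    """Naive O(N^2) DFT amplitudes |F(k)| for k = 1 .. N//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]  # assumption: drop the DC component
    spectrum = []
    for k in range(1, n // 2 + 1):
        f_k = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                  for t, v in enumerate(centered))
        spectrum.append(abs(f_k))
    return spectrum


def has_significant_periodicity(x, c=3.5):
    """Apply the rule |F|max > |F|mean + c * |F|std to one KPI series."""
    amp = amplitude_spectrum(x)
    return max(amp) > statistics.mean(amp) + c * statistics.pstdev(amp)


# A clean daily cycle sampled hourly for 10 days: one dominant spectral peak.
periodic = [math.sin(2 * math.pi * t / 24) for t in range(240)]
# Many equally strong frequencies: no single peak stands out.
broadband = [sum(math.sin(2 * math.pi * k * t / 240) for k in range(1, 61))
             for t in range(240)]

print(has_significant_periodicity(periodic))
print(has_significant_periodicity(broadband))
```

A production version would use a fast Fourier transform instead of the naive double loop; the decision rule stays the same.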

Once a KPI is classified, a suitable time series model is selected according to its characteristics. There are a number of candidate models available, such as density estimation, the Olympic model, the regression model, the Holt-Winters model, and auto-regressive integrated moving average (ARIMA) [13]. For example, if a KPI contains trend and periodicity, the Holt-Winters model is able to model it as follows:
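For reference, the additive Holt-Winters recursions in their textbook form can be written as below. The observation $y_t$ and the smoothing parameters $\lambda_l, \lambda_b, \lambda_s \in [0, 1]$ are assumed notation, chosen to avoid a clash with the significance level $\alpha$ that appears in the baseline formula; the paper may use different symbols or a different variant, such as the multiplicative one.

```latex
\begin{aligned}
l_t &= \lambda_l \,(y_t - s_{t-m}) + (1 - \lambda_l)\,(l_{t-1} + b_{t-1}) \\
b_t &= \lambda_b \,(l_t - l_{t-1}) + (1 - \lambda_b)\, b_{t-1} \\
s_t &= \lambda_s \,(y_t - l_{t-1} - b_{t-1}) + (1 - \lambda_s)\, s_{t-m}
\end{aligned}
```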

▲Figure 3. An illustration of the machine learning (ML)-based anomaly detection method.

▲Figure 4. A demonstration of the two-stage time series modeling method.

where $l_t$, $b_t$, and $s_t$ are the level component, trend component and seasonal component, respectively, and $m$ is the period of the time series. The forecast value at step $h$ would be:
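In the additive textbook formulation, the $h$-step-ahead forecast takes the following form. The symbols $\hat{y}_{t+h|t}$ and $k$ are assumed notation; $k$ selects the most recent seasonal estimate for the forecast time.

```latex
\hat{y}_{t+h|t} = l_t + h\, b_t + s_{t+h-m(k+1)}, \qquad k = \left\lfloor \frac{h-1}{m} \right\rfloor
```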

where […]. Once the predicted values and the fitting errors on the historical data have been calculated, the normal baseline can be formulated as:

where […] is a percentile of the standard Gaussian distribution and $\alpha_H$ is the standard deviation of the fitting errors on the historical data. A commonly used value for $\alpha$ is 0.003. Fig. 5 is an illustration of the computed thresholds for a KPI.
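The following standard-library Python sketch puts the pieces together: it fits an additive Holt-Winters model, measures the spread of the one-step-ahead fitting errors, and builds a baseline band around each forecast. Several points are assumptions rather than the authors' method: the additive variant, the simple initialization from the first two periods, the smoothing parameters `lam_l`, `lam_b`, `lam_s`, a symmetric band, and the choice of the $1 - \alpha/2$ Gaussian quantile as the multiplier. All function names are hypothetical.

```python
from statistics import NormalDist, pstdev


def fit_holt_winters(y, m, lam_l=0.3, lam_b=0.05, lam_s=0.2):
    """Additive Holt-Winters fit; returns final state and one-step-ahead fits."""
    if len(y) < 2 * m:
        raise ValueError("need at least two full periods of history")
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    trend = (second - first) / m           # average slope between the two periods
    level = first + trend * (m - 1) / 2    # level at the end of the first period
    # Initial seasonal terms: first period minus its straight-line trend.
    season = [y[i] - (first + trend * (i - (m - 1) / 2)) for i in range(m)]
    fitted = []
    for t in range(m, len(y)):
        s_old = season[t % m]
        fitted.append(level + trend + s_old)  # forecast of y[t] made at t - 1
        new_level = lam_l * (y[t] - s_old) + (1 - lam_l) * (level + trend)
        new_trend = lam_b * (new_level - level) + (1 - lam_b) * trend
        season[t % m] = lam_s * (y[t] - level - trend) + (1 - lam_s) * s_old
        level, trend = new_level, new_trend
    return level, trend, season, fitted


def baseline(y, m, horizon, alpha=0.003):
    """Return (lower, forecast, upper) for each of the next `horizon` steps."""
    level, trend, season, fitted = fit_holt_winters(y, m)
    sigma = pstdev([obs - fit for obs, fit in zip(y[m:], fitted)])
    z = NormalDist().inv_cdf(1 - alpha / 2)  # assumption: two-sided Gaussian band
    bands = []
    for h in range(1, horizon + 1):
        pred = level + h * trend + season[(len(y) - 1 + h) % m]
        bands.append((pred - z * sigma, pred, pred + z * sigma))
    return bands


def is_anomaly(value, band):
    lower, _, upper = band
    return value < lower or value > upper


# Toy history: linear trend plus a period-4 pattern, with no noise.
pattern = [0, 2, 0, -2]
history = [t + pattern[t % 4] for t in range(16)]
bands = baseline(history, m=4, horizon=2)
print(bands)
```

With `alpha=0.003` the multiplier is close to three standard deviations, so the band resembles a three-sigma rule around the forecast.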

Other types of KPIs can be modeled by other time series models. For example, data with significant randomness could be modeled by density estimation rather than by the Holt-Winters model.

The anomaly scoring model is critical for reducing false alarms and helps the O&M engineer focus on important events. The detailed algorithm is not presented in this paper because of space limits; it is a planned topic for future research.

▲Figure 5. An illustration of time series modeling by Holt-Winters: (a) represents the true value (blue curve) and fitting value (green curve) in historical data; (b) represents the true value (blue curve) and the predicted thresholds (red curve) in the following day.

3.2 Anomaly Diagnosis with KPIs

When a KPI anomaly is detected, it is valuable to identify the root causes for rapid fault recovery. Fig. 6 depicts the anomaly diagnosis method developed in this paper, which combines a rule-based diagnosis module and an ML-based diagnosis module to handle a wide range of scenarios.

As shown in Fig. 6, when the detected anomaly is a known fault that can be explicitly diagnosed by predefined expert rules, the rule-based diagnosis module identifies the root causes from related information. This information includes the NE model in Fig. 2, which contains the network topology and the exact mathematical function between the KPI and its related counter indicators (counter indicators are more basic performance data than KPIs), as well as the expert rule library. The rule-based module can generally output exact root causes and provide direct suggestions for recovery.

When the detected anomaly is an unknown fault, the ML-based diagnosis module identifies the root causes by using the partial least squares regression (PLS) algorithm, as proposed in this paper. PLS has been used in multivariate monitoring of process operating performance, in almost the same way as PCA-based monitoring [14]. Instead of only finding hyperplanes of maximum variance for the independent variables, PLS finds a linear regression model by projecting the response variables and the independent variables to a new space. Compared to standard linear regression, PLS regression is particularly suitable when there are more independent variables than observations and when there is multicollinearity among the independent variables. As illustrated in Fig. 7, when an abnormal KPI is detected, PLS models the KPI as the response variable and the correlated counter indicators as independent variables. Following the PLS modeling, a contribution analysis is conducted to find the top root counter indicators.

Denoting the data matrix of correlated counter indicators as $X$ and the matrix of a KPI as $Y$, the PLS model between $X$ and $Y$ can be formulated as:
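In the standard PLS formulation, which matches the symbol definitions given in the following sentence, the model reads:

```latex
\begin{aligned}
X &= T P^{T} + E \\
Y &= U Q^{T} + F
\end{aligned}
```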

where $T$ and $U$ are projections of $X$ (the $X$ score, component or factor matrix) and projections of $Y$ (the $Y$ scores), respectively; $P$ and $Q$ are orthogonal loading matrices; and matrices $E$ and $F$ are the error terms. As the PLS model has only one response KPI, the PLS1 algorithm can be used for estimating $T$, $U$, $P$ and $Q$. Then, a $T^2$ statistic is used to represent the model status at each observation $x$, as in [14]:
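A form of the statistic that is consistent with the definition of $\Gamma$ given in the text is shown below. It is a reconstruction under that assumption, not a quotation of [14].

```latex
T^{2} = x^{T} R \, \Lambda^{-1} R^{T} x = \lVert \Gamma x \rVert^{2}
```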

▲Figure 6. A mixed scheme that combines the rule-based and ML-based diagnosis modules for KPI anomaly diagnosis.

▲Figure 7. Root cause analysis with the PLS model when a KPI is abnormal.

where $\Gamma = (R \Lambda^{-1} R^{T})^{1/2}$, $\Lambda =$ […], and $R$ is the rotation matrix for $X$. The contribution of the $i$-th independent variable, i.e., counter indicator, to the $T^2$ statistic is calculated as:
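One contribution definition that fits the description of $\gamma_i$ in the next sentence, and whose terms sum to $\lVert \Gamma x \rVert^{2}$ over all variables, is the following. It is an assumed reconstruction.

```latex
C(T^{2}, i) = (\gamma_i \, x)^{2}
```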

where $\gamma_i$ is the $i$-th row of $\Gamma$. The total contribution of the $i$-th counter indicator to the variation of the KPI can be calculated as the sum of $C(T^2, i)$ over $n$ observations. The contributions of all counter indicators are sorted, and the top-n counter indicators are output as the root causes of the anomalous KPI. Fig. 8 shows an experimental example, illustrating the contributions of 60 counter indicators to an anomalous KPI, downlink (DL) IP Throughput. The O&M expert confirms that the top counter indicator, C373597010: DL Used Control Channel Element (CCE) Average Number, is useful for the anomaly diagnosis, demonstrating the effectiveness of the proposed algorithm.
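The summation and ranking step can be sketched as follows in standard-library Python. The sketch assumes that the matrix $\Gamma$ has already been obtained from a fitted PLS model and that the per-observation contribution is the squared projection $(\gamma_i x)^2$; the function name, the indicator names and the numbers are hypothetical.

```python
def rank_root_causes(gamma, observations, names, top_n=3):
    """Sum the squared projections (gamma_i . x)^2 over all observations and rank."""
    totals = [0.0] * len(gamma)
    for x in observations:
        for i, row in enumerate(gamma):
            proj = sum(g * v for g, v in zip(row, x))  # gamma_i . x
            totals[i] += proj * proj
    ranked = sorted(zip(names, totals), key=lambda p: p[1], reverse=True)
    return ranked[:top_n]


# Toy input: an identity Gamma, so each contribution is just x_i squared.
gamma = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
observations = [[0.1, 2.0, -0.5],
                [0.2, 1.5, 0.4]]
names = ["counter_1", "counter_2", "counter_3"]

result = rank_root_causes(gamma, observations, names, top_n=2)
print(result)
```

In practice the observations would be the scaled counter indicator values recorded during the anomalous period.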

4 Conclusions

▲Figure 8. An experimental example of the partial least squares regression (PLS) method for root cause analysis.

Research on intelligent O&M, known as AIOps [15], has attracted extensive interest for IT systems in recent years. However, the topic is less discussed in wireless networks. With the evolution of wireless networks and the emergence of 5G, networks are becoming more complicated, which highlights the disadvantages of manual operation and the need to automate the O&M process with intelligent analysis. In this paper, we formulate an intelligent operation system based on the layering concept, resulting in a flexible, scalable and manageable framework. Two practical use cases, anomaly detection and anomaly diagnosis with KPI data, are then studied based on the framework. A two-stage time series modeling method is proposed to construct the anomaly detection model, and a mixed scheme is proposed for anomaly diagnosis. Experiments on real data demonstrate the effectiveness of the proposed methods, thus encouraging further research on intelligent operation with ML technology. In the future, we will develop more use cases to resolve other operation issues in wireless networks, for example top-n cell analysis, automated log analysis, prediction analysis, and optimal parameter configuration.

