
Handling Label Noise in Air Traffic Complexity Evaluation Based on Confident Learning and XGBoost



1.College of Civil Aviation,Nanjing University of Aeronautics and Astronautics,Nanjing 211106,P.R.China;2.College of Computer Science and Technology/College of Artificial Intelligence,Nanjing University of Aeronautics and Astronautics,Nanjing 211106,P.R.China

(Received 8 June 2020;revised 1 July 2020;accepted 22 July 2020)

Abstract: Air traffic complexity is a critical indicator for air traffic operation and plays an important role in air traffic management (ATM) tasks such as airspace reconfiguration, air traffic flow management and the allocation of air traffic controllers (ATCos). Recently, many machine learning techniques have been used to evaluate air traffic complexity by constructing a mapping from complexity related factors to air traffic complexity labels. However, the low quality of complexity labels, known as label noise, has often been neglected and causes unsatisfactory performance in air traffic complexity evaluation. This paper targets label noise in air traffic complexity samples and proposes a confident learning and XGBoost-based approach to evaluate air traffic complexity under label noise. The confident learning process is applied to filter out noisy samples under various label probability distributions, and XGBoost is used to train a robust and high-performance air traffic complexity evaluation model on datasets with different label noise removal ratios. Experiments are carried out on a real dataset from the Guangzhou airspace sector in China, and the results show that an appropriate label noise removal strategy combined with the XGBoost algorithm can effectively mitigate the label noise problem and achieve better performance in air traffic complexity evaluation.

Key words:air traffic complexity evaluation;label noise;confident learning;XGBoost

0 Introduction

With the rapid development of the air transport industry, surging flight volume and limited airspace impose new challenges on the current air traffic management system and air traffic controllers (ATCos). Many potential safety problems arise, such as airspace congestion, flight conflicts and high ATCo workload. To regulate air traffic safely, airspace is divided into several smaller sectors, each in the charge of ATCos. However, ATCo resources are limited, so they need to be allocated over different sectors reasonably through advanced techniques such as resectorization or dynamic airspace configuration. The key to these techniques is to accurately evaluate air traffic complexity.

Air traffic complexity is a quantitative indicator that reflects the complexity of the air traffic system's operation pattern, the relationships between aircraft and the uncertainty of evolutionary trends[1-3]. Evaluating air traffic complexity is not easy because of the numerous complexity related factors and the non-linear correlations involved in the formation of air traffic complexity[4].

There are two main methods in air traffic complexity evaluation research[5]. The first focuses on constructing a model around the single most relevant indicator, such as conflict probability[6], conflict resolution difficulty[7], or the Lyapunov exponent[8]. However, as air traffic complexity contains large amounts of information and embeds sophisticated relationships, it is unrealistic to perfectly evaluate it with a single indicator or model. The principle of the other method is to consider as many complexity factors as possible to describe air traffic complexity comprehensively. The most famous one is the dynamic density method, which calculates complexity as the weighted sum of various complexity factors[9]. However, due to its inability to depict non-linear relationships, the dynamic density method tends to produce imprecise results in practice. Better non-linear methods were then put into use. In 2006, Gianazza et al.[10] introduced the idea that the air traffic complexity problem could be treated as a complexity level classification task, and used a backpropagation neural network (BPNN) to capture the non-linear relationship[10]. Later on, more advanced methods, such as the adaptive boosting learning algorithm[11] and transfer learning[12], were employed and achieved fruitful results in air traffic complexity evaluation.

All these existing machine learning-based complexity evaluation methods share the premise that the complexity labels evaluated by air traffic management (ATM) experts are definitely correct. In fact, however, some samples used by machine learning algorithms may have inaccurate labels, especially when the labels are provided by humans[13-14], and air traffic complexity labels marked by ATM experts are no exception. In 2019, Andrasi et al.[15] carried out a comparative experiment on air traffic complexity evaluation between a neural network and a linear model. Theoretically, the neural network should obtain a better result because of its greater ability to depict non-linear relationships; however, the results showed only a small difference. The authors of Ref.[15] attributed the remaining error to intra-rater and inter-rater unreliability of the human experts during labeling, which illustrates that the premise of definitely correct labels may not be appropriate. Hence, we should pay more attention to incorrect labels, named label noise, and their impact on air traffic complexity evaluation.

In this paper, we propose a confident learning and XGBoost-based method to evaluate air traffic complexity under label noise, a problem that has not been dealt with before. In our method, out-of-sample probability distributions over the complexity classes are computed for every sample in a cross-validation manner using several different classification algorithms. Under each label probability distribution, label noise detection and cleansing steps based on confident learning are carried out separately to produce several suspected label noise sets, which are then integrated into one total set. Based on the total label noise set, we selectively remove different ratios of label noise samples from the original dataset to obtain datasets of different cleanliness. Finally, by comparing the performance of XGBoost and other classification algorithms on these cleansed datasets, the optimal label noise removal ratio and the corresponding classification algorithm can be obtained for the final air traffic complexity evaluation.

1 Problem Description

This section gives a description of evaluating air traffic complexity by machine learning methods and the problem of label noise we encountered.

Our objective is to evaluate air traffic complexity from a variety of complexity related factors. More specifically, for every air traffic scenario we have its real traffic operational data, such as aircraft speed, heading, longitude, latitude and altitude. According to previous research, these data are transformed into complexity related factors that describe air traffic complexity. The air traffic complexity level provided by ATM experts is collected when the real traffic operational data are generated. In the machine learning field, these complexity related factors and the air traffic complexity level are known as features and labels, respectively. Based on complexity related features and label information, many scholars have carried out air traffic complexity research within the machine learning framework[3,5,11,12,15-17]. The main idea is to construct a mapping model between the features and the complexity labels. Then, when new air traffic data arrive, the model can predict air traffic complexity intelligently without ATM experts. The whole process is displayed in Fig.1.

In supervised machine learning, many benchmark datasets contain label noise, and air traffic complexity datasets are no exception. The reason was summarized as human expert inconsistency, consisting of low intra-rater and inter-rater reliability, by Andrasi in 2019[15]. The raters are the ATM experts who rate the complexity of given traffic situations. Intra-rater reliability is the degree of agreement among multiple ratings by a single rater, while inter-rater reliability is the degree of consistency between multiple raters. For instance, even the same traffic situation may be rated with different complexity labels in different circumstances, which induces the label noise problem.

Label noise may obscure the relationship between a sample's features and its label, and thus impact the classification performance of classifiers. Some researchers, aware of the problem, tried to obtain high quality complexity labels by integrating the opinions of different experts on the same air traffic scenario, or by conducting more complete and strict process management[17]. However, these solutions cannot completely solve the label noise problem and may even waste limited labeling resources. Therefore, this paper puts forward a label noise sample detection and removal strategy to handle the label noise problem in air traffic complexity evaluation.

2 Methodology

2.1 Air traffic complexity representation

Various factors influence the level of air traffic complexity and have drawn much attention in air traffic complexity research. Kopardekar et al.[9,18-19] have identified nearly 40 air traffic complexity factors since 1963. Delahaye et al.[20] described the intrinsic attributes of air traffic by the relative positions and relative speeds of aircraft pairs, and then constructed a traffic disorder model to analyze complexity. Lee et al.[21] emphasized the heading changes of aircraft in response to an intrusive aircraft within a sector to calculate air traffic complexity. A probabilistic factor was put forward by Prandini et al.[6] to measure mid-term traffic complexity. In this paper, we choose 24 complexity factors that have consistently been found to be relevant to air traffic complexity. These factors are the features we use in the later machine learning process. Their definitions are listed in Table 1, and more detailed information can be found in Refs.[20,22-23].
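As an illustration of how such factors could be derived from raw surveillance data, the sketch below computes a few representative quantities from a one-minute traffic snapshot. The field names and the particular factors shown are illustrative assumptions, not the exact definitions of Table 1.

```python
import numpy as np

def complexity_features(snapshot):
    """Compute a few illustrative complexity factors from a one-minute
    traffic snapshot. Each aircraft is a dict with assumed keys:
    speed (kt), heading (deg), climb_rate (ft/min)."""
    n = len(snapshot)  # traffic density: number of aircraft in the sector
    if n == 0:
        return {"aircraft_count": 0, "speed_std": 0.0,
                "heading_disorder": 0.0, "n_evolving": 0}
    speeds = np.array([a["speed"] for a in snapshot])
    headings = np.radians([a["heading"] for a in snapshot])
    climb = np.array([a["climb_rate"] for a in snapshot])
    # Circular disorder: 0 when all headings agree, near 1 when spread out.
    resultant = np.hypot(np.cos(headings).mean(), np.sin(headings).mean())
    return {
        "aircraft_count": n,
        "speed_std": float(speeds.std()),                # speed dispersion
        "heading_disorder": float(1.0 - resultant),
        "n_evolving": int((np.abs(climb) > 100).sum()),  # climbing/descending
    }
```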

2.2 Label noise detection by confident learning

To handle the label noise problem in machine learning,there are two main solutions.

(1)Algorithm level:Construct a robust classifier to resist the impact of label noise.

(2)Data level:Detect and remove label noise to get a clean dataset for training.

We will start from the data level,which is most commonly used in applications because of its convenience and effectiveness.

Confident learning is an approach for characterizing, identifying and learning with noisy labels, based on the principles of pruning noisy data, counting to estimate noise, and ranking examples to train with confidence[24]. It uses predicted probabilities and noisy labels to count examples in the unnormalized confident joint, estimate the joint distribution and prune noisy data. Only two inputs are needed: out-of-sample predicted probabilities and the array of noisy labels. The method requires no hyperparameters and outputs samples ordered by their label noise probability. The whole process is shown in Fig.2.

Fig.2 Label noise detection by confident learning
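A minimal sketch of this detection step in Python, assuming the open-source cleanlab library (which implements confident learning) and scikit-learn for the out-of-sample probabilities; the interface shown reflects our understanding of the cleanlab 2.x API and may differ from what the paper actually used.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from cleanlab.filter import find_label_issues

# X: (n_samples, 24) complexity factor matrix; y: noisy complexity levels 0..4.
# Out-of-sample predicted probabilities via 5-fold cross-validation,
# as confident learning requires.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
pred_probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

# Indices of suspected label noise, ranked most-suspect-first.
noise_idx = find_label_issues(labels=y, pred_probs=pred_probs,
                              return_indices_ranked_by="self_confidence")
```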

For each sample $x$ with noisy label $\tilde{y}$, a class $j$ is counted as the latent true label when the predicted probability $\hat{p}(\tilde{y}=j;x)$ exceeds the per-class threshold $t_j$, the mean predicted probability of class $j$ over the samples labeled $j$. A label collision, where a sample exceeds the threshold of more than one class, is handled by selecting $j=\arg\max_{l:\,\hat{p}(\tilde{y}=l;x)\ge t_l}\hat{p}(\tilde{y}=l;x)$. Therefore, the confident joint $C_{\tilde{y},y^*}$ is defined as

$$C_{\tilde{y},y^*}[i][j]=\Big|\{x\in X_{\tilde{y}=i}:\hat{p}(\tilde{y}=j;x)\ge t_j,\; j=\arg\max_{l:\,\hat{p}(\tilde{y}=l;x)\ge t_l}\hat{p}(\tilde{y}=l;x)\}\Big|,\qquad t_j=\frac{1}{|X_{\tilde{y}=j}|}\sum_{x\in X_{\tilde{y}=j}}\hat{p}(\tilde{y}=j;x)$$

where the $\arg\max$ only matters when a collision occurs; diagonal entries of $C_{\tilde{y},y^*}$ count correct labels and off-diagonal entries capture asymmetric label error counts.

Following the estimation of the joint, pruning, ranking and other heuristics are applied to clean the dataset. The estimated joint distribution $\hat{Q}_{\tilde{y},y^*}$ is used to estimate the number of label errors, which are then removed by ranking over predicted probability. The prune-by-noise-rate method selects, for each off-diagonal entry $(i,j)$ of $\hat{Q}_{\tilde{y},y^*}$, the $n\cdot\hat{Q}_{\tilde{y}=i,y^*=j}$ examples with maximum margin $\hat{p}(\tilde{y}=j;x)-\hat{p}(\tilde{y}=i;x)$. Once label noise is found, we train our model with the errors removed.
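For concreteness, here is a simplified from-scratch sketch of the confident joint count and the off-diagonal selection described above, under our reading of Ref.[24]; the cleanlab library provides the full, calibrated version.

```python
import numpy as np

def confident_joint_and_noise(pred_probs, labels, num_classes):
    """Count the unnormalized confident joint C[i][j] and return indices
    of suspected label errors (off-diagonal memberships).
    pred_probs: (n, c) out-of-sample probabilities; labels: (n,) noisy labels.
    """
    n = len(labels)
    # Per-class threshold: mean predicted probability of class j
    # over the samples whose noisy label is j.
    t = np.array([pred_probs[labels == j, j].mean() for j in range(num_classes)])
    C = np.zeros((num_classes, num_classes), dtype=int)
    suspects = []
    for x in range(n):
        above = [l for l in range(num_classes) if pred_probs[x, l] >= t[l]]
        if not above:
            continue                      # sample counted nowhere
        # Label collisions resolved by argmax over classes above threshold.
        j = max(above, key=lambda l: pred_probs[x, l])
        C[labels[x], j] += 1
        if j != labels[x]:
            suspects.append(x)            # off-diagonal => suspected error
    return C, suspects
```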

2.3 XGBoost classification algorithm

XGBoost is short for "extreme gradient boosting", which is designed as a scalable machine learning system for tree boosting[25]. Its parallel tree boosting and regularization strategy enable it to run much faster and achieve state-of-the-art results in many machine learning problems. As an ensemble method, the basic idea of XGBoost is to combine several weak models into a strong one, which can be presented as

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i)$$

where $f_k(\cdot)$ is a weak model and $K$ the number of weak models.

As a tree booster, the core of XGBoost is Newton boosting, which searches for the optimal parameters by driving the objective function

$$\mathcal{L}=\sum_{i}l(\hat{y}_i,y_i)+\sum_{k}\Omega(f_k)$$

towards its minimum, where $l$ is the loss function and $\Omega$ the regularization term. They measure the performance and control the complexity of the model, respectively.

The ensemble model is trained in an additive manner: $f_t$ is added to improve the model, and the new objective function is formed as

$$\mathcal{L}^{(t)}=\sum_{i=1}^{n}l\big(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\big)+\Omega(f_t)$$

where $\hat{y}_i^{(t-1)}$ is the prediction of the $i$th sample after $t-1$ iterations and $f_t$ the weak model at the $t$th iteration.

Then, a second-order approximation is used to speed up the optimization procedure, which changes the objective function into

$$\mathcal{L}^{(t)}\simeq\sum_{i=1}^{n}\Big[g_i f_t(x_i)+\frac{1}{2}h_i f_t^2(x_i)\Big]+\Omega(f_t)$$

where $g_i$ and $h_i$ are the first- and second-order gradient statistics of the loss function. For a fixed tree structure, the optimal leaf weight $\omega$ and the corresponding optimal splitting point can be found in closed form.
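For completeness, the closed-form solution given in Ref.[25], with $I_j$ the instance set of leaf $j$, $T$ the number of leaves, and $\lambda$, $\gamma$ the regularization weights:

```latex
% Optimal leaf weight and the resulting objective for a fixed tree structure:
\omega_j^{*}=-\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda},
\qquad
\tilde{\mathcal{L}}^{(t)}=-\frac{1}{2}\sum_{j=1}^{T}
  \frac{\big(\sum_{i\in I_j}g_i\big)^{2}}{\sum_{i\in I_j}h_i+\lambda}+\gamma T
```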

Besides the regularized objective, several additional techniques are used to promote classification performance, such as overfitting prevention and computation enhancement. More details can be found in Ref.[25].

Considering the mentioned advancements and excellent performance in applications,XGBoost is adopted for our air traffic complexity evaluation under label noise.
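A minimal sketch of fitting the evaluation model with the xgboost Python package; the hyperparameter values and the variable names (X_clean, y_clean, X_test) are illustrative placeholders, not the tuned settings of our experiments.

```python
from xgboost import XGBClassifier

# Hyperparameter values are illustrative only.
model = XGBClassifier(
    n_estimators=300,      # K: number of boosted trees
    max_depth=6,           # d: maximum tree depth
    learning_rate=0.1,     # shrinkage applied to each new tree
    reg_lambda=1.0,        # L2 regularization on leaf weights (Omega term)
    objective="multi:softprob",
)
model.fit(X_clean, y_clean)          # train on the noise-filtered dataset
pred_levels = model.predict(X_test)  # predicted complexity levels (0..4)
```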

2.4 Integrated model based on confident learning and XGBoost

The label noise solution used in this paper is to filter out the noisy label samples and then train the classification model on the clean dataset. Three problems may remain:

(1) When detecting and removing noisy label samples, correctly labeled samples that are difficult to distinguish may be wrongly deleted.

(2) Removing label noise samples on a large scale severely reduces the training data, which may cause an under-fitting problem.

(3) The imbalance of the original dataset, which is exactly our case, may be intensified: some minority categories may end up with even fewer samples, or disappear entirely, after the removal step.

To deal with these problems, we design a novel framework including the label noise removal strategy and the XGBoost algorithm, as shown in Fig.3, where CL represents the confident learning method used to calculate the noise value according to label probability distributions. Firstly, we adopt several classifiers instead of a single one to acquire different label probability distributions for each sample, so that more general and varied label noise information can be offered to confident learning to detect more extensive label noise. Several label noise sets are thus generated. The next step is to incorporate these label noise sets into an overall set that contains as many label noise samples as possible. Before that, we need to define two indicators, NST and NV, to reflect the noise level. They are defined as follows

Fig.3 Framework of integrated model

$$\mathrm{NST}_i=\sum_{j=1}^{m}\mathrm{NST}_{ij},\qquad \mathrm{NV}_i=\sum_{j=1}^{m}\frac{\mathrm{Len}_j-\mathrm{ID}_{ij}}{\mathrm{Len}_j}$$

where $j$ and $m$ denote the $j$th classifier and the number of classifiers, respectively. $S_j$ represents the label noise set selected by the confident learning method under the corresponding label probability distribution, which is generated by the $j$th classifier. $\mathrm{NST}_{ij}$ indicates whether the $i$th sample is selected in $S_j$, so $\mathrm{NST}_i$ counts the times the $i$th sample is selected. $\mathrm{ID}_{ij}$ denotes the $i$th sample's rank in $S_j$ according to the noise probability, and $\mathrm{Len}_j$ is the length of $S_j$. $\mathrm{NV}_i$, calculated from $\mathrm{ID}_{ij}$ and $\mathrm{Len}_j$, represents the noise level of the $i$th sample.

In the overall label noise set, each sample has its own NST and NV. We remove different ratios of label noise samples from the dataset to obtain datasets of different cleanliness for the XGBoost algorithm, which is robust on weakly noisy datasets. It is worth noting that minority categories should be protected from removal to keep the dataset balanced. Finally, by comparing performance across the datasets of different cleanliness, we can find the optimal label noise removal ratio for XGBoost.
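The sketch below illustrates this integration and removal step under our reading of the definitions above; the NV normalization and the minority class protection are shown in their simplest assumed forms.

```python
import numpy as np

def integrate_noise_sets(noise_sets, n_samples):
    """noise_sets: one array per classifier, each holding sample indices
    ranked most-noisy-first by confident learning."""
    NST = np.zeros(n_samples)   # times selected across classifiers
    NV = np.zeros(n_samples)    # accumulated normalized noise rank
    for S in noise_sets:
        length = len(S)
        for rank, idx in enumerate(S, start=1):
            NST[idx] += 1
            NV[idx] += (length - rank) / length  # assumed normalization
    return NST, NV

def remove_by_ratio(y, NST, NV, ratio, protected_class):
    """Drop the top `ratio` fraction of suspected noise, ranked by (NST, NV),
    while keeping samples of the minority class to preserve balance."""
    order = np.lexsort((NV, NST))[::-1]          # noisiest first
    n_remove = int(ratio * (NST > 0).sum())
    drop = [i for i in order[:n_remove] if y[i] != protected_class]
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return keep
```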

This section discusses the computational complexity. Our proposed integrated method mainly consists of three parts: (1) getting label probability distributions from the classification algorithms, (2) inputting the label probability distributions into the confident learning algorithm to detect label noise samples, and (3) adjusting the removal ratio of label noise samples to acquire optimal performance with XGBoost. Therefore, the computational complexity of our method can be divided into the classification algorithm complexity and the confident learning algorithm complexity. According to the confident learning formulation above and the detailed proof in Ref.[24], the computational complexity of confident learning is $O(c^2+nc)$, where $c$ and $n$ denote the number of classes and samples, respectively. Among the classification algorithms, XGBoost has the greatest computational complexity, $O(Kdmn\log n)$, where $K$, $d$ and $m$ are the number of trees, the depth of the trees and the number of features, respectively. In total, the computational complexity of our method is $O(c^2+nc+Kdmn\log n)$.

3 Experiments and Results

3.1 Dataset and evaluation metrics

All the experiments are executed on real air traffic operation data collected by automatic surveillance devices in the Guangzhou region, China. Each record contains flight callsign, SSR code, longitude, latitude, altitude, speed, aircraft type, etc. The yellow part in Fig.4 is the airspace sector we focus on, which lies on the main air route from Guangzhou to Wuhan. From December 1 to December 15, 2019, we collected 2 769 samples of this sector, each corresponding to a one-minute air traffic scenario. The dataset has the 24 complexity factors shown in Table 1 as its features, and a complexity level (five ordinal levels) obtained from ATM experts as its label. A subset of 200 samples is deliberately set aside as a test set to maintain a unified baseline across experiments. The complexity labels of this test set are treated as clean and are not subjected to the label noise removal process, as they are provided by several reliable ATM experts.

Fig.4 Target airspace sector structure

To verify the performance of the proposed method, we select accuracy, mean absolute error (MAE) and mean absolute error with ordinal penalty (MAE-ordinal) as the evaluation metrics. Accuracy and MAE are defined as

$$\mathrm{Accuracy}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left(\hat{y}_i=y_i\right),\qquad \mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i-y_i\right|$$

where $N$ is the number of test samples, and $y_i$ and $\hat{y}_i$ are the true and predicted complexity levels of the $i$th sample. MAE-ordinal additionally penalizes predictions that deviate by more ordinal levels.
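The two standard metrics as code (MAE-ordinal is omitted, since its exact penalty weighting is specific to this paper):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted complexity level is exact."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def mae(y_true, y_pred):
    """Mean absolute distance between predicted and true ordinal levels."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```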

3.2 Effectiveness verification of label noise removal strategy

We have defined NST and NV to reflect the level of label noise. Label noise samples can be detected by confident learning from each sample's label and class probability distribution. Considering the robustness and diversity of the label noise sets, we carry out several label noise cleansing tests under different label probability distributions, which are generated by several classifiers: support vector machine (SVM), random forest (RF), logistic regression (LR), neural network (NN) and XGBoost (XGB). Then we integrate the filtered label noise sets to form a comprehensive one, shown in Table 2. In Table 2, every row represents a label noise sample and its noise information. Noise sample ID is the index of the corresponding label noise sample in the original dataset.

Table 2 Noise level of each label noise sample

NST and NV of the label noise samples are shown in Fig.5, which demonstrates a strong positive correlation between them: the more frequently a sample is selected as label noise by confident learning, the bigger its noise value, and a sample with a bigger noise value is more likely to be noise. Therefore, to verify the effectiveness of the label noise removal strategy, we first delete the label noise samples (about 621 samples) whose NST equals the number of classifiers, to obtain a clean dataset.

Fig.5 NST and NV of label noise samples

By inputting the original dataset and the cleansed dataset into the classification algorithms, we can observe the effect of the label noise removal strategy. Moreover, we also build another cleansed dataset with the original confident learning method, called original-CL, to compare its performance with our integrated method, called rectified-CL. The above results are shown in Table 3, from which we can conclude that the presence of label noise has a real impact on both accuracy and MAE. The performance of all classification algorithms except LR improves after the label noise removal strategy. Especially for Adaboost ("Ada" for short in Table 3), accuracy increases by almost 12% and MAE drops from 0.475 to 0.300 under rectified-CL. These results show the significant influence of removing label noise samples. On the other hand, our rectified-CL results are generally better than the original-CL results. The optimal performance is obtained by XGBoost, with accuracy up to 80.00% and MAE of 0.242.

Table 3 Performance comparison under different strategies

3.3 Label noise removal ratio

In this section, we study the influence of different removal ratios in detail. Similar to the former section, we use the five classification algorithms, i.e., LR, NN, RF, Adaboost and XGBoost. The parameters of each algorithm are set to their optimal values at each label noise removal ratio. The results are shown in Fig.6, from which we can observe the following:

(1) Different label noise removal ratios give different results. The best result does not lie at the highest removal ratio but in the middle range. This means over-cleansing may decrease classifier performance, because many correct samples may be wrongly removed and the smaller dataset may lead to under-fitting.

(2) At low label noise removal ratios (less than 30%), LR outperforms NN in both accuracy and MAE. When the removal ratio exceeds 30%, LR is surpassed by NN, and the two converge as the ratio increases further. This phenomenon reveals that NN is more easily affected by label noise at high noise levels, which holds in most machine learning problems. For example, Ref.[15] attributed the mediocre performance of NN relative to a linear model to low intra-rater and inter-rater reliability of the human experts, which is exactly the impact of label noise.

(3) Comparing RF with Adaboost, the performance of Adaboost is extremely poor at first, under a large number of label noise samples, but rises rapidly as more of them are removed. Their performance becomes similar once the removal ratio exceeds 60%. Bagging algorithms usually outperform boosting algorithms under label noise, because boosting puts more weight on misclassified samples, which here are actually noisy, inducing worse performance.

(4) In general, when the label noise removal ratio is less than 60%, XGBoost and RF both show an obvious and stable advantage in air traffic complexity evaluation. The optimal result, with an accuracy of 81.67% and an MAE of 0.233, is achieved by XGBoost at a removal ratio of 40%. We can conclude that combining an excellent algorithm with an appropriate label noise removal strategy can achieve better results.

To observe the ultimate performance of the classifiers, we list the optimal results in Table 4. XGBoost with a label noise removal ratio of 40% attains the best accuracy and MAE, whereas its optimal MAE-ordinal is achieved at a removal ratio of 10%. Similarly, the other algorithms reach their optima at different removal ratios. This reminds us that classifiers respond differently to label noise, so it is almost impossible to find a single removal ratio suitable for all classifiers or evaluation metrics. Therefore, the label noise removal ratio should be treated as an adjustable parameter in future air traffic complexity evaluation, tuned towards the performance we seek.

Table 4 Optimal performance in different classifiers
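A sketch of treating the removal ratio as the adjustable parameter suggested above; it reuses the illustrative helpers from the earlier sketches and a hypothetical protected minority class index.

```python
import numpy as np
from xgboost import XGBClassifier

# Sweep the label noise removal ratio and keep the best-performing model.
# remove_by_ratio, accuracy and mae are the illustrative helpers above.
best = {"ratio": None, "acc": -1.0, "mae": None}
for ratio in np.arange(0.0, 0.9, 0.1):
    keep = remove_by_ratio(y_train, NST, NV, ratio, protected_class=4)
    model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    model.fit(X_train[keep], y_train[keep])
    y_pred = model.predict(X_test)
    acc, err = accuracy(y_test, y_pred), mae(y_test, y_pred)
    if acc > best["acc"]:
        best = {"ratio": ratio, "acc": acc, "mae": err}
print(best)  # the optimal ratio is dataset- and metric-dependent (cf. Table 4)
```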

4 Conclusions

In this paper, we consider, for the first time, the label noise problem in air traffic complexity evaluation and propose a confident learning and XGBoost-based method to evaluate air traffic complexity under label noise. In the label noise cleansing process, noisy samples are filtered out given the labels and their probability distributions. To capture more label noise information, we compute several label probability distributions using different classification algorithms and incorporate the results into an overall label noise set. We define two indicators, NST and NV, to reflect the noise level of each sample. Label noise samples are then removed at different ratios according to their noise level to obtain datasets of varying cleanliness. Finally, we run classifiers on these datasets to find the best performance. The experimental results verify the effectiveness of the label noise removal strategy, and an accuracy of 81.67% (MAE of 0.233) is achieved by XGBoost at a label noise removal ratio of 40%.

The proposed method can support airspace sector partition, dynamic airspace configuration, air traffic flow management, etc. In future work, we will construct more complexity related features for describing air traffic complexity and carry out suitable feature selection to eliminate redundant features and achieve better results.
