
Medical Feature Selection Approach Based on Generalized Normal Distribution Algorithm

Computers, Materials & Continua, 2021, Issue 12

Mohamed Abdel-Basset, Reda Mohamed, Ripon K. Chakrabortty, Michael J. Ryan, Yunyoung Nam and Mohamed Abouhawwash

1 Faculty of Computers and Informatics, Zagazig University, Zagazig, 44519, Egypt

2 Capability Systems Centre, School of Engineering and IT, UNSW, Canberra, Australia

3 Department of Computer Science and Engineering, Soonchunhyang University, Asan, 31538, Korea

4 Department of Mathematics, Faculty of Science, Mansoura University, Mansoura, 35516, Egypt

5 Department of Computational Mathematics, Science, and Engineering (CMSE), Michigan State University, East Lansing, MI 48824, USA

Abstract: This paper proposes a new pre-processing technique, commonly known as feature selection (FS), to separate the most effective features from those that might deteriorate the performance of machine learning classifiers, in terms of computational cost and classification accuracy, because of their irrelevance, redundancy, or low information content. This technique adopts a new optimization algorithm known as generalized normal distribution optimization (GNDO), supported by the arctangent transfer function, which converts the continuous values produced by GNDO into binary ones. Further, a novel restarting strategy (RS) is proposed to preserve the diversity among the solutions within the population by identifying the solutions that exceed a specific distance from the best-so-far solution and replacing them with new solutions created using an effective updating scheme. This strategy is integrated with GNDO to propose a second binary variant, improved GNDO (IGNDO), which has a high ability to preserve the diversity of the solutions, thereby avoiding becoming stuck in local minima and accelerating convergence. The proposed GNDO and IGNDO algorithms are extensively compared with seven state-of-the-art algorithms to verify their performance on thirteen medical instances taken from the UCI repository. IGNDO is shown to be superior in terms of fitness value and classification accuracy, and competitive with the others in terms of the number of selected features. Since the principal goal in solving the FS problem is to find the subset of features that maximizes classification accuracy, IGNDO is considered the best.

Keywords: Generalized normal distribution optimization; feature selection; transfer function; novel restarting strategy; UCI repository

1 Introduction

In the last few years, the dimensionality of problem features has grown significantly, which has negatively affected several fields such as data mining, data science, and big data. High-dimensional features cause a number of problems during analysis, including complexity, dimensionality, and sparsity [1], in addition to reducing classification accuracy and increasing the cost of machine learning techniques. Researchers therefore seek a feature-selection technique that is able to select the most effective subset of features, maximizing classification accuracy while minimizing expensive computation [2]. Of the feature selection approaches available, filter-based approaches evaluate the features independently of any classifier but suffer from the issues of local minima and computation time. On the other hand, wrapper-based approaches have received significant interest from researchers due to their better classification quality, which results from using a machine learning technique to evaluate the quality of the extracted features.

In a wrapper-based approach, evolutionary and meta-heuristic algorithms are used as optimizers to select the best subset of features. It is also worth mentioning that there are several traditional techniques for tackling this problem, including greedy search and random search, but those techniques suffer from becoming stuck in local minima and have expensive computational costs, so researchers have moved toward evolutionary and meta-heuristic algorithms. In particular, meta-heuristic algorithms have attracted significant attention due to their strengths in achieving better outcomes in a reasonable time for several optimization problems [3–8]. The remainder of this section reviews the major published meta-heuristic and evolutionary feature selection techniques.

Sihwail et al. [9] proposed a new feature selection technique based on an improved Harris hawks optimization algorithm (IHHO) that uses an opposition-based mechanism and a new search mechanism. IHHO can avoid local minima and subsequently improve the quality of the solutions by accelerating convergence, providing performance superior to the standard HHO. IHHO was verified on 20 datasets and was shown to be better, in terms of fitness values, accuracy, and extracted features, than a number of other optimizers: the grasshopper optimization algorithm (GOA), whale optimization algorithm (WOA), genetic algorithm (GA), particle swarm optimization (PSO), butterfly optimization algorithm (BOA), ant lion optimizer (ALO), and slime mould algorithm (SMA).

Bhattacharyya et al. [10] combined the mayfly algorithm with harmony search for FS, in an approach called mayfly-harmony search (MA-HS). MA-HS improves exploitation capability by avoiding local minima to achieve better outcomes, and was shown to outperform 12 optimization techniques on 18 UCI datasets. A binary sailfish (BSF) optimizer improved using the β-hill climbing (AβHC) meta-heuristic algorithm was proposed by Ghosh et al. [11], who used 18 UCI datasets to verify its performance, showing it to be superior to 10 state-of-the-art algorithms.

In [12], the quantum whale optimization algorithm (QWOA) was recently proposed for the FS problem. QWOA was compared with 8 well-known optimization algorithms to assess its effectiveness in extracting the best subset of features for 14 datasets from diversified domains. Furthermore, Abdel-Basset et al. [13] proposed a binary version of the improved HHO (HHOSA) for FS, which combined the bitwise operator and simulated annealing with HHO to avoid local minima, improving the quality of the solutions. HHOSA was verified using 43 instances and was shown to be superior to some well-established algorithms.

A new variant of the grey wolf optimizer (GWO) hybridized with a two-phase mutation (TMGWO) was proposed by Abdel-Basset et al. [14] to improve the exploitation capability of GWO. TMGWO was shown to be superior to a number of well-known optimization algorithms such as the flower algorithm, particle swarm optimization algorithm, multi-verse optimizer, whale optimization algorithm, and bat algorithm.

Several other meta-heuristic algorithms have recently been adapted for the FS problem, such as the binary dragonfly algorithm [15], brainstorm optimization [16], whale optimization algorithm [17], genetic algorithm [18], enhanced Harris hawks optimization [19], harmony search and naked mole-rat algorithms [20], cosine similarity-based harmony search algorithm [21], modified harmony search algorithm [22], and several others [2].

Recently, a new optimization algorithm was proposed for tackling the parameter estimation problem of solar cell and photovoltaic module models [23]. This algorithm was called the generalized normal distribution optimization algorithm (GNDO) since it was inspired by normal distribution theory. GNDO can estimate the parameter values that minimize the sum of squared errors between the measured and estimated I-V data, showing its effectiveness in avoiding local minima. To the best of our knowledge, the effectiveness of this algorithm in tackling binary problems such as knapsack and FS is not yet known. Therefore, in this paper, two variants of GNDO are proposed for FS. The first variant applies the standard GNDO transformed using the arctangent transfer function; the second adds a novel restarting strategy (RS) to preserve the diversity among the members of the population. RS searches for the solutions that exceed a specific critical distance from the best-so-far solution; a novel updating scheme is then used to update those solutions, preserving diversity while improving the quality of the solutions. This RS is integrated with the standard GNDO to propose a new binary variant called IGNDO. The proposed GNDO and IGNDO algorithms are experimentally verified using 13 UCI instances and compared with seven well-known, recently published binary optimization algorithms, namely TMGWO [14], the non-linear particle swarm optimization algorithm (NLPSO) [24], WOA [25], the marine predators algorithm (MPA) [26], the equilibrium optimizer (EO) [27], the binary slime mould algorithm integrated with a novel attacking-feeding strategy (FMBSMA) [2], and HHOSA [13]. The experimental outcomes confirm the superiority of the proposed IGNDO and GNDO algorithms in terms of classification accuracy and fitness values, and show them to be competitive in terms of the number of selected features. Further, IGNDO outperforms GNDO in terms of the number of selected features, classification accuracy, and average fitness value.

The remainder of this paper is arranged as follows. Section 2 describes the methods used and the proposed algorithm; Section 3 describes the experimental setup; Section 4 presents results and discussion; and Section 5 draws conclusions and introduces intended future work.

2 Methods and Proposed Algorithm

2.1 Generalized Normal Distribution Optimization

A novel optimization algorithm called generalized normal distribution optimization (GNDO) [23] was recently proposed for tackling nonlinear optimization problems, specifically solar cell parameter estimation. This algorithm was inspired by normal distribution theory and is based on the two main stages of optimization methods: exploration and exploitation. In the exploration stage, the algorithm explores the search space to find the most promising region, which might contain the optimal solution. The exploitation stage then focuses on this region to reach the optimal solution. In the remainder of this section, these two stages of GNDO are explained in detail.

2.1.1 Local Exploitation

The local exploitation stage considers the mean μ_i of three selected solutions—the best-so-far solution X*, the position vector of the i-th solution X_i^t, and the mean M of all solutions calculated using Eq. (3)—as the promising region in the current generation t; μ_i is mathematically calculated by Eq. (2). The stage then searches around this promising region, using a step size generated by Eq. (4), to generate a new trial solution T_i^t as described in Eq. (1), which might be better than the current solution. This trial solution is compared with the current one and, if better, is used in the next generation.
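Following the GNDO formulation in [23] and using this paper's notation, Eqs. (1)–(4) can be written as:

$$T_i^t=\mu_i+\delta_i\times\eta \tag{1}$$

$$\mu_i=\frac{1}{3}\left(X_i^t+X^*+M\right) \tag{2}$$

$$M=\frac{1}{N}\sum_{i=1}^{N}X_i^t \tag{3}$$

$$\delta_i=\sqrt{\frac{1}{3}\left[\left(X_i^t-\mu_i\right)^2+\left(X^*-\mu_i\right)^2+\left(M-\mu_i\right)^2\right]} \tag{4}$$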

N is the population size, and η, which is mathematically modeled in Eq. (5), is the penalty factor.
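Following [23], the penalty factor takes the form:

$$\eta=\begin{cases}\sqrt{-\log r_1}\times\cos\left(2\pi r_2\right), & \text{if } r_3\le r_4\\ \sqrt{-\log r_1}\times\cos\left(2\pi r_2+\pi\right), & \text{otherwise}\end{cases} \tag{5}$$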

r1, r2, r3, and r4 are four numbers randomly created between 0 and 1.

2.1.2 Global Exploration

In the global exploration phase, the search space of the optimization problem is explored intensively to identify the most promising region, which might contain the optimal solution. This phase is mathematically formulated as follows:
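Following [23], the exploration movement and its two trail vectors v1 and v2 can be written as:

$$T_i^t=X_i^t+\beta\times\left(\left|\lambda_3\right|\times v_1\right)+\left(1-\beta\right)\times\left(\left|\lambda_4\right|\times v_2\right) \tag{6}$$

$$v_1=\begin{cases}X_i^t-X_{p1}^t, & \text{if } f\left(X_i^t\right)<f\left(X_{p1}^t\right)\\ X_{p1}^t-X_i^t, & \text{otherwise}\end{cases} \tag{7}$$

$$v_2=\begin{cases}X_{p2}^t-X_{p3}^t, & \text{if } f\left(X_{p2}^t\right)<f\left(X_{p3}^t\right)\\ X_{p3}^t-X_{p2}^t, & \text{otherwise}\end{cases} \tag{8}$$

where β is an adjustment parameter generated randomly within [0, 1], and λ3 and λ4 are random numbers drawn from the standard normal distribution.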

p1, p2, and p3 are indices picked randomly from the solutions, such that p1 ≠ p2 ≠ p3 ≠ i. Switching between exploration and exploitation in GNDO is undertaken randomly.

2.2 Proposed Algorithm: Improved GNDO (IGNDO)

This section describes a novel restarting strategy that is used to improve the performance of GNDO in estimating the subset of features that might maximize the classification accuracy of the machine learning technique; the resulting algorithm is called improved GNDO (IGNDO).

2.2.1 Initialization

To start, N solutions are created, each with d dimensions, where d is the number of features. Each solution is initialized randomly with binary values: 0 to indicate an unselected feature and 1 a selected one. Those initialized solutions are then evaluated as explained in the next section.
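To illustrate this step, a minimal sketch follows (the helper name and the NumPy-based code are ours; the authors' implementation is in Java):

```python
import numpy as np

def init_population(n_solutions, n_features, rng):
    """Create N random binary solutions: 1 = feature selected, 0 = unselected."""
    return rng.integers(0, 2, size=(n_solutions, n_features))

pop = init_population(20, 30, np.random.default_rng(42))  # e.g., N = 20, d = 30
```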

2.2.2 Evaluation

The solutions of the FS problem are evaluated using two objectives: the number of selected features, and the classification accuracy based on those selected features. In [28], an objective function was proposed to relate those two conflicting objectives through a weighting variable α, a value between 0 and 1 that may favor one objective at the expense of the other according to the needs of the decision makers. In this problem, the main objective is to maximize the classification accuracy, even if the number of features remains high. Therefore, the weighting variable is moved toward maximizing the classification accuracy of the selected features. This function is mathematically formulated as follows:
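A reconstruction of this objective, in the standard form proposed in [28]:

$$\text{fitness}=\alpha\,\gamma_R(D)+\left(1-\alpha\right)\frac{\left|S\right|}{\left|L\right|} \tag{9}$$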

γ_R(D) indicates the classification error rate obtained with the selected features used to train the k-nearest neighbor (KNN) classifier, |S| is the number of selected features, and |L| is the total number of features in the studied instance. In our work, each dataset is divided into two parts based on the holdout method [29]: the first part, used as the training dataset, represents 80% of the original dataset, while the remainder is used as the test dataset.
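A minimal sketch of this evaluation, assuming scikit-learn's KNN with k = 5 (the paper does not state k) and a precomputed 80/20 holdout split; the function name is ours:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fs_fitness(solution, X_train, y_train, X_test, y_test, alpha=0.99):
    """Eq. (9): alpha * error rate + (1 - alpha) * |S| / |L|."""
    selected = np.asarray(solution, dtype=bool)
    if not selected.any():
        return 1.0  # no feature selected: worst possible fitness
    knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is our assumption
    knn.fit(X_train[:, selected], y_train)
    error = 1.0 - knn.score(X_test[:, selected], y_test)
    return alpha * error + (1.0 - alpha) * selected.sum() / selected.size
```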

2.2.3 Transfer Function

Unfortunately, the solutions created by GNDO are continuous, not binary, which means that they cannot be used directly as solutions to this problem. Consequently, a transfer function of the V-shaped family, namely arctan, described in Eq. (10), is used to normalize the continuous values between 0 and 1, and Eq. (11) is then used to convert those values to binary values.
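A common form of the V-shaped arctan transfer function is T(x) = |(2/π) arctan((π/2)x)|. The sketch below assumes this form and a stochastic thresholding rule for the binary conversion; both are assumptions, since the exact equations may differ:

```python
import numpy as np

def arctan_transfer(x):
    # V-shaped arctan transfer: maps any real value into [0, 1).
    return np.abs((2.0 / np.pi) * np.arctan((np.pi / 2.0) * x))

def binarize(x, rng):
    # Assumed thresholding rule: a bit becomes 1 when the transfer value
    # exceeds a uniform random draw, and 0 otherwise.
    return (arctan_transfer(x) > rng.random(x.shape)).astype(int)
```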

2.2.4 A Novel Restarting Strategy

In this section, a new strategy, known as the restarting strategy (RS), is proposed to avoid the local minima that affect several optimization algorithms. This strategy calculates the normalized distance between the fitness of the current solution and that of the best-so-far solution; if this distance exceeds a specific limit, the solution is restarted within the search space of the problem using a novel updating scheme. At the start, this strategy calculates the distance between the fitness of the current solution X_i^t and the best-so-far one X* as follows:

where Δ is the normalized distance. Then, if Δ is greater than a predefined threshold ρ, X_i^t will be updated using the following scheme, estimated based on our experiments:

where r1 is a random number created based on the uniform distribution, and l1 and l2 are two real values generated using the Lévy-flight strategy. Finally, this novel strategy is integrated with GNDO to propose a new variant, namely improved GNDO (IGNDO), to solve the FS problem. The steps of IGNDO are presented in Algorithm 1.
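How l1 and l2 are sampled is not specified here; a common choice is Mantegna's algorithm for Lévy-distributed steps, sketched below with the customary exponent β = 1.5:

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(beta=1.5, rng=None):
    """Draw one Lévy-flight step via Mantegna's algorithm (assumed generator)."""
    rng = rng or np.random.default_rng()
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma)  # numerator sample, scaled normal
    v = rng.normal(0.0, 1.0)    # denominator sample, standard normal
    return u / abs(v) ** (1 / beta)
```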

Algorithm 1: IGNDO
Input: N, ρ, tmax (maximum number of function evaluations)
Output: return X*
1.  t = 0
2.  Initialization phase
3.  While t < tmax
4.    For i = 1 : N
5.      Create a random number α within [0, 1]
6.      Create a random number α1 within [0, 1]
7.      If α > α1
8.        Calculate M using Eq. (3)
9.        Compute δi, μi, and η
10.       Compute Ti^t using Eq. (1)
11.       If f(Ti^t) < f(Xi^t)
12.         Xi^t = Ti^t
13.       End If
14.     Else
15.       Compute Ti^t according to Eq. (6)
16.       If f(Ti^t) < f(Xi^t)
17.         Xi^t = Ti^t
18.       End If
19.     End If
20.     t++
21.     Calculate Δ using Eq. (13)
22.     If Δ > ρ
23.       Update Xi^t using Eq. (14)
24.     End If
25.   End For
26. End While
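To make the control flow of Algorithm 1 concrete, the following illustrative Python skeleton ties together the sketches above (init_population, binarize) and inlines the exploitation and exploration moves of Eqs. (1)–(8). The normalized-distance test and the restart update are placeholders, since Eqs. (13) and (14) are not reproduced exactly; this is a sketch, not the authors' Java implementation:

```python
import numpy as np

def gndo_exploit(pop, i, x_best, rng):
    """Local exploitation, Eqs. (1)-(5)."""
    M = pop.mean(axis=0)                                   # Eq. (3)
    mu = (pop[i] + x_best + M) / 3.0                       # Eq. (2)
    delta = np.sqrt(((pop[i] - mu) ** 2 + (x_best - mu) ** 2
                     + (M - mu) ** 2) / 3.0)               # Eq. (4)
    r1, r2, r3, r4 = rng.random(4)
    angle = 2 * np.pi * r2 + (0.0 if r3 <= r4 else np.pi)
    eta = np.sqrt(-np.log(r1)) * np.cos(angle)             # Eq. (5)
    return mu + delta * eta                                # Eq. (1)

def gndo_explore(pop, fit, i, rng):
    """Global exploration, Eqs. (6)-(8)."""
    p1, p2, p3 = rng.choice([j for j in range(len(pop)) if j != i],
                            size=3, replace=False)
    v1 = pop[i] - pop[p1] if fit[i] < fit[p1] else pop[p1] - pop[i]       # Eq. (7)
    v2 = pop[p2] - pop[p3] if fit[p2] < fit[p3] else pop[p3] - pop[p2]    # Eq. (8)
    beta = rng.random()
    l3, l4 = rng.normal(size=2)
    return pop[i] + beta * abs(l3) * v1 + (1 - beta) * abs(l4) * v2       # Eq. (6)

def igndo(fitness_fn, n_features, n_solutions=20, t_max=1000, rho=0.9999, seed=0):
    """Illustrative IGNDO skeleton following Algorithm 1."""
    rng = np.random.default_rng(seed)
    pop = init_population(n_solutions, n_features, rng)   # Section 2.2.1 sketch
    fit = np.array([fitness_fn(s) for s in pop], dtype=float)
    best = int(fit.argmin())
    x_best, f_best = pop[best].copy(), float(fit[best])
    t = n_solutions                                       # initial evaluations
    while t < t_max:
        for i in range(n_solutions):
            if rng.random() > rng.random():               # alpha > alpha1
                trial = gndo_exploit(pop, i, x_best, rng)
            else:
                trial = gndo_explore(pop, fit, i, rng)
            trial = binarize(trial, rng)                  # Section 2.2.3 sketch
            f_trial = fitness_fn(trial)
            t += 1
            if f_trial < fit[i]:                          # greedy selection
                pop[i], fit[i] = trial, f_trial
            if fit[i] < f_best:
                x_best, f_best = pop[i].copy(), float(fit[i])
            # Restarting strategy: placeholders for Eqs. (13)-(14)
            dist = abs(fit[i] - f_best) / (abs(f_best) + 1e-12)
            if dist > rho:
                pop[i] = init_population(1, n_features, rng)[0]
                fit[i] = fitness_fn(pop[i])
                t += 1
    return x_best, f_best
```

In practice, fitness_fn would be a closure over fs_fitness from Section 2.2.2 with a fixed train/test split.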

3 Experiments and Discussion

The performance of our proposed algorithm was verified on 13 instances with various feature lengths selected from the UCI machine learning repository [30]. These datasets are briefly described in Tab. 1, which has five columns: record ID (ID), instance, feature length (F), number of classes (C), and number of samples (S).

Table 1: Dataset description

In addition, under the same environment settings, the proposed algorithm is compared with eight robust optimization algorithms—TMGWO [14], NLPSO [24], WOA [25], the marine predators algorithm (MPA) [26], the equilibrium optimizer (EO) [27], HHOSA [13], FMBSMA [2], and GNDO—implemented in the Java programming language with the same parameter values cited in the original papers. However, IGNDO has one parameter, ρ, that needs to be estimated accurately to maximize its performance. Therefore, extensive experiments were conducted with various values for this parameter, which show that the performance of this algorithm is maximized when ρ = 0.9999. In our experiments, α is set to a value of 0.99 to pay more attention to classification accuracy. All algorithms were evaluated with the same number of function evaluations, population size, and number of runs, which were 1000, 20, and 30, respectively.

The classification accuracy (ACC), fitness value (FV), and selected feature number (SFN) were used as performance metrics to evaluate the performance of the algorithms under various statistical analyses: average (Avg), standard deviation (SD), and boxplot.

4 Results and Discussion

In this section, the proposed IGNDO and GNDO algorithms are compared with the others in terms of the averages of FV, ACC, and SFN, and the average SD of each performance metric over 30 independent trials. After running each algorithm for 30 independent trials, the averages of FV, ACC, and SFN were calculated and are presented in Fig. 1, which shows the superiority of the proposed IGNDO and GNDO algorithms over the others in terms of FV and ACC, whereby IGNDO occupies the first rank with values of 0.095 and 0.908, respectively. Unfortunately, IGNDO is in fourth rank, after HHOSA, MPA, GNDO, and FMBSMA, in terms of SFN. However, the main objective in solving the FS problem is to find the subset of features that maximizes classification accuracy with as few features as possible. Since IGNDO and GNDO outperform the other techniques in terms of classification accuracy, they are deemed to be the best.

In terms of the stability of the algorithms, Fig. 2 shows the average SD of the FV, ACC, and SFN values over 30 independent runs, from which it is clear that IGNDO is the best in terms of SD for FV and ACC, while HHOSA is the best in terms of SD for the SFN values.

Figure 1: Comparison among algorithms in terms of average of FV, ACC, and SFN

Figure 2: Comparison among algorithms in terms of average SD for FV, ACC, and SFN values

Figs. 3–8 show the boxplots of the fitness values over 30 independent runs for the six instances with the highest dimensions: ID#1, ID#2, ID#3, ID#9, ID#12, and ID#13. The figures show that the proposed algorithms are better in terms of the worst, mean, and SD values in comparison with the other algorithms in most test cases. However, IGNDO does not outperform GNDO in terms of the best values for ID#1 and ID#2. Furthermore, in terms of classification accuracy, Figs. 9–14 present the values of this performance metric obtained by the compared algorithms over 30 independent runs. It is clear that IGNDO is more stable than the others, in addition to its significant ability to reach better values for the mean and the minimum in most of the test cases depicted in Figs. 9–14. This stability and superiority of IGNDO in most test cases are due to the novel restarting strategy, which enables the proposed algorithm to avoid local minima.

Figure 3: Boxplot for ID#1 in terms of fitness values

Figure 4: Boxplot for ID#2 in terms of fitness values

Figure 5: Boxplot for ID#3 in terms of fitness values

Figure 6: Boxplot for ID#9 in terms of fitness values

Figure 7: Boxplot for ID#12 in terms of fitness values

Figure 8: Boxplot for ID#13 in terms of fitness values

Figure 9: Boxplot for ID#1 in terms of accuracy values

Figure 10: Boxplot for ID#2 in terms of accuracy values

Figure 11: Boxplot for ID#3 in terms of accuracy values

Figure 12: Boxplot for ID#9 in terms of accuracy values

Figure 13: Boxplot for ID#12 in terms of accuracy values

Figure 14: Boxplot for ID#13 in terms of accuracy values

Figure 15: Convergence curve for IGNDO and GNDO on ID#1

In Figs. 15–24, IGNDO and GNDO are compared with each other in terms of convergence speed toward the near-optimal solution. These figures show that IGNDO is faster than GNDO in reaching a lower fitness value in all observed test instances. This superiority is due to the ability of the proposed strategy to replace unbeneficial solutions with others, exploring regions within the search space of the problem that cannot be reached by the standard GNDO.

Figure 16: Convergence curve for IGNDO and GNDO on ID#2

Figure 17: Convergence curve for IGNDO and GNDO on ID#3

Figure 18: Convergence curve for IGNDO and GNDO on ID#4

Figure 20: Convergence curve for IGNDO and GNDO on ID#7

Figure 21: Convergence curve for IGNDO and GNDO on ID#8

Figure 22: Convergence curve for IGNDO and GNDO on ID#10

Figure 23: Convergence curve for IGNDO and GNDO on ID#12

Figure 24: Convergence curve for IGNDO and GNDO on ID#13

Fig. 25 shows the computational cost (in milliseconds) of the various algorithms for extracting the optimal features of ID#12. From this figure, it is obvious that the computational times of IGNDO and GNDO are almost equal and are superior to those of all the other algorithms. Consequently, our proposed algorithms are the best in terms of classification accuracy, computational cost, and fitness values. Although some of the other algorithms are better in terms of the number of selected features, they are worse in classification accuracy. Since the main objective of machine learning techniques is better classification accuracy regardless of the training time, our proposed IGNDO and GNDO algorithms are strong alternatives to existing techniques.

Figure 25: Computational cost of the algorithms on ID#12

5 Conclusion and Future Work

Generalized normal distribution optimization (GNDO) is a recent optimization algorithm with high performance for accurate and efficient estimation of the unknown parameters of the single-diode and double-diode models of photovoltaic systems. This significantly better performance motivated us to propose a binary variant for tackling the FS problem, to find the subset of features that maximizes classification accuracy and minimizes the computational cost of machine learning techniques. The arctangent transfer function is used to transform the continuous values produced by GNDO into binary values relevant to solving the FS problem. Furthermore, a novel restarting strategy (RS) is proposed in this paper to re-initialize the solutions that exceed a specific distance from the best-so-far solution, in an attempt to preserve the diversity of the solutions, avoid local minima, and accelerate convergence. In addition, a new binary variant of GNDO improved using RS, called IGNDO, is proposed for FS. IGNDO and GNDO are validated on thirteen instances taken from the UCI repository and compared with seven state-of-the-art feature selection techniques. IGNDO is shown to be superior in terms of classification accuracy and fitness value, and is competitive for the number of selected features.

Our future work includes testing the performance of this novel strategy with some of the state-of-the-art algorithms in an attempt to identify better solutions for the FS problem.

Ethical Approval:This article does not contain any studies with human participants or animals performed by any of the authors.

Funding Statement: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2021R1A2C1010362) and the Soonchunhyang University Research Fund.

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
