Xing Zhu, Jin Chu, Kngd Wng, Shifn Wu, Wei Yn, Kiefer Chim
a School of Civil and Environment Engineering, Nanyang Technological University, 618798, Singapore
b State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu, 610059, China
c Building and Construction Authority, 200 Braddell Road, 579700, Singapore
Keywords: Rockhead; Machine learning (ML); Probabilistic model; Gradient boosting
ABSTRACT The spatial information of rockhead is crucial for the design and construction of tunneling or underground excavation. Although conventional site investigation methods (i.e. borehole drilling) can provide local engineering geological information, accurate prediction of the rockhead position from limited borehole data remains challenging because of its spatial variation and the large uncertainties involved. With the development of computer science, machine learning (ML) has proved to be a promising way to avoid subjective human judgment and to establish complex relationships from large datasets automatically. However, few studies have reported the adoption of ML models for predicting the rockhead position. In this paper, we propose a robust probabilistic ML model for predicting the rockhead distribution using spatial geographic information. The natural gradient boosting (NGBoost) algorithm is combined with extreme gradient boosting (XGBoost), which is used as the base learner. The XGBoost model was also compared with other ML models, namely the gradient boosting regression tree (GBRT), the light gradient boosting machine (LightGBM), multivariate linear regression (MLR), the artificial neural network (ANN), and the support vector machine (SVM). The results demonstrate that the XGBoost algorithm, the core algorithm of the probabilistic N-XGBoost model, outperformed the other conventional ML models, with a coefficient of determination (R²) of 0.89 and a root mean squared error (RMSE) of 5.8 m for the prediction of rockhead position based on limited borehole data. The probabilistic N-XGBoost model not only achieved a higher prediction accuracy but also provided an estimate of the predictive uncertainty. Thus, the proposed N-XGBoost probabilistic model has the potential to be used as a reliable and effective ML algorithm for the prediction of rockhead position in rock and geotechnical engineering.
Rockhead or depth to bedrock (DTB) refers to the interface between soil (or completely weathered rock) and fresh rock. DTB is a critical design parameter for tunneling or underground construction (e.g. Cremasco, 2013; Wei et al., 2017; Cho et al., 2019; Du et al., 2019; Zhang et al., 2020a, b). A reliable mapping of the rockhead position could help reduce construction risks or project cost (Wei et al., 2017). The DTB is conventionally identified by borehole data or geophysical methods such as ground-penetrating radar (GPR), reflection seismology, and electrical resistivity (Adepelumi and Fayemi, 2012; Yu and Xu, 2015; Nath et al., 2018; Pan et al., 2018; Du et al., 2019; Moon et al., 2019; Bačić et al., 2020; Bressan et al., 2020). However, it is expensive and labor-intensive to drill many boreholes for DTB determination. On the other hand, the estimation of the DTB between boreholes may be less reliable if the number of boreholes is insufficient or the spacing between the boreholes is too large (Nath et al., 2018).

Fig.1. Boreholes along MRT project line and two-dimensional (2D) geological map of Singapore.
In Singapore, a large amount of borehole data has been collected and integrated into a three-dimensional (3D) geological model for cost-effective future urban planning (Pan et al., 2018). The geological strata are usually interpreted using commercial software based on the limited borehole information. The geological condition between boreholes is roughly estimated by the Kriging interpolation method (Zhu et al., 2012; Themistocleous et al., 2016; Pan et al., 2018). Although the Kriging interpolation method is widely used, its performance falls below expectations when the dataset is nonlinear or sparse (Qi et al., 2020a). Meanwhile, in complicated cases, the results interpreted by this approach can conflict significantly with engineers' geological knowledge of the site. Unlike geological strata, the rockhead is normally distributed within the same formation. Because of weathering and other geological complications, the prediction of rockhead through Kriging interpolation based on limited borehole data is still challenging.
With the rapid development of artificial intelligence (AI), machine learning (ML) can provide a promising and effective way to deal with challenges in engineering prediction (Dixit et al., 2020; Fuentes et al., 2020; Huang et al., 2020; Zhang et al., 2021a, b; Zhao et al., 2021). A good ML system can reduce manpower costs and provide an accurate reference for decision-making by learning the inherent patterns in a large dataset. For instance, a general regression neural network was introduced to represent the spatial distribution of soil types using borehole data (Zhou et al., 2018). The method is able to predict a simple soil distribution in an area of 72 m × 40 m using only spatial coordinates. The support vector machine (SVM) method has also been applied to interpret sparse geological information (Smirnoff et al., 2008), where the task is regarded as a pure classification problem and a cross-validation procedure is conducted to describe findings from different training sets. In this case, SVM can be considered a novel learning method for treating small data samples, especially when borehole and cross-section data are limited. Wei et al. (2017) built a global spatial bedrock prediction model based on the random forest and gradient boosting tree algorithms, but the sparse data across global areas strongly affected the precision of the proposed models in a local region. Qi et al. (2020b) employed polynomial regression, spline interpolation, one-dimensional (1D) spline regression, and Bayesian-based conditional random field algorithms to spatially predict the soil-rock interfaces. The results indicated that the spline regression method outperformed the other three algorithms. Nevertheless, the prediction was based merely on statistical methods, and attributes (e.g. location, topographical features) were not considered in the modeling process.
Quantile regression forests (QRF) were used by Chen et al. (2020) in a spatial model to predict the soil thickness of loess deposits in central France, but the prediction accuracy was poor, with a mean coefficient of determination R² of only 0.33. Overall, most of the existing studies adopted only geostatistical methods or a single ML regressor as their core algorithm, and the prediction accuracy of those models still has room for improvement.
Chen and Guestrin (2016) proposed the extreme gradient boosting (XGBoost) model, a powerful and scalable tree boosting ML framework with a sparsity-aware algorithm for sparse data. XGBoost implements the gradient boosted decision tree architecture efficiently and can yield high accuracy in both classification and regression tasks. It has been applied in disease prediction (Budholiya et al., 2020; Davagdorj et al., 2020), gene expression prediction (Li et al., 2019), casualty prediction for terrorist attacks (Feng et al., 2020), industrial prediction (Zheng and Wu, 2019), and construction engineering (Zhao et al., 2019; Duan et al., 2020a; Zhang et al., 2020a, c, 2021a, b). However, the application of XGBoost to rockhead prediction with limited and sparse borehole data has not been reported so far.
Motivated by the increasing demand for underground development in Singapore, this study proposes a hybrid ML framework named N-XGBoost, based on the XGBoost and natural gradient boosting (NGBoost) methods, to improve the predictive accuracy of DTB. In this framework, XGBoost serves as the base learner of the NGBoost algorithm. Borehole data and local terrain parameters from a tunneling project in Singapore were chosen as the data source to train, validate, and evaluate the proposed ML framework. Other existing ML algorithms, namely multivariate linear regression (MLR), artificial neural network (ANN), SVM, gradient boosting regression tree (GBRT), and light gradient boosting machine (LightGBM), were also evaluated on the same dataset for comparison. The main contributions of this study are: (1) the proposal and first application of an XGBoost-based hybrid ML model for the accurate prediction of bedrock depth with a limited borehole dataset, and (2) the ability of the proposed model to provide not only an accurate point prediction but also an estimate of the predictive uncertainty for reliable decision-making.
Understanding the geological formation is necessary for the construction and evaluation of the proposed ML model. On a regional scale, Singapore and its several smaller islands lie at the southern extension of the Malaysian Peninsula, with a total land area of about 650 km² (Sharma et al., 1999; Qi et al., 2020a). As shown in Fig. 1, the geological formation of Singapore contains three main parts: sedimentary rocks (Jurong formation, JF) in the west, igneous rocks (Bukit Timah granite, BTG) in the center, and Quaternary deposits (Old Alluvium soils and soft soil deposits called the Kallang formation, KF) in the east. BTG is the largest physiographic area of Singapore and is characterized by hills and valleys of both high and low relief. Most of the hills in this area are less than 60 m in height; however, the granite near its western contact with JF forms steeper and more prominent hills, the highest of which rises to 163 m. BTG is a general name for acid rocks including granite, adamellite, granodiorite, and the acid and intermediate hybrids (Qi et al., 2020a). Because of the humid tropical climate in Singapore, the acid rocks in the BTG formation have been heavily weathered. The thickness of residual soils derived from weathered BTG ranges from a few meters up to 70 m, with an average thickness of 30 m (Sharma et al., 1999; Wee and Zhou, 2009). To ensure safe underground development in Singapore, borehole investigations were carried out in the area of interest to investigate the geological conditions. Based on practical engineering experience in Singapore, rock mass weathered to Grades I to III is classified as rock, whereas rock mass weathered beyond Grade III is usually regarded as soil-like material (Sharma et al., 1999). The rockhead is normally taken as the elevation of the boundary between Grades III and IV in engineering practice (Qi et al., 2020a).
To support the underground space development in Singapore,Singapore Land Authority is working to develop a 3D map of subsurface utilities. In this case, GeM2S is established as a web-based 3D design tool for managing the shallow borehole data to present the subsurface formation for future underground projects in Singapore (Pan et al., 2020). However, in the 3D geological model construction process, interpolation methods like Kriging interpolation followed by expert justification could be time-consuming and introduce unexpected uncertainties when the geo-model is complex (Smirnoff et al., 2008). Therefore, it is desirable to utilize the ML techniques to estimate the unseen geological information between boreholes and automatically update the 3D geological model when new data are obtained. In this paper, a hybrid ML framework was proposed, and a comprehensive comparison with other conventional ML methods was discussed.

Fig. 2. XGBoost-based ML framework for spatial rockhead prediction. GSE: ground surface elevation; EVRS: explained variance regression score; MAE: mean absolute error.

Fig. 3. Architecture of MLP-ANN.
Fig. 2 illustrates the framework of this study. The framework consists of three parts:
(1) Data preparation: To identify the DTB for each borehole and prepare covariates; to improve the quality of the dataset using the synthetic minority over-sampling technique for regression (SMOTER) with the introduction of Gaussian noise (SMOGN); and to rescale the data to a common scale by min-max normalization, without distorting differences in the ranges of values or losing information;
(2) ML model establishment: To build N-XGBoost and other existing ML models; and
(3) ML model evaluation: To compare the performance of the proposed hybrid ML models with other conventional ML models.
3.2.1. Multivariate linear regression (MLR)
MLR is regarded as an elegant algorithm for modeling the linear relationship between covariates and the target variable. The quality of an MLR model depends on the degree of correlation between the input and predicted values (Prion and Haerling, 2020). For the output $y$ with predictor variables $\{X_1, X_2, \ldots, X_p\}$, the model can be expressed as

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \tag{1}$$

where $\beta_p$ represents the regression parameter, and $\varepsilon$ is a Gaussian random variable that follows $\varepsilon \sim N(0, \sigma^2)$ (Olive, 2017). The regression parameters are estimated by minimizing the sum of squared errors (SSE). If the number of predictors is greater than two, the regression equation is a hyperplane (Young, 2017). The simplicity and interpretability of the linear regression method make it the basis of many other ML algorithms.
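As an illustration of the SSE criterion, the following minimal sketch estimates the regression parameters by solving the normal equations on a small synthetic dataset. The function name `fit_mlr` and the data are hypothetical and are not taken from the study.

```python
def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * t for r, t in zip(rows, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for i in range(k):
            if i != c:
                f = M[i][c]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[c])]
    return [M[i][k] for i in range(k)]  # [b0, b1, ..., bp]

# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1 + 2 * a - 3 * b for a, b in X]
beta = fit_mlr(X, y)
```

With noise-free data the fitted parameters recover the generating coefficients up to floating-point error; with real borehole data the residual would play the role of $\varepsilon$.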
3.2.2. Artificial neural network (ANN)
ANN is widely applied to nonparametric and nonlinear problems with complex mappings from input to output. It helps identify the relationship between known variables and unknown parameters (Thanh et al., 2019). The multi-layer perceptron (MLP) network, one category of ANN, is the most common feedforward network and owes its advantages to the nonlinearity and robustness of its mapping from inputs to outputs (Svozil et al., 1997). The forward propagation of signals and the backpropagation of errors allow the weights to be updated effectively according to the learning rule. Similar to the biological nervous system, an MLP consists of many neurons that interact through the links between them. As shown in Fig. 3, the neurons are organized in layers, which can be categorized as the input layer, hidden layers, and output layer (Ahmadi, 2015). The final result is obtained by passing the weighted sum of the inputs to each neuron through an activation function $f$. Thus, the output of the forward process for inputs $X_i$ with weights $w_{ij}$ can be expressed as

$$y_j = f\left(\sum_{i} w_{ij} X_i\right) \tag{2}$$
However, the selection of ANN structure, which is usually by trial and error, is still crucial since there is no specification for the selection of hyper-parameters to guarantee the model’s performance (Lawal and Kwon, 2020). In this study, an MLP-ANN model was built for the prediction of DTB based on six input features as shown in Fig. 3.
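The forward pass described above can be sketched as follows. This is only an illustration: the layer sizes, weights, bias terms, and the tanh activation are assumptions made for the example, not the architecture or settings used in the study.

```python
import math

def forward(x, layers, f=math.tanh):
    """Propagate input x through a list of (weights, biases) layers."""
    a = list(x)
    for idx, (W, b) in enumerate(layers):
        # weighted sum of the previous layer's outputs for each neuron
        z = [sum(w * ai for w, ai in zip(row, a)) + bj for row, bj in zip(W, b)]
        # hidden layers apply f; the output layer is kept linear for regression
        a = z if idx == len(layers) - 1 else [f(v) for v in z]
    return a

# Hypothetical network: 6 inputs -> 3 hidden neurons -> 1 output
hidden = ([[0.1] * 6 for _ in range(3)], [0.0, 0.0, 0.0])
output = ([[1.0, 1.0, 1.0]], [-2.0])
x = [0.5] * 6  # six normalized input features
out = forward(x, [hidden, output])
```

In training, the weights would be adjusted by backpropagation; here they are fixed to keep the example short.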
3.2.3. Support vector machine (SVM) for regression
SVM was first developed by Vapnik and Cortes (1995) as a new approach in ML technology. The basic idea of SVM is to build, based on a kernel function, a linear hyperplane that separates samples of different classes in a high-dimensional space. Theoretically, given a training dataset $\{X_i, y_i\}_{i=1}^{n}$, where $X_i$ is the high-dimensional input and $n$ is the number of training data, the output $y_i$ can be described as

$$y_i = \omega^{\mathrm{T}} X_i + b \tag{3}$$

where $y \in (-1, 1)$, $\omega$ is the weight vector normal to the hyperplane, and $b$ is the hyperplane bias.

By introducing a kernel trick, SVM can be generalized to nonlinear classification and regression problems. Eq. (3) can be modified to

$$y_i = \omega^{\mathrm{T}} \varphi(X_i) + b \tag{4}$$

where $\varphi(X)$ is the kernel function that maps the input space $X$ into a higher-dimensional feature space; typically considered kernel functions are the linear, polynomial, radial basis, and sigmoid functions. As shown in Fig. 4, the principle of SVM for regression is to consider the sample points that lie within the following range:

$$-\varepsilon \le y_i - \left(\omega^{\mathrm{T}} \varphi(X_i) + b\right) \le \varepsilon \tag{5}$$

where $\varepsilon$ is the half-width of the tolerance band around the hyperplane. The best-fit line is the hyperplane that contains the maximum number of points within this range, and the fit is assessed by the mean squared error between prediction and observation. Hence, the SVM model can be trained on a large number of training data to obtain the optimal model for prediction.
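The idea of considering only the points inside a tolerance band can be illustrated with the usual ε-insensitive loss of support vector regression. The sketch below assumes that standard formulation; the symbol `eps` (the half-width of the band) and the sample values are assumptions for illustration, since the range itself is not reproduced in the text.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero loss inside the band |y - y_hat| <= eps, linear loss outside it."""
    return [max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred)]

# Hypothetical observations and predictions
losses = eps_insensitive_loss([1.0, 2.0, 3.0], [1.05, 2.5, 2.0], eps=0.1)
# The first point lies inside the band and contributes no loss;
# the other two are penalized by their distance beyond the band.
```

Points inside the band do not influence the fitted hyperplane, which is why only the points on or outside the band act as support vectors.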
3.3.1. XGBoost algorithm
Chen and Guestrin (2016) proposed a highly scalable end-to-end tree boosting system, i.e. XGBoost, which has been widely applied and optimized in many research fields (Li et al., 2019; Zhao et al., 2019; Zheng and Wu, 2019; Feng et al., 2020; Wang et al., 2020; Zhang et al., 2020b, 2021a). XGBoost is an improved framework of the GBRT model. As shown in Fig. 5, GBRT is a boosting model consisting of a series of basic regression trees built through a sequential ensemble technique. It can adaptively add more trees to enlarge the model capacity.
Therefore, the final prediction of the model can be expressed by

$$\hat{y} = \hat{y}_m = \sum_{j=1}^{m} \alpha f_j(X, \theta_j) \tag{6}$$

where $m$ is the number of regression trees for boosting; $\theta_j$ is the parameter controlling the structure of the $j$-th tree; $\alpha$ is the shrinkage factor or learning rate of an individual regression tree; $X$ is the predictor and $\hat{y}_j$ is the prediction after the $j$-th regression tree; and $f_j(X, \theta_j)$ is the output of the $j$-th regression tree with structure $\theta_j$ before shrinkage, for which the predictor $X$ and the residual $y - \hat{y}_{j-1}$ are used as inputs. Consequently, the residual generally decreases as the number of regression trees increases. The objective of gradient boosting regression is to find the optimal $\theta_j$ and build $f_j(X, \theta_j)$ at the $j$-th step to minimize the objective function

$$\mathrm{Obj}(\theta_j) = \sum_{i=1}^{n} l\left(y_i, \hat{y}_{i,j-1} + \alpha f_j(X_i, \theta_j)\right) \tag{7}$$

where $n$ is the number of training samples and $l$ is the loss function, usually the squared error between the predicted value $\hat{y}$ and the ground truth $y$.
Compared with the conventional GBRT algorithm, XGBoost adds a regularization term to the conventional loss function (Chen and Guestrin, 2016) to penalize model complexity and prevent overfitting. In XGBoost, Eq. (7) is rewritten as

$$\mathrm{Obj}(\theta_j) = \sum_{i=1}^{n} l\left(y_i, \hat{y}_{i,j-1} + \alpha f_j(X_i, \theta_j)\right) + \Omega(\theta_j) \tag{8}$$

where the first term is the training loss of Eq. (7) and $\Omega(\theta_j)$ is the regularization term on the $j$-th regression tree to prevent overfitting:

$$\Omega(\theta_j) = \gamma T_j + \frac{1}{2} \lambda \sum_{k=1}^{T_j} \left(w_k^{(j)}\right)^2 \tag{9}$$

where $T_j$ is the number of leaves in the $j$-th regression tree, $\gamma$ is the minimum loss reduction needed for a further node partition in the regression tree, $\lambda$ is the regularization coefficient on the leaf weights, and $w_k^{(j)}$ is the weight of the $k$-th leaf in the $j$-th regression tree. A tree with more leaves (larger $T_j$) incurs a larger penalty, scaled by $\gamma$, in the objective function. The XGBoost method uses a greedy algorithm to build the regression trees according to this objective function. Based on the residuals predicted at the previous step, the regression trees are determined one by one through forward stepwise training until the XGBoost model is complete.
In the XGBoost model, the tree parameter $\theta_j$ is determined during training, but hyperparameters such as $\gamma$, $\lambda$, $m$, $\alpha$, and $d_{\max}$ must be specified before training. Here, $d_{\max}$ is the maximum depth of a regression tree (e.g. $d_{\max} = 4$ in Fig. 5). More details of the XGBoost algorithm can be found in Chen and Guestrin (2016).
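The sequential residual fitting with shrinkage, and the leaf-based complexity penalty, can be sketched in a few lines. This is a simplified illustration under stated assumptions: each tree is a two-leaf stump on one-dimensional inputs fitted by squared error, which is plain gradient boosting rather than XGBoost's own split-finding procedure; the helper names and data are hypothetical.

```python
def fit_stump(x, r):
    """Best single-split regression tree (two leaves) on 1-D inputs x for targets r."""
    best = None
    for s in sorted(set(x))[:-1]:  # candidate thresholds; right side never empty
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        wl, wr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - wl) ** 2 for ri in left) + sum((ri - wr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, wl, wr)
    _, s, wl, wr = best
    return s, wl, wr

def boost(x, y, m=50, alpha=0.3):
    """Fit m stumps sequentially, each on the residual of the current prediction."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(m):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        s, wl, wr = fit_stump(x, resid)
        trees.append((s, wl, wr))
        # shrink each tree's contribution by the learning rate alpha
        pred = [pi + alpha * (wl if xi <= s else wr) for xi, pi in zip(x, pred)]
    return trees, pred

def omega(leaf_weights, gamma, lam):
    """Complexity penalty: gamma * (number of leaves) + 0.5 * lambda * sum of squared leaf weights."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]
trees, pred = boost(x, y, m=50, alpha=0.3)
mse_before = sum(yi ** 2 for yi in y) / len(y)
mse_after = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
penalty = omega([1.0, -2.0], gamma=0.5, lam=2.0)
```

The training error shrinks as trees are added, while `omega` grows with the number of leaves and the size of the leaf weights, which is the trade-off the regularized objective balances.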
Many researchers have found that hyperparameters can significantly affect the final performance of ML models (Rodriguez-Galiano et al., 2015; Duan et al., 2020a; Feng et al., 2020; Zhang et al., 2020c, 2021a). Hence, hyperparameters should be tuned for the best performance of the ML model. Commonly used tuning methods include grid search, random search, and Bayesian optimization (Wang and Sherry Ni, 2019; Zhang et al., 2021a). The first two methods explore the space of available parameter values with each trial evaluated in isolation, whereas Bayesian optimization finds the optimal parameter combination more efficiently by taking past evaluations into account (Gao and Ding, 2020; Zhang et al., 2021a). In this study, Bayesian optimization was adopted to tune the following four key hyperparameters, which have a high impact on the performance of the XGBoost model:
(1) max_depth (dmax): it controls the complexity of model. A more complicated model is much easier to be overfitted.
(2) learning rate (α in Eq. (6)): it is a crucial hyperparameter in most ML algorithms. It can be adjusted to make the model more robust.
(3) gamma (γ): it controls the regularization in Eq. (9), and the optimal value of γ could help prevent overfitting.
(4) lambda (λ): it controls regularization on weights to avoid overfitting.
Fig. 6 shows the diagram of ten-fold cross-validation. In ten-fold cross-validation, the training dataset is divided into ten subsets; in each iteration, nine of the ten subsets are used to train the model and the remaining one is used to validate it. In the end, the evaluation indicator (e.g. root mean squared error (RMSE) for a regression problem) over the ten iterations reflects the overall performance of the ML model for the current hyperparameter combination. By combining Bayesian optimization with cross-validation, the final ML model tuned with the optimal hyperparameters can achieve better generalization performance on unseen data.
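The splitting scheme can be sketched as follows. The generator name `k_fold_indices`, the sample count of 400, and the random seed are hypothetical choices for illustration.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # shuffle once before splitting
    folds = [idx[i::k] for i in range(k)]     # k roughly equal subsets
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(400, k=10))
```

For each hyperparameter combination, a model would be fitted on `train` and scored on `val` for every split, and the ten scores averaged.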
Furthermore, LightGBM, another advanced GBRT-based framework, is also compared in this study. Unlike most other implementations, which grow trees level-wise, LightGBM grows trees leaf-wise: it chooses the leaf expected to yield the largest decrease in loss. The development of LightGBM focuses on performance and scalability. Because its underlying theory is similar, the details of LightGBM are not presented here and can be found in Ke et al. (2017) and Liang et al. (2020).

Fig. 5. Schematic diagram of the GBRT.

Fig. 6. Ten-fold cross-validation for evaluation of model with specific hyperparameters.
3.3.2. Natural gradient boosting (NGBoost) for probabilistic prediction
In addition to an accurate point prediction, the predictive interval of an ML model is crucial in practice. Duan et al. (2020b) proposed NGBoost, a supervised ML algorithm for generic probabilistic prediction. It outputs a full probability distribution over the entire outcome space. The core of NGBoost is that it uses a boosting technique to estimate the parameters of a conditional probability distribution Pθ(y|X) as a function of X. Fig. 7 shows the conceptual workflow of NGBoost, which includes three main parts: the base learner (f), the parametric probability distribution (Pθ), and the scoring rule (S).
In this study, a hybrid NGBoost module with XGBoost base learners was designed to perform both point prediction and probabilistic prediction. In this hybrid ML model, the input features are fed to the XGBoost base learners to produce a probability distribution of the predictions Pθ(y|X) over the entire outcome space of y. The scoring rule S(Pθ, y) is used to optimize the NGBoost model through maximum likelihood estimation (MLE), which provides calibrated uncertainty and point predictions. The input features of this model include the borehole coordinates, the GSE, and parameters derived from the digital elevation model (DEM) such as slope, aspect, and curvature. The target value is the elevation of the rockhead. During the modeling process, the XGBoost base learner was fine-tuned before the training of the proposed hybrid model to obtain a better point prediction performance.
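To make the role of the scoring rule concrete, the sketch below evaluates the negative log-likelihood of an observation under a Gaussian predictive distribution, which is the quantity minimized under MLE. The Gaussian form of Pθ is an assumption made here for illustration; the text does not state which distribution was used, and the function name and numbers are hypothetical.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of observation y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# Same point prediction (mu = 0) and same observation (y = 1),
# but two different claimed uncertainties.
score_honest = gaussian_nll(1.0, 0.0, 1.0)         # sigma comparable to the error
score_overconfident = gaussian_nll(1.0, 0.0, 0.1)  # sigma far too small
```

The overconfident prediction receives a much worse (larger) score even though its point prediction is identical, which is how the scoring rule pushes the model toward calibrated uncertainty.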
A large amount of geotechnical borehole data has been collected over the years from various construction projects carried out in Singapore. In this study, data from 502 boreholes along one MRT line (shown in Fig. 1 as green points) were sampled as DTB observation data. The DTB of each borehole was identified manually from the borehole logs.

Fig. 7. The hybrid NGBoost model with base learner XGBoost.

Fig. 8. Distribution of DTB in this study: (a) Histogram of DTB, and (b) Distribution of DTB in boxplot.
Fig. 8a shows the distribution of the target variable DTB, in which a long tail of DTBs lower than -33 m was found. It can also be seen that only a few DTBs were lower than -33 m. Thus, the boreholes with DTB deeper than -33 m were identified as outliers in the boxplot in Fig. 8b. However, as a purely data-driven methodology, ML can be strongly affected by such outliers because of the imbalanced distribution of DTB in the original samples. To solve this problem, a data preprocessing method called SMOGN was adopted.
Several studies have claimed that the thickness of the soil is likely to be related to the local terrain(Themistocleous et al.,2016;Wei et al., 2017; Simon et al., 2020). Therefore, both the borehole data from site investigation and the local terrain features(i.e.slope,aspect, and curvature) derived from the high precision DEM of Singapore were utilized to create the dataset for the ML model established in this study.Table 1 presents the summary statistics of the dataset.
In this study, the observations of DTB, borehole locations, and local terrain features under the Singapore coordinate generated a dataset which was then randomly divided into a training set(80%of the whole data) and a testing set (20% of the whole data).
To overcome the performance degradation caused by imbalanced data, Torgo et al. (2013) proposed the SMOTER algorithm, which changes the distribution of the given training dataset to balance the rare cases and the most frequent ones. Branco et al. (2017) further introduced Gaussian noise into SMOTER, i.e. SMOGN, to deal with imbalanced regression problems in which the cases most important to the user are poorly represented in the available data. SMOGN generates new synthetic examples with SMOTER only when the seed example and the selected k-nearest neighbor (KNN) are 'close enough', and uses Gaussian noise when the two examples are 'more distant'. As shown in Fig. 9, the key idea of the SMOGN algorithm is to generate new synthetic samples from the three nearest neighbors of the seed case, which are supposed to have similar DTBs and local terrain features (e.g. slope and aspect). Therefore, SMOGN was adopted in this study to oversample the rare data points (DTB < -33 m) and thus improve the robustness of the ML model when predicting a deeper DTB. More details on the SMOTER and SMOGN algorithms can be found in Torgo et al. (2013) and Branco et al. (2017).

Table 1 Summary statistics of dataset.
As a common data preprocessing method, normalization not only enhances the overall predictive performance of ML models but also improves computing efficiency (Pu et al., 2019; Yu et al., 2020; Zhang et al., 2021a). In this study, min-max normalization was adopted to convert the dataset to a range from 0 to 1:

$$\tilde{f}_i^{(k)} = \frac{f_i^{(k)} - \min_i f_i^{(k)}}{\max_i f_i^{(k)} - \min_i f_i^{(k)}} \tag{10}$$

where $i$ is the sample index, $f_i^{(k)}$ denotes the $i$-th sample in the $k$-th feature domain, and $\tilde{f}_i^{(k)}$ is its normalized value.
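A minimal sketch of min-max normalization for one feature column is given below. The function name and the sample values are hypothetical; the guard for a constant column is an added assumption to avoid division by zero.

```python
def min_max_normalize(column):
    """Rescale one feature column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant feature: no spread to rescale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical ground surface elevations in meters
scaled = min_max_normalize([10.0, 20.0, 40.0])
```

In practice the minimum and maximum would typically be taken from the training set and reused to scale the testing set, so that no information from the testing data leaks into preprocessing.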
The innovative idea of the SMOGN method is to oversample the minority cases in the training data, and thus improve the predictive ability of the ML model, by choosing between two oversampling techniques according to the KNN distance in feature space between a given observation and its neighbor (Branco et al., 2017). If the distance is small enough, SMOTER is applied. If the distance is too large, Gaussian noise is introduced instead to generate the synthetic sample.
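The distance-based choice between the two strategies can be sketched as below. This is a deliberately simplified illustration, not the SMOGN algorithm as published: the fixed threshold `max_dist`, the fixed noise level `noise_sd`, and the treatment of features and target as one vector are all assumptions made to keep the example short.

```python
import math
import random

def smogn_sample(seed_case, neighbor, max_dist, rng, noise_sd=0.02):
    """Generate one synthetic case from a seed case and one of its neighbors."""
    d = math.dist(seed_case, neighbor)
    if d <= max_dist:
        # 'close enough': SMOTER-style interpolation between the two cases
        t = rng.random()
        return [s + t * (n - s) for s, n in zip(seed_case, neighbor)]
    # 'more distant': perturb the seed case with Gaussian noise instead
    return [s + rng.gauss(0.0, noise_sd) for s in seed_case]

rng = random.Random(0)
near = smogn_sample([0.0, 0.0], [1.0, 1.0], max_dist=2.0, rng=rng)
far = smogn_sample([0.0, 0.0], [10.0, 10.0], max_dist=2.0, rng=rng)
```

The interpolated case lies on the segment between the seed and its neighbor, whereas the noisy case stays close to the seed, which avoids creating implausible samples between two distant observations.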

Fig. 9. Synthetic example of the application of SMOGN.
As presented above, the primary objective of this study is to build a novel AI method, called the N-XGBoost methodology, for predicting the rockhead elevation in engineering practice based on the XGBoost model and the NGBoost probabilistic prediction algorithm. Accordingly, borehole data together with local ground surface parameters such as slope, aspect, and curvature were prepared as predictors. The rockhead elevation was carefully identified by experts from the borehole logs and was regarded as the target variable. Data from 502 boreholes were randomly divided into two parts for training and testing: 80% of the samples were used to train the N-XGBoost model with a ten-fold cross-validation strategy, and the remaining 20% were used to test the precision of the developed N-XGBoost model. To overcome the bias caused by rare samples in regression, a preprocessing method called SMOGN was applied before training to improve the predictive capability of N-XGBoost over a wider range. Fig. 10 shows the flowchart of the N-XGBoost model in this study.
In the developed hybrid N-XGBoost model, the XGBoost algorithm was introduced to NGBoost probability prediction algorithm as the base learner. To establish the model, an initial XGBoost model was first fitted by the training dataset. Meanwhile, the four key hyperparameters of XGBoost model were chosen after trial and error, and were further optimized by the Bayesian optimization algorithm. With the achievement of optimal XGBoost model, its predictive accuracy can be enhanced to some extent.
For comparison, popular ML models such as MLR, MLP-ANN, and SVM were also trained on the same training dataset. All the developed ML models were evaluated by indices such as R², MAE, and RMSE on the same testing dataset. R² represents the correlation and goodness of fit between the predicted and observed values. MAE measures the average error over all predictions. RMSE is widely applied when an error estimate that is sensitive to large errors is required. In rockhead elevation prediction, RMSE and MAE are expressed in meters. EVRS was also used to evaluate the variance explained by the model. The higher the EVRS, the more of the variance the model explains.
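The four indices can be computed as in the following sketch, which uses their standard definitions. The function name and the elevation values are hypothetical and are not results from the study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE, and explained variance score (EVRS)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    err = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in err)                      # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)       # total sum of squares
    mean_err = sum(err) / n
    var_err = sum((e - mean_err) ** 2 for e in err) / n   # variance of the errors
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in err) / n,
        "EVRS": 1.0 - var_err / (ss_tot / n),
    }

# Hypothetical rockhead elevations in meters
metrics = regression_metrics([-10.0, -20.0, -30.0, -40.0],
                             [-12.0, -18.0, -33.0, -39.0])
```

EVRS differs from R² only in that it ignores a constant bias in the errors, so the two coincide when the mean error is zero.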
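For $n$ target values, the statistical criteria stated above can be calculated by

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{11}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{12}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{13}$$

$$\mathrm{EVRS} = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)} \tag{14}$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the observed values, and $\mathrm{Var}(\cdot)$ denotes the variance.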
4.3.1. SMOGN in preprocessing
Fig. 11a shows the distribution of rockhead elevation in the original training data. It demonstrates that the samples deeper than -33 m are so rare that they appear as outliers in the boxplot. To overcome the data imbalance problem, SMOGN was adopted to change the distribution of minority samples, as shown in Fig. 11b. The benefits of SMOGN in improving the performance of the ML models will be presented in Section 4.4. Additionally, to reduce the effect of different feature scales on the performance of the ML models, the training dataset was also normalized by Eq. (10).
4.3.2. Training ML models
As mentioned in Section 3.3.1, an initial XGBoost model was developed as the N-XGBoost base learner with four important hyperparameters: maximum tree depth (dmax), learning rate (α), minimum loss reduction (γ), and L2 regularization factor (λ). The optimum values of these four parameters were selected by Bayesian optimization. Bayesian optimization is widely used to search for the minimum of an objective function by building a surrogate function from the evaluation results. It is a powerful tool when the objective function is unknown and its evaluation is complex (Zhou et al., 2018), and it therefore avoids a great deal of wasted effort. Other hyperparameters were set to their default values. Table 2 shows the results of hyperparameter tuning. All the other ML models were also trained with the best hyperparameter sets obtained by Bayesian optimization on the same training dataset.
After the optimum hyperparameters were set, the training dataset was used to fit the ML models described above. The performance of each ML model was evaluated by the ten-fold cross-validation method. In the training stage, the developed XGBoost model was systematically compared with the three conventional ML models, GBRT, and LightGBM. Fig. 12 shows the comparison results with respect to predictive accuracy and robustness under ten-fold cross-validation. It demonstrates that the developed XGBoost model, with an average R² of 0.895, achieved a better performance than GBRT and LightGBM and significantly outperformed the other three ML models as well. Meanwhile, the R² curves for the three conventional ML models (MLR, MLP-ANN, and SVM) indicate that they have poor robustness across different training data subsets.
Fig. 13 illustrates the accuracy of the predictive values by the different ML algorithms for estimating the rockhead position(DTB).Overall, the tree-based ML models (LightGBM, GBRT, and XGBoost)have a higher prediction accuracy than SVM,MLR,and MLP-ANN for determining the DTB values with limited data. The developed XGBoost model achieved the highest R2and lowest RMSE among those ML models.
The performance of the developed XGBoost model and of the other five ML models (MLR, MLP-ANN, SVM, LightGBM, and GBRT) in predicting the rockhead position (DTB) was evaluated on both the training and testing datasets using four indicators (R2, RMSE, MAE, and EVRS). Fig. 14 shows the predictive results of the different ML models on the testing dataset, which demonstrate that the tree-based models are more suitable than the conventional ML models for predicting the DTB from sparse borehole data. The performances of the different models with and without SMOGN preprocessing are presented in Table 3. The developed XGBoost model achieved the best performance among all the ML models on both the training and testing datasets. It can also be concluded that the SMOGN preprocessing method helps improve the overall predictive performance by oversampling the rare samples distributed below -30 m.
Based on the results shown in Table 3, the developed XGBoost model was selected to be combined with the NGBoost algorithm to make probabilistic predictions.
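For reference, the four indicators can be computed as in the sketch below. It assumes that EVRS denotes the explained variance score, 1 - Var(y - ŷ)/Var(y); the function name `regression_metrics` and the sample values are hypothetical and are not data from this study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE, and explained variance score for two sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    mean_res = sum(residuals) / n
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    var_y = ss_tot / n
    return {
        "R2": 1.0 - ss_res / ss_tot,       # penalizes bias and scatter
        "RMSE": math.sqrt(ss_res / n),     # same unit as the target (m for DTB)
        "MAE": sum(abs(r) for r in residuals) / n,
        "EVS": 1.0 - var_res / var_y,      # ignores a constant bias
    }

# Hypothetical observed and predicted values (not data from this study).
observed = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
scores = regression_metrics(observed, predicted)
```

R2 and the explained variance score coincide when the mean residual is zero, so a gap between the two points to a systematic offset in the predictions.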

Fig. 10. Establishment of the N-XGBoost model in this study.

Fig. 11. Preprocessing results: (a) Before SMOGN, and (b) After SMOGN.

Table 2 Hyperparameter tuning for XGBoost in this study.

Fig. 12. The average coefficients of determination R2 (AVG) for training data under ten-fold CV.
A predictive interval quantifies the uncertainty around a predicted value in regression analysis. For example, the 95% predictive interval of a model for a given input is a range (L, H), meaning that, according to the model, an observation with that input is expected to fall within (L, H) with a probability of 95%. In this study, the NGBoost algorithm with the developed XGBoost model as the base learner was used to estimate the predictive interval. The comparison between the DTB values predicted by N-XGBoost and the actual DTB observations is shown in Fig. 15, where 82.3% of the actual DTB values in the testing dataset fall within the 95% predictive interval. Note that, to better present the distribution of DTB, the predicted and observed values are plotted in a cross-section view whose origin is the westernmost point in Fig. 1. It can be concluded from Fig. 15 that the N-XGBoost model not only achieved a good point prediction accuracy, but also provided a reliable predictive interval that accounts for the uncertainty caused by the quality of the data and the number of samples.
However, imbalance in the training dataset can seriously affect the model’s performance. Fig. 16 shows the predictive results of the N-XGBoost model without SMOGN preprocessing. The predictive interval in the right part of the cross-section is much wider than that in Fig. 15 because only a few samples are present in this area without SMOGN. Figs. 15 and 16 together demonstrate that adopting the SMOGN method in the preprocessing stage can greatly improve the predictive performance of the ML model under sparse data conditions.
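The 95% predictive interval and the fraction of observations it covers can be computed as sketched below. The sketch assumes that the probabilistic model returns a Normal distribution (mean and standard deviation) for each sample, which is an assumption made here for illustration; the function names and the sample values are hypothetical and are not data from this study.

```python
from statistics import NormalDist

def predictive_interval(mu, sigma, level=0.95):
    """Central interval (L, H) of a Normal(mu, sigma) predictive distribution."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    return mu - z * sigma, mu + z * sigma

def empirical_coverage(y_true, mus, sigmas, level=0.95):
    """Fraction of observations that fall inside their predictive interval."""
    hits = 0
    for y, mu, sigma in zip(y_true, mus, sigmas):
        low, high = predictive_interval(mu, sigma, level)
        hits += int(low <= y <= high)
    return hits / len(y_true)

# Hypothetical rockhead elevations (m) with predicted means and standard deviations.
observed_dtb = [-10.0, -20.0, -35.0, -50.0]
predicted_mean = [-11.0, -19.0, -30.0, -49.0]
predicted_std = [1.0, 1.0, 2.0, 3.0]
coverage = empirical_coverage(observed_dtb, predicted_mean, predicted_std)
```

In this toy case three of the four observations fall inside their intervals, so `coverage` is 0.75. Comparing such a fraction with the nominal level is one way to judge how well the intervals are calibrated, and a larger predicted standard deviation directly widens the interval, as in the sparsely sampled area of Fig. 16.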
Feature importance is a useful indicator for estimating the contribution of a feature to the target value. A benefit of using ensembles of decision trees (e.g. LightGBM, XGBoost, and random forest) is that they can automatically provide estimates of feature importance from a trained predictive model. Generally, feature importance provides scores that indicate how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more frequently a feature is used to make decisions, the higher its importance. Therefore, a trained XGBoost model can provide a ranking of feature importance for a specific application. Features with higher scores can be regarded as having more influence on the predictive results than those with lower scores (Gao and Ding, 2020). Fig. 17 shows the relative importance of the six features in terms of ‘weight’ in this study. The weight is the normalized number of times a feature is used to split the data at the nodes of the trees, which indicates the relative importance of the features. The coordinate x is the most important feature, followed by ground elevation (GSE), coordinate y, slope, aspect, and curvature. The ranking in Fig. 17 indicates that the rockhead position varies greatly along the x-direction (west-east), which is consistent with the actual situation. The GSE also contributes considerably to the prediction of the rockhead because the depth of the rockhead has a positive correlation with the altitude.
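The ‘weight’ importance amounts to counting splits and normalizing, as the following sketch shows. The nested-dictionary tree layout and the two toy trees are hypothetical and are not taken from the trained model. In the xgboost Python package, the raw split counts are, to our understanding, available through `Booster.get_score(importance_type="weight")`.

```python
from collections import Counter

def count_splits(node, counts):
    """Recursively count how often each feature is used at a split node."""
    if "feature" not in node:  # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

def weight_importance(trees):
    """Normalized split counts per feature, sorted from most to least used."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.most_common()}

# Two hypothetical trees: inner nodes name the split feature, leaves hold a value.
toy_trees = [
    {"feature": "x",
     "left": {"feature": "GSE", "left": {"value": -12.0}, "right": {"value": -20.0}},
     "right": {"value": -35.0}},
    {"feature": "x",
     "left": {"value": 1.5},
     "right": {"feature": "y", "left": {"value": -0.8}, "right": {"value": 0.4}}},
]
importance = weight_importance(toy_trees)
```

Here `x` is used in two of the four splits and therefore receives a weight of 0.5, while `GSE` and `y` each receive 0.25. Note that this measure counts how often a feature is used, not how much each split reduces the loss.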
The position of the rockhead is an important design parameter for tunneling and underground construction. In this paper, a hybrid XGBoost-based ML model was proposed for predicting the rockhead position from limited borehole data. To improve the performance of the XGBoost model, the hyperparameters were tuned by Bayesian optimization, and SMOGN was introduced to balance the data distribution. The predictive results of the hybrid XGBoost model were compared with those of five other ML methods (i.e. MLR, MLP-ANN, SVM, LightGBM, and GBRT). The comparison demonstrated that the proposed hybrid XGBoost model has the highest prediction accuracy on both the training and testing datasets. It is worth noting that SMOGN can effectively address the imbalanced data distribution problem and significantly improve the performance of ML models, especially for predicting rare extreme values of a numeric target variable. Furthermore, the developed XGBoost model was combined with the NGBoost method as its base learner to estimate predictive uncertainty. The results demonstrated that 82.3% of the actual rockhead values in the testing dataset fell within the 95% predictive interval of the hybrid N-XGBoost model.

Fig. 13. Comparison of predictive results among ML models in the training phase.

Fig. 14. Predictive results of ML models for the test dataset.

Table 3 Comparison of the predictive models on both the training and testing datasets.

Fig. 15. Plots of the predictive capability of the hybrid N-XGBoost model with SMOGN: (a) Prediction with uncertainty estimation of test data along the MRT line, and (b) Prediction of training data along the MRT line.
Lastly, the feature importance was analyzed using the XGBoost algorithm, and the results showed that coordinate x, GSE, and coordinate y were the three features that most affected the predicted rockhead position in this study. The reason may be that the variation of DTB is strongly affected by the spatial coordinates and GSE in the study area.
Although the proposed model obtains desirable predictive results, some limitations need to be addressed in future studies:
(1) Because the predictive performance of ML is greatly affected by the number and quality of the observations, increasing the number of high-quality borehole samples and of available features would help to improve the prediction accuracy.
(2) Since the borehole information is regarded as discrete data samples in this study, the prediction of the rockhead is strongly related to the current GSE and to the limited spatial relationships among rockhead points. Other features that were not measured may also influence the predictive results, such as the seismic velocity of the rock, the mechanical parameters of rock samples, and the rock quality. To further improve the performance of the proposed model, these continuous features can be included in the training model.

Fig. 16. Plots of the predictive capability of the hybrid N-XGBoost model without SMOGN: (a) Prediction with uncertainty estimation of test data along the MRT line, and (b) Prediction of training data along the MRT line.

Fig. 17. Feature importance ranking.
(3) The zone of interest in this study is line-shaped. The performance of the proposed model for areas of more complex shape (e.g. rectangular, circular, and irregular) needs to be further validated.
Declaration of competing interest
The authors wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Acknowledgments
This work is supported by the National Research Foundation (NRF) of Singapore, under its Virtual Singapore program (Grant No. NRF2019VSG-GMS-001), and by the Singapore Ministry of National Development and the National Research Foundation, Prime Minister’s Office under the Land and Livability National Innovation Challenge (L2 NIC) Research Program (Grant No. L2NICCFP2-2015-1). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the Singapore Ministry of National Development and National Research Foundation, Prime Minister’s Office, Singapore.
Journal of Rock Mechanics and Geotechnical Engineering, 2021, Issue 6