Jiwei Xie, Jinsong Huang *, Cheng Zeng, Shan Huang, Glen J. Burton
a Discipline of Civil, Surveying and Environmental Engineering, Priority Research Centre for Geotechnical Science and Engineering, The University of Newcastle, Callaghan, NSW 2308, Australia
b ATC Williams Pty. Ltd., Singleton, NSW 2330, Australia
Keywords: Site investigation; Machine learning (ML); Spatial interpolation; Geotechnical distance fields (GDFs); Tree-based models
ABSTRACT: This study introduces a generic framework for geotechnical subsurface modeling that accounts for spatial autocorrelation with local mapping machine learning (ML) methods. Instead of using XY coordinate fields directly as model input, a series of autocorrelated geotechnical distance fields (GDFs) is designed to enable the ML models to infer the spatial relationship between the sampled locations and unknown locations. The whole framework using GDFs with ML methods is named GDF-ML. This framework is purely data-driven, which avoids the tedious work of estimating the scales of fluctuation (SOFs) and detrending the data required by conventional spatial interpolation methods. Six local mapping ML methods (extra trees (ET), gradient boosting (GB), extreme gradient boosting (XGBoost), random forest (RF), general regression neural network (GRNN) and k-nearest neighbors (KNN)) are compared within the GDF-ML framework. The results show that the GDF-based ML methods outperform the conventional XY coordinate field based ML methods in both accuracy and spatial continuity. GDF-ML is flexible and can be applied to high-dimensional, multi-variable and incomplete datasets. Among the six methods, GDF with the ET method (GDF-ET) clearly shows the best accuracy and the best spatial continuity. The proposed GDF-ET method can provide a fast and accurate interpretation of the soil property profile. Sensitivity analysis shows that this method is applicable to very small training dataset sizes. The associated statistical uncertainty can also be quantified, so that the reliability of the subsurface modeling results can be estimated objectively and explicitly. The uncertainty results clearly show that the predictions become more accurate as more sampled data become available.
Geotechnical design and analysis involve making decisions on ground conditions where the information is incomplete. It has been estimated that less than one part in a million of the ground is investigated for any project (e.g. Broms, 1980). In most cases, engineers need to interpolate the information obtained from a limited number of locations to a much larger ground volume. At the same time, natural soils and rocks can be highly variable in space (e.g. Noorian Bidgoli and Jing, 2014; Wang et al., 2020). The spatially varying nature of geotechnical properties introduces another source of uncertainty. It is the interplay of these two uncertainties that leads to one of the most significant challenges in geotechnical design (e.g. Jaksa, 2000).
In practice, site investigation data are used to analyze the safety and reliability of geostructures. In such an analysis, it is common to assign material properties to each soil layer based on empirical methods, and the spatial variation within each layer is ignored. The commonly used empirical methods combine engineering judgment with naïve data analysis, such as interpolation using the global mean value or interpolation based on inverse distance squared. These methods are rough and inconsistent. Conservative parameters are often used to compensate for the effects of spatial variability. This strategy is not robust and increases cost. A good understanding of the subsurface parameter distribution is essential to geotechnical design and analysis (e.g. Pieczyńska-Kozłowska and Vessia, 2021). The demand for geotechnical spatial parameter mapping is rising rapidly, as it can provide objective insights into the subsurface condition. Phoon et al. (2016) summarized geotechnical data as Multivariate, Uncertain and Unique, Sparse, and InComplete (MUSIC). These characteristics make spatial interpolation a difficult task in which spatial variability, measurement error and model error need to be handled carefully.
Geostatistical methods and statistical models have been used to better characterize in situ soil properties. Geostatistical methods are the classical solution for accounting for spatial correlation. Kriging methods have been widely used in spatial interpolation since the 1960s (Isaaks and Srivastava, 1989). Kriging produces a smooth estimate. To obtain a more realistic estimate, random field theory for probabilistic site characterization has been studied by Vanmarcke (1983) and Phoon and Kulhawy (1999). However, these methods assume the soil property to be statistically homogeneous with a constant mean or a global mean trend. Covariance functions and scales of fluctuation (SOFs) have been used to characterize the stationary random field. However, estimating the SOFs is tedious, especially in the horizontal direction, due to the sparsity of geotechnical data (e.g. Ching et al., 2018). On the other hand, the trend is normally estimated with polynomial functions. These procedures introduce additional uncertainties (e.g. Wang et al., 2019). Spatial prediction can be treated as a regression problem. However, a global regression model may fail to explain the spatial non-stationarity where local variation exists. Spatial interpolation can be solved by regression methods with constraints, such as the Lasso and GLasso (Shuku and Phoon, 2021). The value at target locations can also be calculated as a weighted sum of basis functions, as in the multivariate adaptive regression splines (MARS) method (Samui et al., 2015) and the Bayesian compressive sensing method (Wang and Zhao, 2017; Zhao et al., 2020). However, the basis functions are predefined, and the performance of the interpolation may be influenced by the chosen basis functions (Ching et al., 2020). Sparse Bayesian learning (Ching and Phoon, 2017; Ching et al., 2020) adopts polynomial basis functions customized to geotechnical engineering data.
Machine learning (ML) methods have recently attracted increasing attention in geotechnical site investigation (e.g. Zhang and Phoon, 2022; Zhu et al., 2022). Zhao and Wang (2020) proposed a Bayesian supervised learning method for one-dimensional non-stationary data interpolation with multiple layers. Given the spatial correlation in geotechnical data, the intuition is that more attention should be paid to neighboring points. Thus, radial basis neural networks (e.g. the general regression neural network (GRNN)) are preferred in geotechnical engineering (e.g. Juang et al., 2001; Wu et al., 2021). Radial basis neural networks are explicitly local with an optimized kernel radius. Therefore, these models might not reveal the spatial dependencies when the SOFs change in space. Gaussian process (GP) regression methods have also attracted attention recently (e.g. Phoon and Ching, 2021). GP is a kernel-based model, and an important advantage is its ability to estimate uncertainties. A GP model is defined by a mean function and a positive definite covariance function. It allows prior information to be specified on the function space, which is useful when limited data are available. The basic assumption of GP is that there exists an underlying process for the unknown function. However, a non-stationary geotechnical field is hard to describe with a single underlying function, and finding suitable kernels is difficult.
Juang et al. (2001) classified the neural network method as global mapping ML and the radial basis method (which takes the sample distance into consideration) as local mapping ML. This paper proposes to apply tree-based ensemble ML methods to geotechnical spatial prediction. Tree-based ML methods can be treated as a kind of local mapping ML, as they automatically split the data into small regions based on the distance and similarity of the data. The basic unit of a tree-based method is the regression tree. The creation of a regression tree starts at the root of the tree and splits the data in a way that results in the largest information gain. Depending on the ensemble technique, the ensemble tree-based methods include random forest (RF), extra trees (ET), gradient boosting (GB) and extreme gradient boosting (XGBoost). The relevant implementations are available in Python and R through several packages such as XGBoost (Chen et al., 2015), ExtraTrees (Simm et al., 2014), Scikit-learn (Pedregosa et al., 2011) and Ranger (Wright and Ziegler, 2017).
In geotechnical applications, tree-based ML methods split the field without a fixed kernel or radius, which makes them adaptable to complex ground conditions. Additional advantages of these methods include: smooth and continuous predictions even for highly nonlinear soil profiles; fast computation; interpretable input variable importance; robustness to hyperparameter settings; flexibility from interpolation to regression through the hyperparameter settings; adaptability to incomplete datasets; and the ability to quantify the associated uncertainty. The details of and inherent connections between these methods are discussed in Section 2.
Geotechnical subsurface modelling is traditionally based on random field theory. The parameters of a random field are difficult to obtain if testing data are limited. ML methods are purely data-driven and do not require any prior parameters to be estimated. ML methods have recently been applied to subsurface modelling (e.g. Samui, 2012; Viswanathan et al., 2015; Wu et al., 2021; Kim and Ji, 2022). However, spatial correlation, which is an important aspect of subsurface modelling, has not been considered. This is because the inputs of these ML models are based on the Cartesian coordinate system (i.e. XY coordinates). The prediction process of ML based on Cartesian coordinates includes three steps: (a) transform the samples' Cartesian coordinates into distances between samples; (b) find the spatial correlations based on the distances between samples; and (c) make predictions based on the spatial correlations. However, if the training dataset is small (i.e. limited geotechnical testing data), ML models cannot fully capture the spatial correlations in Step (b). In this case, the ML model will simply predict the unknown value based on the closest sample value. To overcome this issue, this paper proposes a novel geotechnical distance field (GDF) system so that the spatial relationships among training samples are explicitly quantified in the inputs. The consequence is that ML algorithms will naturally link inputs and outputs based on spatial correlations. GDFs are inspired by Euclidean distance fields (e.g. Rosenfeld and Pfaltz, 1968; Behrens et al., 2018) and designed based on the characteristics of geotechnical testing data. ML based on GDFs is referred to as "GDF-ML" in this study. It is a generic framework which can be used in two-dimensional (2D), three-dimensional (3D), complicated soil layer and multiple variable situations.
GDF-ML is a purely data-driven framework in which no assumptions about basis functions or the stationarity of the soil properties are needed. It is applicable to incomplete datasets and can be easily extended to combine multiple sources of data. To illustrate the performance of the GDF-ML framework, non-stationary random fields are generated and utilized for validation. The results are compared with the XY coordinate based methods in terms of the root mean square error (RMSE). The GDF-ML framework is explained in detail and its prediction efficiency is compared with state-of-the-art ML methods. Six local mapping ML methods (RF, ET, GB, XGBoost, GRNN and k-nearest neighbors (KNN)) are compared in terms of accuracy, continuity and computation time within this framework. The applicability of GDF-ML under different sample densities is investigated. Then the best performing model, GDF-ET, is further investigated. The input variable importance, the effect of training dataset size, the physical meaning of the hyperparameters and the uncertainty estimation with GDF-ET are illustrated in detail.
There are plenty of methods for quantitative spatial interpolation. Most of them aim to minimize the variance of the prediction error:
$$\sigma_E^2 = E\left[\left(\hat{Z}(x_0) - Z(x_0)\right)^2\right] \quad (1)$$
where Z(x0) is the sampled value at point x0, Ẑ(x0) is the estimate at point x0, and x0 is the spatial coordinate. A key difference among the methods is how Ẑ(x0) is estimated. The mainstream methods for geotechnical data interpolation are summarized and compared as follows.
Kriging is one of the most popular spatial interpolation techniques (e.g. Zou et al., 2017). In the Kriging method, Ẑ(x0) is calculated under the constraint of unbiasedness. The unbiasedness constraint is
$$E\left[\hat{Z}(x_0) - Z(x_0)\right] = 0 \quad (2)$$
With suitable parameters, Kriging has been shown to give the best linear unbiased prediction at unsampled locations (Chiles and Delfiner, 2012). By minimizing the variance, the Kriging weights λi can be determined from
$$\sum_{j=1}^{n} \lambda_j C(x_i - x_j) = C(x_i - x_0) \ (i = 1, \ldots, n), \qquad \sigma_E^2 = C(x_0 - x_0) - \sum_{i=1}^{n} \lambda_i C(x_i - x_0) \quad (3)$$
where C(xi − xj) = Cov[Z(xi), Z(xj)] is the covariance between the known data points, C(x0 − x0) = Cov[Z(x0), Z(x0)] equals the variance, and C(xi − x0) = Cov[Z(xi), Ẑ(x0)] is the covariance between the known and target data points.
The covariance characterizes the relationship between two variables about their mean values. One of the important parameters in the covariance model is the SOF. However, this parameter is difficult to estimate in geotechnical engineering due to the sparse data distribution in the horizontal direction (Onyejekwe et al., 2016). In addition, Kriging assumes stationarity or intrinsic stationarity of the mean, so a detrending process is needed, which introduces additional uncertainty.
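As a concrete illustration, a minimal sketch of solving the Kriging system of Eq. (3) in Python is given below; the exponential covariance function and its parameters are illustrative assumptions, not values from this study.

```python
import numpy as np

def simple_kriging(X, x0, cov):
    """Solve the Kriging system of Eq. (3) for the weights and
    the minimized error variance.
    X: (n, d) sampled locations; x0: (d,) target location;
    cov: covariance as a function of the separation vector."""
    n = X.shape[0]
    C = np.array([[cov(X[i] - X[j]) for j in range(n)] for i in range(n)])
    c0 = np.array([cov(X[i] - x0) for i in range(n)])
    lam = np.linalg.solve(C, c0)                 # Kriging weights
    var = cov(np.zeros_like(x0)) - lam @ c0      # minimized error variance
    return lam, var

# Illustrative exponential covariance: variance 4, SOF-like length of 2 m
cov = lambda h: 4.0 * np.exp(-2.0 * np.linalg.norm(h) / 2.0)
X = np.array([[0.0], [1.0], [3.0]])              # sampled depths (m)
lam, var = simple_kriging(X, np.array([2.0]), cov)
print(lam, var)
```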
Spatial interpolation from limited data is an ill-posed inverse problem. Hence, additional constraints are necessary when estimating Ẑ(x0). The Lasso regression model applies L1 regularization as a constraint. Let X be the inputs at the sampled locations and z be the target values. The objective of the Lasso is to solve
$$\min_{\mathbf{w}} \left\| \mathbf{z} - \mathbf{X}\mathbf{w} \right\|_2^2 \quad \text{subject to} \quad \left\| \mathbf{w} \right\|_1 \le t \quad (4)$$
where w is the coefficient vector, and t is a prespecified parameter related to the degree of regularization. L1 regularization can result in sparse models with few coefficients. Shuku and Phoon (2021) improved the applicability of the Lasso method for geotechnical data using an efficient implementation of the alternating direction method of multipliers.
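For reference, the sketch below fits a Lasso model with scikit-learn, which solves the penalized (Lagrangian) form of Eq. (4); the design matrix and the alpha value are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                 # hypothetical basis-function values
z = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# alpha plays the role of the constraint parameter t in Eq. (4):
# larger alpha corresponds to a tighter L1 constraint.
model = Lasso(alpha=0.05).fit(X, z)
print(np.count_nonzero(model.coef_))          # L1 yields few nonzero coefficients
```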
Another solution for spatial interpolation is to approximate the target variable with a weighted sum of basis functions:
$$\hat{Z}(x) = \sum_{i=1}^{N} w_i B_i(x) \quad (5)$$
Eq. (5) is a weighted combination of basis functions Bi(x), where wi is the ith weight. The form of the basis function is predetermined, and regression is used to estimate the weights for that function. For example, the MARS method (Friedman, 1991) utilizes a combination of hinge functions as basis functions. Zhang et al. (2021) utilized the MARS method to evaluate pile drivability in terms of the prediction of maximum compressive stresses and blows per foot. The Bayesian compressive sensing method (Wang et al., 2017) is based on orthonormal bases (e.g. the discrete cosine transform and wavelet transform). The advantage of the Bayesian compressive sensing method is its ability to evaluate uncertainty. Due to the constrained form of the basis functions, the applicability and flexibility of these methods may be restricted (Ching et al., 2020).
Local mapping ML is essential in geotechnical site characterization because soil properties are more similar in nearby soil units. Several local mapping ML methods are compared as follows.
2.4.1. KNN algorithm
The most straightforward local mapping method is the KNN method, which forms a weighted summation over the 'k' nearest data points. The idea behind KNN methods is to find 'k' samples based on distance and predict the unknown values based on these nearby samples. The weights assigned to the neighboring data points can be uniform or distance-based. With distance-based weighting, the weights assigned to the k nearest neighbors are inversely proportional to their distance from the target point. Despite the simplicity of the idea, KNN has been successful in geotechnical interpolation problems (e.g. Rafi et al., 2020).
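A minimal distance-weighted KNN regression sketch with scikit-learn is shown below; the toy locations and values are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical sampled locations (x, depth) and measured values
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
z_train = np.array([8.0, 9.0, 10.0, 11.0])

# weights="distance" makes neighbor weights inversely proportional
# to the distance from the target point
knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_train, z_train)
print(knn.predict([[2.5, 0.5]]))
```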
2.4.2. Radial basis neural networks
The neural network is a popular tool for predicting unknown samples. The GRNN is widely used instead of the classical neural network in spatial interpolation (e.g. Juang et al., 2001; Wu et al., 2021). GRNN is a kind of radial basis neural network, so the outputs are proportional to the inverse of the distance from the center of the neuron. GRNN can be represented as
$$\hat{Z}(x_0) = \frac{\sum_{i=1}^{N} Z(x_i) K(x_0, x_i)}{\sum_{i=1}^{N} K(x_0, x_i)} \quad (6)$$
where K(x0, xi) is the kernel function. The Gaussian kernel is formulated as
$$K(x_0, x_i) = \exp\left(-\frac{d_i^2}{2\sigma^2}\right) \quad (7)$$
where di is the distance between the sampled point and the prediction point, and σ is the smoothing parameter, which is similar to the SOF in random field theory. σ can be anisotropic in each input dimension. Determining σ is not easy, and when the input size is large, GRNN becomes computationally expensive.
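Eqs. (6) and (7) can be implemented directly; the sketch below is a minimal GRNN-style predictor with an anisotropic σ, and the data and σ values are illustrative assumptions.

```python
import numpy as np

def grnn_predict(X_train, z_train, X_query, sigma):
    """Nadaraya-Watson form of Eq. (6) with the Gaussian kernel of Eq. (7).
    sigma may be a vector for anisotropic smoothing in each input dimension."""
    sigma = np.asarray(sigma, dtype=float)
    preds = []
    for x0 in X_query:
        d2 = np.sum(((X_train - x0) / sigma) ** 2, axis=1)
        K = np.exp(-0.5 * d2)                       # Gaussian kernel weights
        preds.append(np.sum(K * z_train) / np.sum(K))
    return np.array(preds)

# Hypothetical samples at (horizontal, depth) with anisotropic sigma
X_train = np.array([[0.0, 0.0], [10.0, 0.5], [20.0, 1.0]])
z_train = np.array([8.0, 9.0, 10.0])
print(grnn_predict(X_train, z_train, np.array([[5.0, 0.2]]), sigma=[20.0, 2.0]))
```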
2.4.3. Tree-based ensemble methods
Tree-based ensemble methods have become very popular because of their powerful nonlinear regression capabilities and because they are easy to understand. The regression tree is the basic unit of the tree-based ensemble methods. A regression tree can be constructed by dividing the source data into subsets. This process is repeated on each derived subset in a recursive manner, which partitions the data into disjoint regions. Regression tree learning makes no underlying assumptions about the form of the function to be estimated. Such methods are capable of estimating an unknown function of any form.
Ensemble learning is an ML technique that optimizes predictive models by combining several base models. Boosting is one of the ensemble techniques. Through weighted majority voting, ensemble learning turns a set of decision trees into a predictive value. Boosting can be described as a technique of reweighting data by applying successive tree-learning models. The weights are modified by evaluating the accuracy of the previous iteration. The model focuses on the cases that are difficult to predict so that it can learn from its mistakes. GB models minimize a loss function by following its gradient. The XGBoost model utilizes more accurate approximations to determine the best tree model. Unlike GB, XGBoost uses the second-order derivative as an approximation, which can be parallelized and is thus very fast. XGBoost has been utilized to predict the spatial distribution of geotechnical properties, such as the rockhead distribution using spatial geographic information (e.g. Zhu et al., 2021).
BAGGing (Bootstrap AGGregating) is another ensemble scheme to enhance prediction accuracy. Bootstrapping is a resampling method which uses random sampling with replacement to generate subsets. Firstly, multiple bootstrapped subsamples are generated from the whole sample. A regression tree is built on each bootstrapped subsample. Then these subsample regression trees are aggregated according to an ensemble scheme. The RF model is based on the BAGGing ensemble technique with a minor difference: a randomly selected subset of parameters is used in each tree. Bootstrapping generates a larger collection of datasets, which leads to more robust and accurate results.
RF is basically a weighted neighborhood scheme (Lin and Jeon, 2006). In a single tree, the neighboring samples are found automatically and assigned to the same leaf based on the similarity of the inputs. The final predictions are a weighted combination of all input samples:
$$\hat{Z}(x_0) = \sum_{i=1}^{N} w_i(x_0) Z(x_i) \quad (8)$$
For a regression tree, a sample xi is assigned a weight wi = 1/q only when it is one of the q points in the same leaf as x0. The number q is determined automatically based on the sample distribution, instead of a fixed number k as in the KNN method. RF is an averaged combination of m regression trees, thus the final prediction can be described as
$$\hat{Z}(x_0) = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{N} w_{ij}(x_0) Z(x_i) \quad (9)$$
In a forest, the neighborhood of x0 depends in a complex way on the structure of the trees, instead of on the Euclidean distance as in KNN and kernel-radius based methods.
The ET method is similar to RF but involves further randomization in the ensemble scheme. The splitting thresholds in the regression tree are selected randomly instead of computing the optimal split point for each variable. The split with the highest score among all the candidate thresholds is chosen as the final threshold, which reduces the prediction variance. Each tree is trained on the whole sample instead of on bootstrapped subsets as in the RF method, which reduces the prediction bias. When the number of trees is large, the ET prediction appears to be piecewise multi-linear and continuous. For geotechnical testing data, especially when the sampled locations are fixed along several lines, these characteristics are particularly useful for predicting a continuous and smooth geotechnical field. The ET result is much smoother than those of bagged trees and RF (Geurts et al., 2006).
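The weighted neighborhood view of Eqs. (8) and (9) can be verified numerically. The sketch below (with illustrative synthetic data) recovers the implicit leaf weights from a fitted scikit-learn ET model; ET is used here because each tree is fitted on the full sample, so the weighted sum reproduces the forest prediction exactly.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))            # illustrative inputs
z = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

et = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, z)

x0 = np.array([[5.0, 5.0]])
leaf_x0 = et.apply(x0)      # leaf index of x0 in each tree, shape (1, m)
leaf_X = et.apply(X)        # leaf indices of the training samples, shape (N, m)

# Accumulate the implicit weights w_ij = 1/q of Eqs. (8) and (9)
weights = np.zeros(len(X))
for j in range(et.n_estimators):
    same_leaf = leaf_X[:, j] == leaf_x0[0, j]
    weights[same_leaf] += 1.0 / same_leaf.sum()
weights /= et.n_estimators

print(np.allclose(weights @ z, et.predict(x0)[0]))   # True: matches the forest
```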
To illustrate the local mapping characterization of the tree-based ensemble methods, the ET and RF methods can be regarded as kernel-based models (Breiman, 2000; Geurts et al., 2006). The interpolation can be represented as
$$\hat{Z}(x_0) = \frac{\sum_{i=1}^{N} K_T(x_i, x_0) Z(x_i)}{\sum_{i=1}^{N} K_T(x_i, x_0)} \quad (10)$$
KT(xi, x0) cannot be explicitly described; it is related to the number of trees, the input sample size and the number of leaves in each tree. For an input dataset with N samples, the number of leaves l is determined automatically in each tree. The nodes in a tree are expanded until all leaves are pure or until all leaves contain fewer than the minimum samples per split nmin. nmin is adjustable in the tree-based models and is related to the smoothing effect in the final prediction; more details on nmin are discussed later. For an extreme case with an infinite sample size and an infinite ensemble of trees with a fixed number of leaves, the kernel can be approximated by an exponential function: KT(xi, x0) ≈ exp(−|xi − x0|1/λ), where λ is similar to the SOF concept for the kernel KT and is defined by λ = p/log10 l, where p is the dimension of the input space and l is the number of leaves.
Compared with conventional kernel methods, the kernel in the tree-based ensemble methods is adaptively adjusted based on the sample distribution. This characteristic makes these methods flexible enough to simulate various non-stationary random fields and to fully consider the spatial correlation of soil parameters.
This paper proposes a novel GDF system to explicitly quantify the spatial relationships among training samples in the inputs, so that spatial correlations can be captured in the prediction even when the dataset is small. A series of GDFs is designed based on the characteristics of geotechnical testing data. Geotechnical testing for site investigation is commonly conducted in drill holes, which are straight lines along the depth direction. Due to limited budgets, data spacing is large in the horizontal direction. Another characteristic of geotechnical data is that the horizontal SOF is much larger than the vertical SOF (Phoon and Kulhawy, 1999). Soil layers tend to be horizontally layered and commonly have nonlinear boundaries. GDFs explicitly describe the autocorrelated structure of the spatial relationships between the samples, which enables the ML models to infer the unknown samples considering the relevant spatial dependence. Based on the characteristics of geotechnical testing data, the following generic GDFs are designed:
(1) Surface distance field: distance field to the ground surface.
(2) Testing distance fields: distance fields to each geotechnical testing line (e.g. cone penetration test (CPT)).
(3) Corner distance fields: distance fields to the corners of the target geotechnical field.
Using three testing lines as an example, eight GDFs can be generated, as shown in Fig. 1.
The idea behind the surface distance field is the naturally large horizontal SOF of geotechnical data: the surface distance field helps to capture the horizontal soil layers. The testing distance fields are designed to help locate the adjacent available testing information, as the prediction is a weighted summation of neighborhoods. The corner distance fields are designed to divide the whole geotechnical field into small nonlinear regions. This is particularly useful when the soil layers are nonlinearly distributed. The corner distance fields are important for producing continuous and smooth predictions.
The process of generating the GDFs is shown in Fig. 2. Assume a geotechnical field discretized into a 4 × 7 grid, with a grid size of one unit by one unit. Three testing lines are shown as red vertical lines in Fig. 2.
Each value in the fields is the Euclidean distance to the red lines or red points. The red values in these distance fields are known samples that can be used as training data. The black values are unknown samples which need to be predicted.
Eight geotechnical distance fields are generated, so that each sample has eight input values with spatial distance information instead of two XY coordinates. For example, sample A is a known data point on line 1 which can be used as training data. The input sequence for sample A in the ML model is [0, 0, 2, 4, 3.2, 1, 5, 5.8], which is the distance value in each field in a fixed order. Sample B is an unknown point. The input sequence for sample B in the ML model is [0, 1, 3, 5, 3, 0, 6, 6.7].
The first number in the input sequence is the value in the surface distance field. All the samples in the same row have the same number. The idea behind this is that the horizontal SOF is usually large, so that the geotechnical variables in the same row tend to be similar. In contrast, the vertical SOF is usually small, thus the samples in different rows have distinct values.
The second to fourth numbers are the values from the testing distance fields. These explicitly lead the ML model to the adjacent available testing information, because the final prediction is a weighted summation of neighborhoods.
The last four numbers come from the corner distance fields. These numbers provide nonlinear information about the positions, which is particularly useful when the soil layers are nonlinearly distributed. These values are the key to producing smooth and continuous predictions.
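A minimal sketch of this construction is given below; it reproduces the input sequences of samples A and B from Fig. 2, assuming (as inferred from those worked values) testing lines at columns 1, 3 and 5 and a corner ordering of bottom-left, top-left, top-right, bottom-right.

```python
import numpy as np

def gdf_inputs(row, col, n_rows=4, n_cols=7, line_cols=(1, 3, 5)):
    """Eight GDF input values for one grid point of the 4 x 7 example:
    [surface, line 1, line 2, line 3, four corner distances]."""
    surface = float(row)                          # distance to the ground surface
    lines = [abs(col - c) for c in line_cols]     # distance to each testing line
    corners = [np.hypot(row - r, col - c)         # Euclidean distance to corners
               for r, c in [(n_rows - 1, 0), (0, 0),
                            (0, n_cols - 1), (n_rows - 1, n_cols - 1)]]
    return np.round([surface, *lines, *corners], 1)

print(gdf_inputs(0, 1))   # sample A: [0.  0.  2.  4.  3.2 1.  5.  5.8]
print(gdf_inputs(0, 0))   # sample B: [0.  1.  3.  5.  3.  0.  6.  6.7]
```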
The idea behind the GDFs is simple and easy to apply in engineering practice, yet the benefits are impressive. A comparison between using GDFs and XY coordinates is presented in Section 4. The whole GDF-ML framework is shown in Fig. 3.
The advantages of the GDFs are described as follows:
(1) There is no need to define or fit a covariance model. First- and second-order stationarity is not required.
(2) There is no need for trend modeling; the trend is fitted automatically in the GDF-ML framework.
(3) There is no need to define a radius or SOF as in Kriging or other kernel-based methods.
(4) It produces more accurate and smoother results (more details are discussed later).
(5) Individual samples along the testing lines are used as model input, and each sample is assigned a series of input values based on the GDFs. Only the sampled values are needed as outputs to train the models, so the framework is immune to missing values in the testing lines (e.g. incomplete CPTs).
(6) The GDFs can easily be extended to high-dimensional fields by calculating Euclidean distances in high-dimensional space.
(7) If additional information is available (e.g. geophysical data), this framework can incorporate it by adding additional inputs.
It should be noted that the information provided by the corner distance fields, testing distance fields and surface distance field will unavoidably be partially redundant. Using all of these fields as model input generally leads to a good result. Depending on the pattern of the soil layers, such as layers with nonlinear boundaries or trends in different directions, one can also select the important fields from the GDFs using feature selection methods. For a specific task, removing several unimportant inputs may improve the results slightly. However, the redundancies are kept here because this is a generic framework which may be applied to different patterns of target random fields. The details are discussed in the discussion section.

Fig. 1. GDFs designed for geotechnical testing data.

Fig. 2. Illustration of generating geotechnical distance fields.

Fig. 3. Flowchart of the GDF-ML framework.
In ML, the optimal hyperparameters for an algorithm can be determined by hyperparameter optimization. Cross-validation is often used to estimate the model generalization performance under specific hyperparameters. The conventional way of performing hyperparameter optimization is the grid search method. The search process involves manually selecting a subset of the parameter space and exhaustively evaluating the results. The hyperparameters with the highest score are then selected.
A better way is the Bayesian optimization method, which is a global optimization method. In Bayesian optimization, the hyperparameter values and outputs are modelled in a probabilistic manner, and new observations are gathered iteratively to update the hyperparameters. Compared to grid search, Bayesian optimization typically yields better results in fewer evaluations. When there are many hyperparameters, Bayesian optimization methods are recommended (Snoek et al., 2012).
Bayesian optimization is utilized in this paper to obtain the optimal hyperparameters for the six models. It should be noted that the sensitivity to hyperparameters differs considerably between models. The KNN method is sensitive to the number of neighbors, and the GRNN method is sensitive to the anisotropic bandwidth (standard deviation) parameters; these hyperparameters need to be determined carefully. The tree-based ensemble methods are quite robust, and their hyperparameters do not really need to be fine-tuned (Probst and Boulesteix, 2017).
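As one possible implementation, the sketch below runs cross-validated Bayesian optimization with the scikit-optimize package; the search space, iteration count and synthetic data are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Integer
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 20, size=(100, 8))      # stand-in for GDF inputs
z_train = 0.5 * X_train[:, 0] + rng.normal(size=100)

search = BayesSearchCV(
    ExtraTreesRegressor(random_state=0),
    {"n_estimators": Integer(100, 1000),
     "min_samples_split": Integer(2, 20)},
    n_iter=30,   # far fewer evaluations than an exhaustive grid search
    cv=5,        # cross-validation scores each candidate setting
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, z_train)
print(search.best_params_)
```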
The performance of the ML methods is assessed by estimating the error in the predictions. For the synthetic data, the "true" field is known, so the prediction error can be determined directly. RMSE is widely used to evaluate the performance of the predictions.
To test the performance of ML methods under the GDF-ML framework, six local mapping ML methods (RF, ET, GB, XGBoost, GRNN and KNN) are applied. The performance of the various methods is quantified in terms of RMSE using synthetic non-stationary random field data. In the synthetic cases, the ground truth is known, thus the performance of the approach can be readily quantified.
Two sets of validation procedures are designed to assess the effectiveness of the proposed framework and to find the best interpolation model under this framework. Firstly, the effects of using GDFs versus XY coordinate fields as model input are compared. Secondly, the performances of the six methods under the proposed framework are compared. Furthermore, the models' performance under different CPT numbers is compared.
Synthetic simulations can be used to generate the desired testing data to demonstrate the generality of the proposed methods. A non-stationary corrected cone tip resistance qt field is generated as an illustration. The qt field is simulated with resolutions of η1 = 0.05 m and η2 = 0.5 m along the depth and horizontal directions, respectively. The Gaussian random field is generated based on the randomization method (Heße et al., 2014). The mean value, standard deviation and SOFs along the depth (θv) and horizontal (θh) directions are needed to generate Gaussian random fields. In this example, the mean and variance are taken as 8 MPa and 4 MPa2, respectively. The SOFs are taken as θh = 20 m and θv = 2 m, which are consistent with the values reported in the literature (e.g. Phoon and Kulhawy, 1999).
A deterministic nonlinear trend is added to the generated random fields to simulate non-stationary random fields. A linear or quadratic trend is commonly preferred (e.g. Stuedlein et al., 2012). In this study, the synthetic data are assumed to obey a simple second-order increasing trend in the depth direction based on Eq. (11):
$$t(y) = a_0 + a_1 y + a_2 y^2 \quad (11)$$
where y represents the depth (m), and a0, a1 and a2 are the trend coefficients.
The spherical variogram function (Eq. (12)) is adopted to simulate the spatial correlation of the qt field:
$$\gamma(h) = \begin{cases} C_0 \left[ \dfrac{3h}{2\theta} - \dfrac{1}{2} \left( \dfrac{h}{\theta} \right)^3 \right], & 0 \le h \le \theta \\ C_0, & h > \theta \end{cases} \quad (12)$$
where h represents the distance between two data points, θ represents the SOF, and C0 is the sill (variance).
The summation of the unconditional random field and the nonlinear trend is treated as the 2D "true" qt field, as shown in Fig. 4. CPT data can be sampled along vertical lines in the "true" qt field.
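A sketch of this generation step is given below using the GSTools package, whose default generator implements the randomization method of Heße et al. (2014); the field extents, seed and trend coefficient are illustrative assumptions, and the mapping between GSTools length scales and SOFs depends on the SOF definition adopted.

```python
import numpy as np
import gstools as gs

xh = np.arange(0.0, 100.0, 0.5)    # horizontal coordinate, eta2 = 0.5 m
yd = np.arange(0.0, 20.0, 0.05)    # depth coordinate, eta1 = 0.05 m

# Anisotropic spherical model: variance 4 MPa^2, length scales set from
# the SOFs theta_h = 20 m and theta_v = 2 m (conversion assumed 1:1 here)
model = gs.Spherical(dim=2, var=4.0, len_scale=[20.0, 2.0])
srf = gs.SRF(model, mean=8.0, seed=0)          # randomization generator by default
field = srf.structured([xh, yd])               # shape (len(xh), len(yd))

# Add a quadratic depth trend as in Eq. (11); the coefficient is illustrative
qt_true = field + (0.005 * yd**2)[np.newaxis, :]
```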
Five CPTs are used as an illustration of the proposed framework. The spacing between CPTs is 25 m, which is common in real projects. The performances of all methods using XY coordinates and GDFs are compared. The results based on the RMSE values are shown in Fig. 5.
The results indicate that the XGBoost, RF and GB methods show obvious improvement under the GDF framework. ET and GRNN also benefit from the GDFs, with a minor improvement. An interesting observation is that the more a method struggles in the XY fields, the more it improves with the GDFs. The KNN result is essentially unchanged, because the nearest neighbors are the same in both the XY fields and the GDFs under the Euclidean distance. The ET method has the best performance under both the XY fields and the GDFs.
RMSE reflects an averaged accuracy. The performance of the ML approaches can be compared directly by visualizing the interpolation results. Using the widely used RF method as an example, the results based on the XY fields and the GDFs are shown in Figs. 6 and 7.
Comparing Fig. 6 with Fig. 7, the XY fields (Fig. 6) clearly show linear boundaries along horizontal and vertical lines in the results (obvious artifacts). This is because the random forest trees divide the whole region into small rectangles based on the coordinate values; within each region, the values are piecewise constant. Obvious discontinuities appear midway between adjacent CPTs, which is not the case in the real profile. The GDFs enable nonlinear region division based on distances instead of coordinates. More local patterns can be captured using GDFs as inputs compared with XY coordinates alone.
Similar artifacts of linear boundaries appear in the KNN, XGBoost, GB and GRNN results. When GDFs are utilized, the linear boundary artifacts are reduced in the RF, XGBoost, GB and GRNN results. The KNN results do not change, as the nearest neighbors do not change. The ET method has the best performance and a continuous spatial pattern in both the XY fields and the GDFs. The ET result with the GDFs is shown in Fig. 8.
ET asymptotically produces continuous, piecewise multi-linear functions. The results are thus smoother than the piecewise constant ones obtained with the other methods, which leads to better accuracy and lower variance in the predicted regions.

Fig. 4. Synthetic "true" qt field and CPT samples.

Fig. 5. Comparison of RMSE with GDFs and XY fields.
It should be noted that, although only one case is presented here, a number of trial experiments conducted by the first author indicate that the GDFs hold a steady improvement over the XY fields for the tree-based ensemble methods. The GRNN method is improved in most cases, but not always, because it is difficult to determine the optimal parameters for each kernel radius.
In this section, CPT numbers from 3 to 10 are adopted to investigate the effect of different information densities on the model accuracy. For each number of CPTs, the proposed GDF-ML framework is adopted to predict the values at unknown locations. The RMSEs of the six methods are then calculated and shown in Fig. 9.
The RMSE decreases rapidly as the number of CPTs increases to six, at which point the CPT spacing is larger than 1θh and less than 2θh. When the CPT spacing is less than 1θh (more than six CPTs), the information provided by adjacent CPTs largely overlaps, so the RMSE decreases more slowly. The mean value of the synthetic field is 10.8 MPa. With only 3 CPTs, the proposed framework still gives a reasonable estimate, with an RMSE of around 2.1 MPa. As the number of CPTs increases, the predicted qt values become almost identical to those of the original qt field in Fig. 4, even at locations far from the sampled CPTs. With 10 CPTs, the RMSE decreases to around 1.2 MPa. Considering that the final result is an averaged effect of the original random field, this result is sufficiently accurate. This indicates that the proposed framework can be utilized regardless of whether the sample information is sparse or dense. Even with 10 CPTs, the spacing between the CPT tests is 11 m, which is acceptable in real projects. The result based on 10 CPTs with the ET method is shown in Fig. 10.
Comparing Fig. 10 with the original "true" field (Fig. 4), 10 CPTs with the ET method provide an impressive result which captures nearly all the local fluctuations of the "true" field. The maximum and minimum values are slightly different, which may be due to two reasons:

Fig. 6. Predicted qt field based on random forest with XY coordinates.

Fig. 7. Predicted qt field based on random forest with geotechnical distance fields.
(1) Sample bias. All predictions are based on the sampled CPTs. If the range between the maximum and minimum values of the CPT samples is narrower than the range of the "true" field, there will be a slight difference in the final maximum and minimum values.
(2) The proposed framework produces a smooth interpolation field that averages individual values, which may smooth out the extreme values in the field. This effect is similar to that of the Kriging methods.
The GDF-ML methods are generally fast enough for training and prediction. The computation times for the different methods are shown in Table 1, obtained on a personal computer with an Intel(R) Core(TM) i9-9900 CPU @ 3.10 GHz and 32 GB RAM under a 64-bit Windows 10 operating system. The fastest method is XGBoost, which is consistent with the description in Section 2: the XGBoost method is designed for parallel computation. KNN is fast, as it simply assigns weights to a fixed number of neighbors with low computational requirements. The total speed of the tree-based ensemble methods depends on the number of trees used in the models. There are 814 trees in the ET model and 453 trees in the RF model. Due to the randomization, the computation time per tree in the ET method is lower than in the RF method. GRNN is the slowest method, owing to the anisotropic kernel radius for all input parameters.

Table 1. Computational time for training and prediction for the six methods with 5 CPTs.
As discussed previously, the ET method is the most promising method for geotechnical data interpolation; the combination is named the GDF-ET framework. The following discussion focuses on the GDF-ET framework. The effect of training dataset size, the input variable importance, the hyperparameters and the uncertainty estimation are studied as follows.
Geotechnical tests such as the CPT can provide abundant data in the depth direction, but for some special tests, the data in the depth direction are also sparse. Although the tree-based ensemble methods are well known for their ability to deal with small training datasets, the minimum training dataset size for which this framework remains applicable needs to be investigated.
When the number of testing lines is fixed, the size of the training dataset is directly related to the vertical resolution of the test lines. Different resolutions of measurement data are investigated to find the minimum size of the training dataset. Using a case with six testing lines (i.e. the distance between adjacent testing lines is 20 m) as an example, 100 calculations are conducted with different vertical resolutions. The vertical resolution ranges from 0.05 m to 5 m with a step size of 0.05 m. The length of each testing line is 20 m. At a resolution of 0.05 m, there are 400 points in each testing line; at a resolution of 5 m, there are 4 points in each testing line. The RMSE results based on the GDF-ET method with different resolutions are shown in Fig. 11.
Fig. 11 shows that the accuracy and resolution are generally positively correlated, indicating that the number of available data points is directly related to model performance. The linearly increasing trend of the RMSE indicates that the model performance is stable even with a small training dataset. If a model were not applicable to small training datasets, an exponential trend would appear, with the error increasing rapidly as the dataset shrinks. The linear correlation is strong from a resolution of 0.05 m to 2.5 m, and then disperses from 2.5 m to 5 m. This is because, as the resolution decreases, sample bias strongly influences the model performance. The best RMSE of around 1.4 MPa is achieved by the model with 0.05 m resolution; the worst RMSE is around 2.3 MPa at 5 m resolution. Thus, even though the resolution is reduced 100 times, the prediction accuracy only decreases by 64%. This shows that the proposed GDF-ET method is applicable even when the testing resolution is coarse. Using the resolution of 2.5 m as an example, the prediction results are shown in Fig. 12.
Though the available data points are very sparse in this case, Fig. 12 shows that the results are generally similar to the true qt field in Fig. 4.
In a single regression tree, let the random input vector (X1, …, Xp) represent the input variables, each coming from a geotechnical distance field. These inputs contribute to a random output variable Z. The feature importance in a tree-based method can be determined based on the impurity index. For each tree, all nodes splitting on the specific variable are selected; the impurity improvement Δi(st, t) at each such split is computed, and the total improvement over all trees is accumulated to obtain the feature importance for that variable. For a regression tree, the impurity can be calculated based on the variance. The importance can be described by the following function (Breiman, 2001; Louppe et al., 2013):
$$\mathrm{Imp}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t) \, \Delta i(s_t, t) \quad (13)$$
where p(t) is a weight equal to the proportion Nt/N of samples reaching node t, v(st) is the variable used in split st, NT is the number of trees, and N is the number of samples. The higher Imp(Xm), the more important the variable. The importance results for each GDF are shown in Fig. 13.
Fig. 8. Predicted qt field based on extra trees with geotechnical distance fields.
From Fig. 13, the most important field among the GDFs is the surface distance field. This follows from the nature of the synthetic random field: an increasing trend and a small SOF in the depth direction. The corner distance fields are more important than the CPT testing distance fields. The total importance of all CPT testing distance fields is close to the average importance of the corner distance fields. If the CPT testing distance fields are removed, the RMSE increases from 1.54 MPa to 1.57 MPa. It should be noted that these field importance values may change with different patterns of the soil layers, such as layers with nonlinear boundaries or trends in other directions. For some specific tasks, removing several unimportant inputs (e.g. importance < 0.01) may improve the results slightly.
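In scikit-learn, the impurity-based importance of Eq. (13) is exposed directly; the sketch below (with illustrative synthetic data standing in for the eight GDF inputs) shows how the per-field importances can be extracted.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(500, 8))   # stand-in for the eight GDF inputs
z = 0.8 * X[:, 0] + 0.1 * X[:, 4] + rng.normal(scale=0.2, size=500)

et = ExtraTreesRegressor(n_estimators=300, random_state=0).fit(X, z)
names = ["surface", "line 1", "line 2", "line 3",
         "corner 1", "corner 2", "corner 3", "corner 4"]
# feature_importances_ implements the normalized impurity-based importance
for name, imp in sorted(zip(names, et.feature_importances_), key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")
```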
As an ensemble tree-based method, the number of trees is the basic parameter of the ET model. The model performance stabilizes once the number of trees increases to a critical value. By default, a few hundred trees may provide satisfactory results. The ET model is somewhat immune to overfitting as the number of trees grows large (Probst and Boulesteix, 2017). However, more trees require more computational resources.
An interesting hyperparameter highlighted here is the minimum samples per split in a tree. A node of a regression tree is only split if it contains at least the minimum samples per split. This parameter defaults to two. A smaller minimum samples per split results in a deeper tree with more splits, which leads to higher variance and smaller bias.
In geotechnical spatial interpolation, this parameter holds an essential physical meaning, and may be used to account for measurement error in future. Using CPT 1 as an example, the predicted and true values at CPT 1 under minimum samples per split of 2 and 10 are compared in Fig. 14.
In Fig. 14, the green scatter points are the true qt values from the "true" synthetic field. It can be seen that the green points change rapidly, since the SOF is small in the depth direction. The predicted results for different values of the minimum samples per split parameter are drawn as the red dashed line and the green solid line in Fig. 14. If the minimum samples per split is 2, the predicted CPT values remain identical to the true values (green solid line). This amounts to assuming that all the CPT values are completely reliable, similar to the Kriging methods. When the minimum samples per split is 10 (red dashed line in Fig. 14), the prediction is a smoothed version of the sampled CPT values, i.e. regression instead of interpolation. The larger the minimum samples per split, the smoother the prediction. The number of samples per split can be determined by the optimization methods mentioned above. It is suggested that the optimization space should focus on larger values when the CPT results are not reliable.
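The interpolation versus regression behavior is easy to verify with scikit-learn, as sketched below on illustrative synthetic data: with min_samples_split = 2 the ET model reproduces the training values exactly, while a larger value yields a smoothed fit.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 20, size=(400, 8))   # stand-in for GDF inputs
z_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=400)

et_interp = ExtraTreesRegressor(min_samples_split=2,
                                random_state=0).fit(X_train, z_train)
et_smooth = ExtraTreesRegressor(min_samples_split=10,
                                random_state=0).fit(X_train, z_train)

# Exact fit at the samples (interpolation) vs. a smoothed fit (regression)
print(np.allclose(et_interp.predict(X_train), z_train))   # True
print(np.allclose(et_smooth.predict(X_train), z_train))   # False
```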
The interpreted soil property profile and its uncertainty greatly affect geotechnical designs and analyses. When limited data are available, the estimated mean and standard deviation are largely affected by sample bias. Quantification of the interpolation uncertainty is essential but tricky. The ET method addresses this problem naturally through its ensemble structure: multi-learner integration reduces model errors by approximating the true hypothesis. A simple way to quantify the prediction uncertainty associated with ensemble learning is statistical analysis of the ensemble trees (e.g. Shi and Wang, 2021).

Fig. 9. Comparison of RMSE with different numbers of CPTs.

Fig. 10. Predicted qt field based on extra trees with 10 CPTs.
According to the central limit theorem, the resampling distribution of the mean approximates a normal distribution. The standard deviation of the predictions from the individual regression trees at input X can be used to estimate the uncertainty of the prediction:
$$\sigma(X) = \sqrt{\frac{1}{m-1} \sum_{j=1}^{m} \left( \hat{Z}_j(X) - \mu(X) \right)^2}, \qquad \mu(X) = \frac{1}{m} \sum_{j=1}^{m} \hat{Z}_j(X) \quad (14)$$
where Ẑj(X) is the prediction of the jth regression tree and m is the number of trees.
Fig. 15 shows the standard deviation of the qt field. At locations where CPT soundings are available, the standard deviation is almost zero, but it is rather large far away from those locations. This makes sense because the information at the sounding locations is available and used as training input in the proposed method, whereas the magnitudes of the uncertainties are relatively large at unsampled locations, where no information is available.
For a reasonable uncertainty estimation tool, the statistical uncertainty in the interpreted qt profile should reduce substantially as the number of measured CPT lines increases; in the limiting case where all qt points are measured, virtually no statistical uncertainty is associated with any point. For clarity of presentation, the predicted qt profiles at the location x = 40 m are extracted to show the evolution of the uncertainties. This location is selected because it lies near the middle of adjacent CPTs in both the 5-CPT and 10-CPT cases, which makes it almost the most challenging location for prediction; this can also be verified in the standard deviation map in Fig. 15. Comparing the red dashed line and the blue solid line in Figs. 16 and 17, it can be seen that as more CPTs become available, the simulated qt values (red dashed line) become increasingly similar to the original qt profile (blue solid line).
Figs. 16 and 17 include the interpolation uncertainty as a 95% confidence interval, shown by the green filled area. The upper and lower bounds of the confidence interval are calculated as μqt,i ± 1.96σqt,i (i = 1, 2, …, 400), where μqt,i is the mean value of the ith point and σqt,i is the corresponding standard deviation calculated from Eq. (14). The 95% confidence interval indicates that there is a high probability that the real qt will lie within it, and the results in Figs. 16 and 17 are consistent with this pattern. The 95% confidence interval narrows significantly as the number of CPTs increases from 5 to 10. This suggests that as the number of CPTs increases, the simulated qt samples become more reliable even at far-away unsampled locations, reflecting the data-driven nature of the GDF-ET method.
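A minimal sketch of this ensemble-based uncertainty estimate with scikit-learn is shown below; the synthetic data are illustrative, and the per-tree predictions come from the fitted estimators of the ET model.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 20, size=(300, 8))   # stand-in for GDF inputs
z_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=300)
X_query = rng.uniform(0, 20, size=(50, 8))    # unsampled locations

et = ExtraTreesRegressor(n_estimators=500, random_state=0).fit(X_train, z_train)

# One prediction per fitted regression tree, shape (m, n_query)
tree_preds = np.stack([t.predict(X_query) for t in et.estimators_])
mu = tree_preds.mean(axis=0)
sigma = tree_preds.std(axis=0, ddof=1)        # Eq. (14)

lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma   # 95% confidence interval
```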

Fig. 11. Model performance comparison of the GDF-ET method with different resolutions.

Fig. 12. Interpolation result for a resolution of 2.5 m based on the GDF-ET method.

Fig. 13. Results of the importance for each geotechnical distance field.
It is important to note the advantages of the uncertainty estimation in GDF-ET compared with traditional geostatistical methods. In Kriging, the estimation uncertainty is independent of the target data distribution: the standard deviation depends only on the covariance model and the sample distances. In reality, however, prediction uncertainties are strongly affected by the target value distribution; for example, the prediction uncertainty within a uniform layer will be significantly lower than the uncertainty at the boundaries between soil layers. The uncertainty estimation in the GDF-ET framework is therefore more informative than that of the Kriging methods. While it should be emphasized that the proposed data-driven approaches are flexible and relatively easy to use, their limitations should also be acknowledged. Unlike geostatistical methods such as Kriging, the prediction process in ML cannot be described with explicit formulae, so the question of the interpretability of the results arises. It is suggested that if enough data are available to estimate the model parameters, geostatistical methods should be considered first.
This study proposes a generic GDF-ML framework for improving the spatial interpolation of geotechnical site investigation data, resulting in more accurate mapping and characterization of soil profiles with uncertainty estimation. The main conclusions of the paper are as follows:
(1) The application of the proposed GDF-ML framework to spatial interpolation is novel. Compared with the geostatistical methods, there is no need for tedious covariance model estimation and detrending. Compared with the XY field based ML methods, the prediction error (RMSE) of the GDF-based models is lower. The visualization results show that interpolation based on the GDFs is closer to the true field and is significantly smoother and more continuous than that based on the XY fields. The GDF-ML framework can easily be extended to 3D cases and can incorporate additional information, such as geophysical data, by adding additional inputs.

Fig. 14. Predicted CPT values under minimum samples per split of 2 and 10.

Fig. 15. Uncertainties of the predicted qt field.

Fig. 16. Comparison between the original and reconstructed qt profiles based on 5 CPTs.

Fig. 17. Comparison between the original and reconstructed qt profiles based on 10 CPTs.
(2) The GDF-ML framework can be utilized with sparse to dense sample information. The prediction performance is reasonably good even when only 3 CPTs are available, and the model performance becomes increasingly reliable as more data become available. As the number of CPTs increases, the accuracy first increases rapidly and then levels off as the information becomes overlapped.
(3) The GDF-ET method is observed to be the most accurate among the six tested methods. The advantages of this method include: high prediction accuracy; smooth and continuous predictions even for highly nonlinear soil profiles; fast computation; interpretable input variable importance; robustness to hyperparameter settings; applicability to very small training datasets; flexibility from interpolation to regression via the minimum samples per split hyperparameter; and the ability to quantify the associated uncertainty.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study was funded by the Australian Government through the Australian Research Council's Discovery Projects funding scheme (Project DP190101592) and the National Natural Science Foundation of China (Grant Nos. 41972280 and 52179103).