Kaixun He*, Kai Wang, Yayun Yan
1 College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China
2 Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China
Keywords: Near-infrared spectroscopy; Chemical processes; Process systems; Soft sensor; Gasoline blending
ABSTRACT: Training sample selection is widely accepted as an important step in developing a near-infrared (NIR) spectroscopic model. For industrial applications, the initial training dataset is usually selected empirically. This process is time-consuming, and updating the structure of the modeling dataset online is difficult. Because of the static structure of the modeling dataset, the performance of the established NIR model can degrade during online operation. To cope with this issue, an active training sample selection and updating strategy is proposed in this work. The advantage of the proposed approach is that it can select suitable modeling samples automatically according to the process information. Moreover, it can adjust the model coefficients in a timely manner and effectively avoid arbitrary updating. The effectiveness of the proposed method is validated by applying it to an industrial gasoline blending process.
Gasoline is one of the most profitable products of a refinery and can yield most of its total profit [1]. Before delivery, the blending operation is the key stage that guarantees gasoline products satisfy their specifications [2]. Fig. 1 illustrates a typical blending system, which blends all the component oils from the upstream process or component tanks into the finished product according to a certain recipe [3]. Considering that the blended oil is pumped into product tanks directly without any further processing, the blended gasoline must meet specifications. Otherwise, it must be re-blended, which causes considerable evaporation losses and delays the delivery of orders. To guarantee successful operation and improve profitability, optimal control of the blending system has been widely applied [4]. Considering that optimal control mainly aims to optimize the blending recipe, the quality-relevant variables must be measured online. However, these primary properties can only be acquired via traditional offline laboratory analysis, which may introduce a considerable delay.
Accordingly, the near-infrared (NIR) spectroscopy analyzer has been widely used in recent years [5,6]. The greatest advantage of NIR technology is that it can provide estimation results more rapidly than traditional methods [7]. The application of NIR technology is based on the development of modern chemometrics, and numerous modeling algorithms have been studied to establish the NIR model [8]. Considering that NIR data usually include hundreds of wavelengths, projection-based methods, such as principal component regression (PCR), partial least squares (PLS), and their extended versions, are generally adopted [9]. With these common modeling algorithms, the NIR model is developed using historical samples before being launched in an online environment to provide prediction results. The main factors that affect the accuracy of the NIR model include the quality and representativeness of the modeling samples and the maintainability of the modeling algorithm [10]. The general method to develop a training dataset is to collect additional samples in the target conditions. The established NIR model can then cover as many modes as possible, which improves its robustness and extends its prediction range. However, the training samples are mainly selected by trial and error, which is a time-consuming process that is difficult to execute online. To address this issue, methods based on similarity (generally defined by the Euclidean distance) can be adopted. The basic idea of such methods is to define a similarity index between the query sample and the historical samples. Then, the samples in the historical dataset with a large similarity index are selected into the final training dataset. Notably, the similarity-based method does not consider the interaction among all the selected samples when a model is established. Hence, the accuracy and robustness of the NIR model are comprehensive embodiments of all the information in the modeling samples, and obtaining high performance by using this method is
difficult. Accordingly, in this work, an active training sample selection strategy is proposed. Our strategy includes three main steps: 1) the most representative data for future conditions are selected from the historical dataset based on a statistical rule; 2) an objective function is defined according to a proposed weighted root mean square error (wRMSE); and 3) training samples are selected by using a genetic algorithm (GA) to minimize the objective function. From the perspective of combinatorial optimization, the interaction among all the modeling samples is fully considered. Moreover, on the basis of the newly defined wRMSE, the proposed strategy can obtain the optimal training dataset. Considering that the whole selection process requires little manual intervention, this strategy has high operating efficiency and is suitable for industrial application.

Fig. 1. Typical gasoline blending system.
If the training dataset is determined, the NIR model can be established by traditional modeling methods. As mentioned above, PCR and PLS can be used; however, their linear nature often leads to poor prediction accuracy, especially for processes with a large range of different states [11]. Hence, nonlinear modeling algorithms, such as kernel PCR or PLS, artificial neural networks, least squares support vector regression [12], and Gaussian process regression [13], can be adopted as a straightforward strategy. Reference [7] provides a comprehensive comparative study of these nonlinear methods. Although the above linear and nonlinear algorithms have been widely applied and have achieved good results, time variation remains a challenge, because off-spec products are not desired in industrial processes. Moreover, due to the static and global nature of all the above-mentioned algorithms, the NIR model cannot be updated and maintained in a timely manner [14]. As a result, the performance can degrade over time. To overcome this issue, various adaptive schemes have been proposed, such as bias updating, recursive adaptation, locally weighted regression (LW), just-in-time learning (JITL), and moving window [15-17]. Bias and recursive adaptation techniques are widespread adaptive methods in industry and have low computational complexity. However, determining how often the model should be updated is difficult [18]. Moreover, neither model can be adapted during the sampling interval, because a new reference value is an essential condition, which is usually obtained using analyzers or in the laboratory. To address this issue and handle abrupt changes, the local learning strategy has gained considerable attention in both academic and process engineering areas. The JITL and LW strategies are typical representatives of local learning, which do not establish the model offline but launch a local model for each query sample [19,20]. The adaptation of JITL does not rely on new sampling points. Rather, it reselects modeling samples from the historical dataset according to a calculated similarity index. However, the number of local training samples is difficult to tune.
In this work, the LW strategy is adopted. Unlike in JITL, the training samples of LW are determined offline according to expert knowledge, and only the weight of each training sample is recalculated online [21]. The advantage is that the structure of the initial training dataset is not deteriorated by arbitrary updating when abrupt changes are handled online. The conventional LW strategy only emphasizes different weights on samples, while all variables are treated with the same importance. However, in the NIR spectrum, the signal-to-noise ratio of each wavelength differs. The informative wavelengths will improve the effectiveness of the NIR model, whereas others may jeopardize it [22]. Therefore, an information-related variable-wise weight should be considered for each wavelength during the application of the LW strategy. To achieve this function, this work proposes an adaptive updating strategy, which considers both sample importance and wavelength importance for the NIR model. In combination with the proposed training sample selection strategy, the accuracy and robustness of the NIR model can be greatly improved.
The remainder of this paper is organized as follows: In Section 2, brief introductions are given for traditional locally weighted partial least squares regression (LWPLS) and the basic procedures of training sample selection by using GA. In Section 3, the proposed active training sample selection method and model maintenance strategy are discussed in detail. Case studies in an industrial application are provided in Section 4. Finally, the main contributions of this work are concluded in Section 5.
In this section, the derivation of LW regression and the procedures of training sample selection by GA are presented.
The fundamental idea of the locally weighted strategy was proposed approximately two decades ago [23,24]. Considering that it can cope with changes in process characteristics and with nonlinearity, this method has been widely applied in chemometrics, chemistry, and engineering. Many modeling algorithms, such as PCR and PLS, can be combined with LW to improve their adaptability. In this part, the details of LWPLS, which is the basic framework of our proposed model maintenance strategy, are introduced.
The input and output data matrices are given by X = [X_1, X_2, …, X_N]^T ∈ ℝ^(N×M) and Y = [y_1, y_2, …, y_N]^T ∈ ℝ^(N×1), where N is the number of samples, M is the dimension of the input variables, and the superscript T denotes the transpose of a vector or matrix. To develop LWPLS, the local weight matrix Ω = diag(ω_1, ω_2, …, ω_N) should be defined in advance. The element ω_i denotes the weight between the ith training sample X_i and the query sample X_q. It is usually defined based on the Euclidean distance as follows:

ω_i = exp( − d_i,q / ( γ · std(d_i,q) ) )
where d_i,q is the Euclidean distance between X_i and X_q, std(d_i,q) is the standard deviation of the distances d_i,q, and γ represents the localization parameter, which is usually determined by cross-validation. On the basis of the weight matrix Ω and the PLS algorithm [25], LWPLS can be developed through the following procedures:
1) The local weights ω_i and the weight matrix Ω are calculated for the query sample X_q.
2) The weighted-mean variables x̄_w and ȳ_w are calculated as follows:

x̄_w = Σ_{i=1..N} ω_i X_i / Σ_{i=1..N} ω_i ,  ȳ_w = Σ_{i=1..N} ω_i y_i / Σ_{i=1..N} ω_i
3) All the samples are weighted-mean-removed according to the following procedure:

X̃ = X − 1_N x̄_w^T ,  Ỹ = Y − 1_N ȳ_w
where 1_N ∈ ℝ^N is a vector of ones.
4) The covariance matrices R_xx and R_xy are computed:

R_xx = X̃^T Ω X̃ ,  R_xy = X̃^T Ω Ỹ
5) The number of latent variables L is determined, and l = 1 is set.
6) The weight vector w_l is calculated:

w_l = R_xy / ‖R_xy‖
7) The auxiliary vector r_l is derived:

r_1 = w_1 ;  r_l = w_l − Σ_{j=1..l−1} ( p_j^T w_l ) r_j  for l > 1
8) The lth latent variable and the loading vector of X are derived:

t_l = X̃ r_l

p_l = R_xx r_l / ( r_l^T R_xx r_l )
9) The covariance matrix R_xy is deflated as follows:

q_l = r_l^T R_xy / ( r_l^T R_xx r_l ) ,  R_xy ← R_xy − p_l q_l ( r_l^T R_xx r_l )
10) If l = L, then the regression coefficients are computed as follows:

β = Σ_{l=1..L} r_l q_l
11) Otherwise, l = l + 1 is set, and step 6 is performed.
Generally, the number of latent variables L is determined by cross-validation. If l = L, then the output estimate ŷ_q = ȳ_w + ( X_q − x̄_w )^T β can be calculated.
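The procedure in steps 1)-11) can be condensed into a short NumPy sketch. This is an illustrative implementation of the covariance-based LWPLS prediction for a single query sample, not the authors' code; the variable names mirror the text, and the exponential weight with a pooled standard deviation of the distances is an assumption.

```python
import numpy as np

def lwpls_predict(X, Y, x_q, n_components=2, gamma=0.1):
    """Locally weighted PLS prediction for one query sample x_q.

    Follows the covariance-based procedure described above: sample-wise
    weights, weighted centering, then latent-variable extraction from the
    weighted covariance matrices Rxx and Rxy.
    """
    # 1) sample-wise weights from the Euclidean distance to the query
    d = np.linalg.norm(X - x_q, axis=1)
    omega = np.exp(-d / (gamma * d.std()))
    # 2) weighted means
    xw = omega @ X / omega.sum()
    yw = omega @ Y / omega.sum()
    # 3) weighted-mean removal
    Xc = X - xw
    Yc = Y - yw
    # 4) weighted covariance matrices
    Rxx = Xc.T @ (omega[:, None] * Xc)
    Rxy = Xc.T @ (omega * Yc)
    # 5)-11) extract L latent variables and accumulate the coefficients
    beta = np.zeros(X.shape[1])
    R, P = [], []
    for _ in range(n_components):
        w = Rxy / np.linalg.norm(Rxy)      # 6) weight vector
        r = w.copy()                        # 7) auxiliary vector r_l
        for p_j, r_j in zip(P, R):
            r -= (p_j @ w) * r_j
        tt = r @ Rxx @ r                    # weighted score magnitude
        p = Rxx @ r / tt                    # 8) loading vector
        q = r @ Rxy / tt
        Rxy = Rxy - p * q * tt              # 9) deflation
        beta += r * q                       # 10) coefficient build-up
        R.append(r); P.append(p)
    return yw + (x_q - xw) @ beta
```

With all M latent variables and noiseless linear data, the sketch reproduces the weighted least-squares fit exactly, which is a convenient sanity check.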
GA is a general searching method based on the concept of "survival of the fittest." It has frequently been reported as a variable selection technique, but only a few studies have used GA to optimize the structure of training datasets. The details and descriptions of GA can be found in Reference [26]. In this part, we briefly introduce the idea of GA and present the specific steps to optimize the structure of the modeling samples by using GA.
As an iterative procedure, GA maintains a constant-size population of candidate solutions. During the search, a new population of candidate solutions to the optimization problem is evolved by selection, crossover, and mutation. The optimal individual of the current population is retained in each iteration, so the population is constantly being improved. Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population. In addition, to solve an optimization problem by using GA, a fitness function, which is used to evaluate each individual, should be defined in advance. According to the above description, the steps of training sample selection by GA are as follows:
Step 1: The fitness function is defined. Commonly, the cross-validation RMSE of PLS is used to evaluate the performance of a certain modeling dataset. Therefore, the value of RMSE can be used as the fitness.
Step 2: The GA is initialized, and the population size (psize), maximum number of iterations (iter), crossover probability (pc), and mutation probability (pm) are set.
Step 3: The initial population P0 is generated randomly. Each individual in P0 is a string of 0-1 integer variables; if a sample in the historical dataset is selected as a modeling sample, then the corresponding gene in the individual is set to 1.
Step 4: The fitness of each chromosome is calculated.
Step 5: Selection, crossover, and mutation operations are performed, and a new population is generated.
Step 6: The optimal individual of each iteration is retained, and the next step is performed.
Step 7: If the GA converges, then the optimal individual Sopt is provided. Otherwise, Step 4 is performed again.
Step 8: The optimal individual Sopt is decoded, and the labels of the selected modeling samples are obtained.
As mentioned above, the selection of modeling samples from the historical dataset is a combinatorial optimization problem. Therefore, we can use the binary coding method to develop a GA. In addition, when calculating the RMSE, a specific part of the data needs to be predicted by the PLS model developed on the selected modeling samples. This choice is a key factor that affects the search results and will be further discussed in Section 3.
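Steps 1-8 can be sketched as follows. The fitness here is the RMSE of an ordinary least-squares fit on a fixed validation block, a stand-in for the cross-validated PLS RMSE of the paper; the toy dataset, the truncation selection, the one-point crossover, and the operator settings are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy "historical" dataset: 60 candidate samples, 4 wavelengths
X_hist = rng.normal(size=(60, 4))
y_hist = X_hist @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=60)
X_val, y_val = X_hist[:15], y_hist[:15]        # stand-in for ObjSet

def fitness(ind):
    """RMSE of a least-squares model trained on the selected samples."""
    if ind.sum() < 5:                           # need enough samples to fit
        return np.inf
    Xs, ys = X_hist[ind == 1], y_hist[ind == 1]
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return float(np.sqrt(np.mean((X_val @ coef - y_val) ** 2)))

psize, iters, pc, pm = 30, 50, 0.8, 0.02        # Step 2: GA parameters
pop = rng.integers(0, 2, size=(psize, 60))      # Step 3: random 0-1 individuals

best, best_fit = None, np.inf
for _ in range(iters):                           # Steps 4-7
    fits = np.array([fitness(ind) for ind in pop])
    order = np.argsort(fits)
    if fits[order[0]] < best_fit:                # Step 6: keep the elite
        best, best_fit = pop[order[0]].copy(), fits[order[0]]
    parents = pop[order[: psize // 2]]           # truncation selection
    children = []
    while len(children) < psize:
        a, b = parents[rng.integers(len(parents), size=2)]
        if rng.random() < pc:                    # one-point crossover
            cut = rng.integers(1, 60)
            a = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(60) < pm               # bit-flip mutation
        children.append(a ^ flip)
    pop = np.array(children)

selected = np.flatnonzero(best)                  # Step 8: decode S_opt
```

In practice the validation block would be the inferred future-condition subset and the inner model would be PLS, but the loop structure is the same.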
The estimation and generalization performance of the NIR model largely depend on the quality and quantity of the training samples. Hence, these samples must be carefully determined to construct an accurate NIR model. Moreover, monitoring a time-variant industrial process with a global NIR model is difficult. To overcome these problems, an effective solution is presented in this section. For the first issue, an active training sample selection strategy is proposed. For the second one, an adaptive weight updating method is adopted and improved.
When establishing an NIR model for industrial processes, the general method is to collect more samples in the target conditions to form a training dataset, allowing the established model to cover as many modes as possible. However, the range of the training samples involves a trade-off between the accuracy and robustness of the final NIR model. The robustness of the NIR model improves with a large number of training samples, but the prediction accuracy is reduced. By contrast, with a narrow range of modeling samples, providing the desired prediction performance for the entire process is difficult. In addition, the combined effect of the training samples on the developed NIR model must be considered, because the estimation accuracy is a comprehensive representation of all the modeling samples. In practice, the compromise among these problems and the optimal selection of training samples mainly depend on experienced engineers, which is a time-consuming process. Considering that the data collected from industrial processes basically follow a normal distribution, the future process condition can be inferred approximately from historical data combined with the production information. Afterwards, on the basis of the inferred process conditions, the training samples that minimize a loss function can be selected using an optimization algorithm. From this point of view, an active training sample selection strategy is proposed.
To optimize the structure of the training dataset, the mean and variance of future data should be estimated first. Given the historical data OrigSet = {X ∈ ℝ^(N×M), Y ∈ ℝ^(N×1)}, the mean value ȳ and standard deviation std_Y of the outputs Y can be calculated as follows:

ȳ = (1/N) Σ_{i=1..N} y_i ,  std_Y = sqrt( Σ_{i=1..N} ( y_i − ȳ )² / ( N − 1 ) )
Considering that the output Y approximately follows a normal distribution, the steady-state range of future process data can be estimated by using the following equations:

y_up = ȳ + k · std_Y ,  y_low = ȳ − k · std_Y
Here, k ∈ [1,3] is an empirical parameter that is mainly determined by the actual production information and operator experience. A small value of k corresponds to an increased likelihood that the future process condition will tend toward the center of the historical dataset. The values y_up and y_low are the upper and lower bounds of the inferred future condition, respectively. Then, a subset of the historical dataset can be determined by

ObjSet = { ( X_i , y_i ) ∈ OrigSet | y_low ≤ y_i ≤ y_up }

This subset represents the steady-state samples of the future process condition, and ObjSet is used to derive the mean and deviation values of the future condition:

ObjMeanY = (1/n) Σ y_i ,  ObjStdY = sqrt( Σ ( y_i − ObjMeanY )² / ( n − 1 ) )  (over samples in ObjSet)

where n is the number of samples in the dataset ObjSet, and ObjMeanY and ObjStdY are the mean and standard deviation values of ObjSet, respectively.
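The inference of the future-condition subset can be written in a few lines. The following sketch assumes the interval form y_low ≤ y_i ≤ y_up described above; the function name and the sample-standard-deviation convention (ddof=1) are our choices.

```python
import numpy as np

def infer_objset(Y, X, k=2.0):
    """Select the inferred future-condition subset (ObjSet) from the
    historical outputs, following the normal-distribution argument above.
    k in [1, 3] is the empirical parameter set from production knowledge."""
    y_mean, y_std = Y.mean(), Y.std(ddof=1)
    y_up, y_low = y_mean + k * y_std, y_mean - k * y_std
    mask = (Y >= y_low) & (Y <= y_up)
    obj_mean = Y[mask].mean()
    obj_std = Y[mask].std(ddof=1)
    return X[mask], Y[mask], obj_mean, obj_std
```

A small k shrinks the band around the historical mean, so fewer, more central samples define the target condition.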
According to the above analysis, the representative data of the future process can be obtained. With the above equations, in practical applications, a suitable dataset can be selected from the historical database when the production order is released. Then, GA can be applied. In addition, a loss function, which is used as the objective function, should be designed. In general, the cross-validation RMSE defined below can be used as the objective function:

RMSE = sqrt( (1/N) Σ_{i=1..N} ( ŷ_i − y_i )² )
Commonly, a small RMSE corresponds to good model performance. Considering that this index is calculated based on the modeling samples, evaluating the performance of the developed model for the future process condition is difficult. To overcome this issue, an improved index named wRMSE is defined as follows:

wRMSE = sqrt( [ Σ_{κ=1..n_l} w_κ ( ŷ_κ − y_κ )² + Σ_{t=1..n−n_l} w_t ( ŷ_t − y_t )² ] / n )

where n is the number of samples in the dataset ObjSet. Moreover, κ = 1, 2, …, n_l indexes the samples in ObjSet that are included in the PLS training dataset, and t = 1, 2, …, n − n_l indexes the samples in ObjSet that are not included. Hence, the predicted values of the samples in ObjSet contain two parts. For the samples {X_κ, y_κ}, the prediction ŷ_κ is calculated by leave-one-out cross-validation, because these samples are included in the training dataset. For the samples {X_t, y_t}, the prediction ŷ_t is provided directly by the developed PLS model.
The weight w_i of the ith sample in ObjSet is defined as follows:

w_i = exp( − | y_i − ObjMeanY | / ( γ · ObjStdY ) )
where γ is the localization parameter. Fig. 2 shows that the weight w_i decreases steeply when γ is small and gradually when γ is large. According to Eqs. (20) and (21), the weights of wRMSE will be small if the outputs of the samples deviate considerably from the mean value ObjMeanY. This result can be easily interpreted from Fig. 2. Thus, during the search, the optimization algorithm will focus on the prediction accuracy of the data that are close to the mean value ObjMeanY. According to the abovementioned procedures, searching for the optimal training samples in the historical dataset is an unconstrained combinatorial optimization problem. The details for solving this problem are presented in Section 2.
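The weighted index can be sketched as below. For brevity, the sketch applies one weight per sample without distinguishing the cross-validated and directly predicted parts; the exponential weight form follows the locally weighted convention used elsewhere in the paper and is an assumption here.

```python
import numpy as np

def wrmse(y_true, y_pred, obj_mean, obj_std, gamma=0.1):
    """Weighted RMSE over ObjSet, a sketch of the index described above.

    Samples whose reference values lie far from ObjMeanY receive small
    weights, so the search concentrates on the inferred future condition.
    """
    err = np.abs(y_true - obj_mean)
    w = np.exp(-err / (gamma * obj_std))       # per-sample weight w_i
    return float(np.sqrt(np.mean(w * (y_pred - y_true) ** 2)))
```

A sample far from ObjMeanY therefore contributes little to the objective even when its prediction error is large.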
LWPLS is introduced to establish an online adaptive model to handle time-variant and nonlinear problems. Moreover, a supervised method called the adaptive weighted updating strategy is proposed to improve the adaptability of traditional LWPLS. In LWPLS, the weight matrix Ω is generally defined by a sample-wise weighting method. In other words, it only emphasizes different weights on samples, while all the variables in one sample are treated with the same importance. Considering that different variables may also play different roles in a quantitative analysis model, a variable-wise weight needs to be introduced. To address this issue, the following equation is adopted:

d_i,q = sqrt( ( X_i − X_q ) Π ( X_i − X_q )^T )

where Π = diag(Π_1, Π_2, …, Π_m, …, Π_M). The diagonal element Π_m is the weight of the mth variable, which is usually defined based on the regression coefficients of the PLS model. However, the effectiveness of the original weights may deteriorate due to the nonlinear and time-variant characteristics of the industrial process. Accordingly, the absolute values of the regression coefficients of RPLS are used to form the matrix Π in this work. Thus, the weight matrix Ω can be updated following the RPLS model. Afterwards, LWPLS can be established for each query sample X_q.
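The variable-wise weighted distance can be sketched directly. Taking the diagonal of Π as the absolute regression coefficients follows the text; the function itself is an illustrative sketch.

```python
import numpy as np

def weighted_distance(X, x_q, coef):
    """Variable-wise weighted Euclidean distance between the query x_q
    and every row of X. Informative wavelengths (large |coefficient|)
    dominate the distance; uninformative ones are suppressed."""
    pi = np.abs(coef)                 # diagonal of the matrix Pi
    diff = X - x_q                    # (N, M) differences to the query
    return np.sqrt((diff ** 2 * pi).sum(axis=1))
```

Setting a coefficient to zero makes the corresponding wavelength invisible to the distance, which is exactly the intended effect of the variable-wise weight.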
In the proposed method, the RPLS algorithm is not adopted to calculate the predicted value ŷ_q, mainly because it cannot be effectively updated during the sampling interval. This is especially true in the gasoline blending process, where the reference value y_q is usually measured in a laboratory, which always leads to a large sampling interval. In our method, the locally weighted and recursive strategies are combined to ensure that the performance of the developed NIR model is not jeopardized by a large sampling interval. In addition, the method can fully utilize the new sample (X_q, y_q) to track changes in the process conditions. The detailed step-by-step procedures of the proposed updating method are summarized below.
Step 1: The training dataset {X, Y} is selected from the historical database according to the procedures described in Section 3.1.
Step 2: The covariance matrices R_xx and R_xy are initialized by using Eqs. (5) and (6).
Step 3: The number of latent variables L is determined by cross-validation.
Step 4: The RPLS model is established, and the variable weight matrix Π is generated.
Step 5: When the query sample X_q is available, the distances d_i,q can be calculated; then, LWPLS is applied to compute the estimated value ŷ_q.
Step 6: When the reference data y_q are available, the mean values and the covariance matrices can be updated as follows:

x̄ ← ( N x̄ + X_q ) / ( N + 1 ) ,  ȳ ← ( N ȳ + y_q ) / ( N + 1 )

R_xx ← λ R_xx + X̃_q^T X̃_q ,  R_xy ← λ R_xy + X̃_q^T ỹ_q

where X̃_q and ỹ_q denote the mean-removed new data, λ ∈ (0,1] is a forgetting factor, and the updating formulations of R_xx and R_xy are deduced according to Reference [27]. Using this method, the covariance matrices can be updated recursively with low computational complexity.

Fig. 2. Characteristic curves of the weight function. (Here, Err = |y − ObjMeanY|; the left half of this figure depicts Eq. (21) when ObjStdY = 1, and the right half depicts how the weight w varies with Err.)
Step 7: Step 4 is performed.
From the abovementioned procedures, online updating operates on two basic levels in the proposed method: (i) during the sampling interval, the weight matrix Ω is updated according to the query sample X_q, while the matrix Π remains unchanged; and (ii) when the reference data y_q are available, the variable weight matrix Π can be updated, and information about the current process can be introduced into the original LWPLS model.
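The two-level scheme can be sketched as a small class. This is a skeleton only: the RPLS step is replaced by a plain least-squares refit to generate the variable weights Π, the local model uses a single latent variable, and the new sample is simply appended instead of updated recursively; all three simplifications are our assumptions, not the authors' implementation.

```python
import numpy as np

class OnlineNIRModel:
    def __init__(self, X, Y, gamma=0.1):
        self.X, self.Y = X.copy(), Y.copy()
        self.gamma = gamma
        self._refit_variable_weights()

    def _refit_variable_weights(self):
        # stand-in for RPLS: least-squares coefficients on centered data
        Xc = self.X - self.X.mean(axis=0)
        Yc = self.Y - self.Y.mean()
        coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
        self.pi = np.abs(coef)                    # diagonal of Pi

    def predict(self, x_q):
        # level (i): sample-wise weights recomputed for every query,
        # using the variable-wise weighted distance
        d = np.sqrt((((self.X - x_q) ** 2) * self.pi).sum(axis=1))
        w = np.exp(-d / (self.gamma * (d.std() + 1e-12)))
        xw = w @ self.X / w.sum()
        yw = w @ self.Y / w.sum()
        Xc, Yc = self.X - xw, self.Y - yw
        # one-component weighted PLS for brevity
        wv = Xc.T @ (w * Yc)
        wv /= np.linalg.norm(wv)
        t = Xc @ wv
        q = (w * Yc) @ t / ((w * t) @ t)
        return yw + (x_q - xw) @ wv * q

    def update(self, x_q, y_q):
        # level (ii): when the reference value arrives, absorb the new
        # sample and refresh the variable-wise weights
        self.X = np.vstack([self.X, x_q])
        self.Y = np.append(self.Y, y_q)
        self._refit_variable_weights()
```

The essential structure survives the simplifications: `predict` only touches the sample weights, while `update` is the only place the variable weights change.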
The aforementioned training sample selection method can improve the performance of the NIR model by optimizing the structure of the training dataset. However, the modeling dataset also needs to be updated according to the characteristics of the new process condition, because changes in operating conditions may jeopardize the representativeness of the selected modeling samples. To do so, the key issue is to determine the timepoint of reconstruction. In this work, a hypothesis-based approach is adopted to handle this problem. Consider the following hypothesis test:

H_0 : σ_1² = σ_0² ,  H_1 : σ_1² ≠ σ_0²

F = s_0² / s_1² ~ F( n_0 − 1 , n_1 − 1 )

where n_0 and n_1 are the numbers of current process samples and modeling samples, respectively, and s_1² and s_0² are the sample variances of the training dataset and the process sample set, respectively.
If the null hypothesis is rejected, then the training samples should be re-selected by the proposed method. At this moment, ObjMeanY is reset according to the current process samples, and the historical samples subject to |y_i − ObjMeanY| ≤ err are selected to develop the new ObjSet, where err is the predefined accuracy. Then, the training sample selection algorithm can be re-applied.
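The reconstruction trigger can be sketched as an F-test on the variance ratio. The two-sided form and the significance level are assumptions; the paper only states that a hypothesis test on the two sample variances is used.

```python
from scipy import stats

def variance_change_detected(s2_train, n1, s2_proc, n0, alpha=0.05):
    """Two-sided F-test comparing the variance of the recent process
    samples (s2_proc, n0 samples) against the training dataset
    (s2_train, n1 samples). Returns True when the null hypothesis of
    equal variances is rejected, i.e. the training set should be
    re-selected."""
    F = s2_proc / s2_train
    lo = stats.f.ppf(alpha / 2, n0 - 1, n1 - 1)
    hi = stats.f.ppf(1 - alpha / 2, n0 - 1, n1 - 1)
    return F < lo or F > hi
```

A larger window n0 makes the test more sensitive but lowers the updating frequency, which is the same trade-off the paper discusses for the parameter n0.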
A flowchart of the proposed active training sample selection and model updating strategy is illustrated in Fig. 3.
On the basis of the proposed method, the training dataset can be initialized automatically. Subsequently, with the use of LWPLS, a local NIR model can be constructed for online prediction. When the query data X_q are obtained, the first step is to initialize RPLS to generate Π. Then, the weight matrix Ω can be calculated. Afterwards, LWPLS is available, and the final prediction is provided in the following form:

ŷ_q = b_0 + Σ_{m=1..M} b_m x_q,m
where b_0 and b_m are the regression coefficients calculated by LWPLS. When the reference value y_q is available, the mean value ȳ and the standard deviation std_Y are updated as follows:

ȳ_new = ( N ȳ + y_q ) / ( N + 1 ) ,  std_Y,new = sqrt( [ ( N − 1 ) std_Y² + ( y_q − ȳ )( y_q − ȳ_new ) ] / N )
Then, ObjSet can be updated according to Eqs. (15) and (16). With the updated ObjSet, the training sample selection algorithm can be performed automatically. This procedure not only improves the adaptiveness of the established model, but also greatly reduces the workload of manual maintenance.
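When a new reference value arrives, the output mean and standard deviation can be refreshed recursively without revisiting the whole history; Welford's update below is one standard form of such a recursion, offered here as an assumption.

```python
def update_mean_std(mean, std, n, y_new):
    """Recursively update the running mean and sample standard deviation
    of the outputs when a new reference value y_new arrives.
    Uses Welford's formulas; n is the current sample count."""
    new_mean = (n * mean + y_new) / (n + 1)
    new_var = ((n - 1) * std ** 2 + (y_new - mean) * (y_new - new_mean)) / n
    return new_mean, new_var ** 0.5, n + 1
```

The update costs O(1) per reference value, so it fits naturally between laboratory sampling cycles.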

Fig. 3. Flowchart of the proposed algorithm.
In this section, the proposed intelligent training sample selection and weighted updating PLS (ITRW-PLS) is applied to the online prediction of the research octane number (RON) of gasoline in an actual blending process. Four modeling approaches, namely, PLS, RPLS, LWPLS, and JIT-PLS, are adopted for comparison. The details of these algorithms are as follows:
1. PLS: The structure and coefficients of the PLS model are determined offline and remain unchanged.
2. RPLS: The model coefficients of RPLS are updated in each iteration when a new sample pair (X_q, y_q) is available.
3. LWPLS: A local model is established for each query sample X_q, and different weights are assigned to the training samples. The details of LWPLS are presented in Section 2.1.
4. JIT-PLS: A local model is established with the local training samples whose Euclidean distances to X_q are small. In this study, Eq. (31) is used to select the local training samples for JIT-PLS:

d_i,q = ‖ X_i − X_q ‖ ≤ ε

where ε is the distance threshold.
Considering that PLS is the basic modeling algorithm in all the mentioned methods, for a fair comparison, the same number of latent variables L was applied for all methods, optimized by tenfold cross-validation. The threshold ε in JIT-PLS was also determined by cross-validation on the modeling dataset. The localization parameter γ is included in LWPLS and ITRW-PLS. This parameter determines the local weight of each modeling sample when LWPLS is developed. However, the value of this parameter also restricts the amplitude of the index wRMSE, which affects the structure of the selected training dataset when the proposed training sample selection procedures are conducted. In this work, the optimal value of γ is searched from 0 to 1 with a step size of 0.005. The parameter k in ITRW-PLS depends on the production information and the engineers' experience. A large k value indicates that the future process condition will cover a large scope of the historical dataset. To compare the accuracy of these methods, the RMSE and the correlation coefficient R² are employed for performance evaluation:

RMSE = sqrt( (1/N_t) Σ_{i=1..N_t} ( ŷ_i − y_i )² ) ,  R² = 1 − Σ ( ŷ_i − y_i )² / Σ ( y_i − ȳ )²

Conventionally, a small RMSE indicates high predictive precision, and a large R² indicates good interpretability of the model. Considering that a large deviation of the predicted value could cause drastic fluctuations of the blending recipe, which will lead to an unqualified product, the number of samples exceeding the predefined accuracy error (denoted as NSE) is also adopted to evaluate the models' performance.
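The three evaluation indices can be computed as follows; the default error limit of 0.5 corresponds to the predefined accuracy used for RON in the case study.

```python
import numpy as np

def evaluate(y_true, y_pred, err=0.5):
    """RMSE, R^2, and NSE (number of samples whose absolute prediction
    error exceeds the predefined limit err), as used for the model
    comparison above."""
    resid = y_pred - y_true
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    r2 = float(1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    nse = int(np.sum(np.abs(resid) > err))
    return rmse, r2, nse
```

NSE complements RMSE and R² by counting the individual failures that would actually disturb the blending recipe, rather than averaging them away.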
The computer configuration for the experiments was as follows: OS: Windows 8.1 (64-bit); CPU: AMD A8-7100 Radeon R5 (1.90 GHz); RAM: 6.95 GB. All the algorithms mentioned in this work were implemented in MATLAB 2012b.
A total of 460 gasoline samples were collected from a refinery in China. The first 322 samples of the series were used as testing data, and the other 138 samples were treated as historical data. The spectral range of the gasoline samples was restricted to 1100-1300 nm with a nominal resolution of 1 nm, and the RON values were measured using standard ASTM testing methodologies.
In this case, the err of RON was set at 0.5 according to process knowledge. The initial parameters of the training sample selection algorithm are tabulated in Table 1. According to the relationship between the RMSE and the number of latent variables of PLS on the original training dataset (illustrated in Fig. 4), the parameter L was set to 5.

Table 1 Parameters of the training sample selection algorithm
The localization parameter γ of LWPLS and the threshold of JIT-PLS were set to 0.025 and 0.005 by cross-validation, respectively. For a fair comparison, the same parameters were adopted in all the comparison tests in this study. On the basis of Eqs. (24) and (25), the values of ObjMeanY and ObjStdY were 93.87 and 0.49, respectively. The optimal training samples were obtained in the 127th iteration, the optimal value of wRMSE was 0.1217, and the running time of the GA was 360.51 s. According to the GA results, 121 samples were selected. This number is considerably smaller than the number of original historical samples. The locations of the selected points in the historical dataset are displayed in Fig. 5.
To verify the efficacy of the proposed training sample selection strategy, PLS, LWPLS, and RPLS were developed on both the original and the selected datasets. The prediction results are shown in Tables 2 and 3.
In terms of RMSE and R², the models (PLS, LWPLS, and RPLS) established on the selected training dataset have only a slight advantage over those built with the traditional method. The main reason is that the initial historical dataset already covers most of the future process conditions. Moreover, the selected dataset retains this useful information but cannot provide additional process information without new sampling points. However, in terms of NSE, the performance of the NIR models developed with the selected dataset is much better, because the existence of high-level points in the training dataset increases the instability of the NIR model. Fig. 5 shows that most of the high-level points were removed from the selected dataset, and the remaining samples contain the major information of the target process. The proposed training sample selection method has thus been demonstrated to provide a robust modeling dataset, which also proves the necessity and importance of training sample selection. To evaluate the performance of the proposed updating strategy, all the mentioned algorithms were developed based on the selected training samples. The parameter n_0 of ITRW-PLS was set to 25 empirically. A large value of n_0 could lead to a low updating frequency of the training dataset. Therefore, in practice, we should compromise between the update frequency and the model accuracy when determining the value of n_0.

Fig. 4. Relationship between the RMSE and the number of latent variables of PLS.

Fig. 5. Selected modeling samples in the historical data of gasoline. (For proprietary reasons, the property values were normalized.)
At the 185th sampling time, the null hypothesis was rejected. At this moment, ObjMeanY = 93.209, and the ObjSet was re-developed based on the current historical dataset. Moreover, a new set of training samples was selected by the GA after 290.65 s. In an actual blending process, the sampling interval of the NIR analyzer is 3 min; that is, the training dataset could be updated after approximately two sampling cycles.
A detailed comparison of PLS, RPLS, LWPLS, JIT-PLS, and ITRW-PLS is illustrated in Figs. 6 and 7. The prediction accuracies are reported in Table 3. Apparently, the local learning methods (LWPLS, JIT-PLS, and ITRW-PLS) outperform the global learning methods (PLS and RPLS). In terms of RMSE and R², ITRW-PLS achieves the best performance among all the models. This finding shows that the proposed ITRW-PLS has satisfactory prediction accuracy. According to the evaluation index NSE, ITRW-PLS has an obvious advantage, namely, better robustness. In this case, the performance of RPLS was not satisfactory, although its regression coefficients can be updated recursively. The inferiority of RPLS is mainly due to arbitrary updating with all the sampling points. The addition of redundant samples increases the computational complexity, whereas the introduction of high-level samples leads to abrupt errors in the prediction value, which can be clearly seen in Fig. 6.
On the basis of the similarity index, JIT-PLS selects the local modeling samples for each query sample. This method cannot fully consider the interaction between the modeling samples, and no standard principle can be used to determine the number of local modeling samples. These conditions lead to instability in the number and structure of each local modeling dataset. The locally weighted LWPLS is an effective modeling method that performs better than RPLS and JIT-PLS. Considering that the modeling samples of LWPLS were optimized by the GA in advance, the model has improved accuracy and stability as long as the process condition does not drift greatly. However, its accuracy is still not as good as that of the proposed algorithm. The prediction curve of ITRW-PLS is depicted in Fig. 7, where the prediction is close to the actual value, and only four samples exceed the maximum error. The main advantage of ITRW-PLS is that it updates both the sample-wise and variable-wise weights recursively. Moreover, unlike in the RPLS algorithm, the recursive updating strategy in ITRW-PLS is used only to update the variable weights. This strategy not only prevents the excessive updating commonly found in RPLS, but also adaptively introduces new information about the process. More importantly, the introduction of the hypothesis test in the online process can detect changes in the process conditions in time and guide the NIR model to be updated actively. Fig. 7 shows that at the 185th sampling time, the training dataset was forced to be reselected according to the result of the hypothesis test. Moreover, the number of testing samples with a large prediction error is considerably reduced after the training dataset was updated. These results explain why ITRW-PLS outperforms LWPLS and the other traditional modeling methods in an actual blending process. Moreover, training sample selection and model updating are the key to the long-term application of the NIR model.

Table 2 Performance of algorithms modeled by the original historical dataset
This paper expounds on the importance of the modeling algorithm for the NIR analyzer in the gasoline blending process and points out that traditional modeling and updating strategies are insufficient for constructing a highly accurate NIR model for long-term industrial application. In addition, in the NIR modeling process, using an optimized training dataset is better than using the whole historical dataset. Hence, this work introduces an active training sample selection and adaptive weighted updating framework to establish the NIR model. With the proposed training sample selection method, suitable modeling samples can be determined efficiently and automatically, which greatly reduces the modeling time. With the proposed updating strategy, the adaptiveness of the NIR model is improved, and excessive and improper updates are prevented. Applications to an actual industrial dataset indicate that the proposed method outperforms the recursive-based, locally weighted-based, and JIT-based algorithms. Although we applied LWPLS in this article, the intelligent training sample selection and updating framework can be easily extended to other modeling approaches. Further work focusing on reducing the computational complexity of the training sample selection procedure is being planned. We believe that the proposed strategies will further reduce the dependence of modeling on engineers and improve the precision of the NIR model.

Table 3 Performance of algorithms modeled by the selected training dataset

Fig. 6. Prediction error curves of RON by PLS, RPLS, and LWPLS. (Points exceeding the maximum error limit are marked with a red circle.)

Fig. 7. Prediction error curve and bias of RON by ITRW-PLS. (For proprietary reasons, the property values were normalized.)
Nomenclature
b regression coefficients
d_i,q distance between X_i and X_q
err pre-defined error
iter maximum number of iterations in GA
L number of latent variables in the PLS calculations
M number of variables in database X
N number of samples in database X
n_l number of samples in ObjSet that are not used in the PLS model
n_0 number of current process samples
n_1 number of modeling samples
ObjSet selected objective dataset from the historical database
ObjStdY standard deviation of ObjSet
P_0 initial population
p_c crossover probability
p_m mutation probability
p_size population size of P_0
Rxx covariance matrix of X (M×M)
Rxy covariance matrix of X and y (M×1)
S_opt the optimal individual
stdY standard deviation of Y
X input data matrices
X_i the ith sample in database X
X_q input of the query sample
Y output data matrices
y_i the ith sample in database Y
y_low lower bound of the inferred future condition
y_up upper bound of the inferred future condition
w_i weight of the ith sample in ObjSet
γ location parameter in the LWPLS
Π variable weight matrix (M×M)
Ω sample weight matrix (N×N)
? distance threshold in the JIT-PLS
Chinese Journal of Chemical Engineering, 2019, Issue 11