Data cleaning method for the process of acid production with flue gas based on improved random forest

2023-10-19 10:19:18

Chinese Journal of Chemical Engineering, 2023, Issue 7

Xiaoli Li, Minghua Liu*, Kang Wang, Zhiqiang Liu, Guihai Li

1 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China

2 Beijing Key Laboratory of Computational Intelligence and Intelligent System, Engineering Research Center of Digital Community, Ministry of Education, Beijing 100124, China

3 Guixi Smelter, Jiangxi Copper Co., Ltd, Guixi 335400, China

4 Beijing RTlink Technology Co., Ltd, Beijing 100102, China

Keywords:

ABSTRACT

1. Introduction

With the rapid and steady development of the economy, the demand for metal resources is increasing. However, metal smelting produces a large amount of sulfur-containing flue gas. As public awareness of environmental protection grows, controlling the flue gas emitted from metal smelters has become particularly important. To protect the environment, the sulfur in flue gas is often recovered by producing sulfuric acid.

Acid production with flue gas is a complex nonlinear process with multiple variables and strong coupling. Operating data is an important basis for status monitoring, operation optimization and control, and fault diagnosis of the process [1–3], and it is the information basis for improving the efficiency and level of sulfuric acid production [4]. The operating environment is complex: there is a large amount of equipment, and the links of the process are strongly coupled. As a result, the data obtained by the detection equipment is seriously contaminated and prone to anomalies such as missing values and outliers, which makes the data difficult to analyze and process. It is therefore of great significance for subsequent modeling and control of the process to eliminate outliers accurately and to fill in the missing data.

Because the characteristics of outliers in data are difficult to identify, a variety of abnormal data recognition methods have been proposed. These include methods based on probability distribution, density, and the distance between data points. Specific methods include the pauta criterion, the quartile method, DBSCAN clustering [5], and so on. Li et al. [6] use the pauta criterion to clean the data in the production process of slag powder. If the absolute difference between a data point and the sample mean is greater than 3 times the sample standard deviation, the point is judged to be an outlier. The study finds that the model built on data cleaned by this method has higher accuracy. Zhao et al. [7] use kernel learning theory and robust regression analysis to obtain Gaussian normal distribution data, and the study shows that this method effectively suppresses the interference of abnormal data. Zhai et al. [8] use the fuzzy c-means clustering algorithm for anomalous traffic detection, then use an adaptive regression method for feature extraction, and design the optimal clustering strategy. The study shows that the anomaly detection algorithm has a high detection ability. Tian et al. [9] detect anomalies in time series data by segmenting the series with a sliding window and separating outliers from normal values according to the confidence interval distance radius. Fei et al. [10] use the k-means algorithm to detect outliers in wireless sensor network data and distinguish normal data from abnormal data according to the distance between data points and cluster centers. Han et al. [11] use the isolation forest algorithm to identify and eliminate outliers in the data of a municipal wastewater treatment process.

Among the above outlier handling methods, those based on probability distribution are only suitable for data with known distribution characteristics. Methods based on clustering algorithms can only find global outliers; they have difficulty identifying the abnormal features of local data and cannot process data sets with regions of different densities. The isolation forest algorithm performs well at eliminating abnormal data in the wastewater treatment process, which is controlled by multi-variable coupling. Therefore, it is also suitable for the process of acid production with flue gas.
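The pauta (3σ) criterion described above is simple enough to sketch in a few lines. The function name `pauta_outlier_mask`, the synthetic series, and the use of the sample standard deviation are assumptions made for illustration; they are not taken from Ref. [6].

```python
import numpy as np

def pauta_outlier_mask(values, k=3.0):
    """Return a boolean mask that is True where |x - mean| > k * std."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)  # sample standard deviation (an assumption)
    return np.abs(values - mean) > k * std

# Synthetic series (assumed): 200 ordinary readings plus one gross error.
rng = np.random.default_rng(0)
series = np.append(rng.normal(loc=10.0, scale=1.0, size=200), 100.0)
mask = pauta_outlier_mask(series)
print(int(mask.sum()), "point(s) flagged")
```

Because the mean and standard deviation are themselves distorted by outliers, this rule works best when gross errors are rare, which is consistent with the limitation on distribution assumptions noted above.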

To compensate for the data that is missing after outlier elimination, a variety of data compensation methods are widely used, including interpolation, support vector machine regression [12], BP neural networks [13], and so on. Chai et al. [14] use optimal weighted average interpolation to compensate for mixed data. The method balances the influence of historical data and adjacent data and realizes asynchronous measurement data fusion. However, it relies too heavily on the quality of historical and adjacent data and cannot accurately compensate for the abnormal data in any cluster. Lv et al. [15] propose a method based on the maximum variance weight information coefficient to compensate for the coal gas data of metallurgical enterprises. The research shows that this method significantly improves compensation accuracy. Pang et al. [16] propose a data-driven network compensation method, which calculates the control delta from the output error and then designs the data compensation strategy according to the control delta. Gu et al. [17] establish a data compensation model using a BP neural network combined with the artificial bee colony algorithm. The research shows that this method can improve the speed and accuracy of data compensation. Zhu et al. [18] use a random forest regression model optimized by the fruit fly algorithm to predict wind speed, and find that the proposed algorithm has higher prediction accuracy. Feng et al. [19] propose adversarial smoothing regularization (ASR), which regularizes the learning model to be robust against local perturbation in a semi-supervised manner. Furthermore, the adversarial smoothing tri-regression (ASTR) model is proposed for soft sensing, using the information of unlabeled samples with pseudo labels. Yu et al. [20] propose a method combining a denoising autoencoder and elastic net (DAE-EN) to handle process nonlinearity and noise interference in process monitoring and fault isolation. The research shows that the proposed method can effectively detect abnormal samples in industrial processes and accurately isolate the faulty variables from the normal ones.

Using a neural network to compensate for data requires valid training data, and other algorithms are needed to assist the judgment. In the actual process of acid production with flue gas, abnormal data contains not only the abnormal characteristics of a single variable but also the synchronous or asynchronous characteristics of multiple variables. The random forest regression model fits wind speed data, which fluctuates and is random, well. Therefore, it is also applicable to fitting and predicting the data of acid production with flue gas. Moreover, if the random forest algorithm is improved, its fitting and prediction performance can be expected to be better.

The random forest algorithm can be improved by optimizing its hyperparameters. Commonly used hyperparameter optimization methods are grid search, random search, and Bayesian optimization. Grid search determines the optimal value by evaluating all points within the search range. However, when there are many hyperparameters to tune and the amount of data is large, this method consumes a lot of computing resources and is inefficient. Random search samples a fixed number of parameter settings from a specified distribution and takes the best of them. It is faster than grid search, but because the sampling is random, its results are not necessarily globally optimal. Bayesian optimization builds a probability model from past evaluations of the objective function and then finds the value that minimizes the objective function according to that model. However, the cost of evaluating the results in Bayesian optimization is very high, and the probabilistic surrogate model rests on many assumptions. When the number of attributes is large or the correlation between them is strong, its decisions have a certain error rate and the effect is not very good.
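The difference in cost between grid search and random search can be made concrete by counting candidates. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler`; the search ranges are hypothetical and chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search ranges, chosen only for illustration.
space = {
    "n_estimators": list(range(10, 201, 10)),   # 20 values
    "max_depth": list(range(2, 21)),            # 19 values
    "min_samples_leaf": list(range(1, 11)),     # 10 values
    "min_samples_split": list(range(2, 12)),    # 10 values
}

# Grid search: every combination of values is a candidate.
n_grid = len(ParameterGrid(space))

# Random search: a fixed budget of candidates drawn from the same ranges.
sampled = list(ParameterSampler(space, n_iter=50, random_state=0))

print(n_grid, len(sampled))
```

With these ranges the grid contains 20 × 19 × 10 × 10 = 38000 candidates, each requiring a model fit, while random search evaluates only the 50 sampled settings and gives no guarantee that the best of them is globally optimal.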

In view of the above problems, this paper proposes a data cleaning method based on improved random forest to clean abnormal data in the process of acid production with flue gas. Firstly, an abnormal data identification model based on isolation forest is designed to identify and eliminate outliers in the process data. Secondly, an improved random forest regression model is established to fit and predict the trend of the data. Finally, the improved random forest regression model is used to compensate for the data set after the outliers are removed.

The rest of this paper is organized as follows. Section 2 describes data sources and characteristics. Section 3 clarifies the methodology and introduces the proposed algorithm named improved random forest. Then, Section 4 presents and discusses the experimental results. Finally, Section 5 concludes this paper.

2. Data Sources of Acid Production with Flue Gas

Real data from acid production with flue gas at a copper plant in Jiangxi province in March 2021 is selected as the experimental data. The computer collects data every minute, so 1440 sets of data are collected at each data point every day. The resulting data set is therefore large, which is of great significance for subsequent modeling and control.

Fig. 1 shows the industrial process of acid production with flue gas. Copper smelting in the flash furnace produces flue gas containing SO2. The flue gas enters the heat exchanger through the fan for heat exchange. It then passes through the first, second, and third floors of the converter, where SO2 is catalytically converted into SO3. After that, the gas containing SO3 goes into the first absorption tower to produce sulfuric acid. The remaining flue gas enters the fourth floor of the converter through the heat exchanger. After catalytic conversion, the flue gas enters the second absorption tower to produce sulfuric acid.

Fig. 1. The process of acid production with flue gas. In this process, the conversion of SO2 to SO3 is completed. Then SO3 is absorbed to produce sulfuric acid (see details in Section 2).

The whole industrial process includes dozens of variables, such as fan outlet oxygen concentration, fan outlet pressure, fan inlet flow, and so on. However, the instability of the flash furnace smelting process leads to large changes in the subsequent flue gas flow and SO2 concentration, which makes the SO2 catalytic conversion process unstable. Therefore, many sensors need to be placed throughout the production process, and the variable data comes from these sensors. During production, pipeline pressure and corrosive gas often damage the sensor probes, resulting in inaccurate data acquisition. The collected data is abnormal whether a sensor fails momentarily or for a period of time. In addition, external disturbances and parameter disturbances also make the production process abnormal. As a result, intermittent or continuous abnormal data may appear in the data. Abnormal data often leads to inaccurate modeling, which in turn leads to unstable control and prevents an ideal control effect from being obtained.

3. Methodology

3.1. Abnormal data eliminating algorithm

3.1.1. Anomaly data identification

To identify abnormal data quickly and accurately, this paper establishes an abnormal data identification model based on isolation forest (IF). Isolation forest is an anomaly detection method that separates outliers by specified rules and judges them according to the number of divisions needed. It is an ensemble decision tree learning method with linear time complexity and high accuracy. IF is suitable for anomaly detection on continuous data. It first divides the data randomly and then quickly identifies abnormal data according to the difference between abnormal and normal data. IF does not need to build a model of normal data. It finds abnormal data explicitly, places no requirements on the data distribution, and has low computational complexity and low memory requirements [21].

Abnormal data usually has two characteristics: it is small in quantity and differs greatly from normal data. The IF algorithm isolates abnormal data according to these characteristics and then removes the outliers. In the process of acid production with flue gas, data is collected by sensors at various sites. Most of the data is obtained during normal operation, and only a small amount is inconsistent with the pattern of the rest. At the same time, the data of the variables are correlated. The IF algorithm does not need to calculate distance or density to find abnormal data, and it can find the abnormal data in the process explicitly without modeling the normal data. In the binary tree structure that it constructs, abnormal data lies closer to the root, while normal data lies deeper. Fig. 2 shows that the IF model consists of n isolation trees. The details of the algorithm [22] are as follows.

Definition: isolation tree. Let T be a node of an isolation tree. T is either a leaf node with no children or an internal node with exactly two child nodes (Tl, Tr). Each split consists of an attribute q and a split value p, and the data points are divided into Tl and Tr according to the test q < p. The IF algorithm divides the sample points by repeated binary splits until each sample point is isolated or only very few sample points remain in the same region. Normal data lies in high-density regions and needs many splits to be isolated. Abnormal data lies in low-density regions and needs only a few splits [11].

Algorithm 1: iForest(X, n, ψ)
Inputs: X: input data, n: number of trees, ψ: subsampling size
Output: a set of n iTrees
1: initialize Forest
2: set height limit l = ceiling(log2 ψ)
3: for i = 1 to n do
4:   X′ ← sample(X, ψ)
5:   Forest ← Forest ∪ iTree(X′, 0, l)
6: end for
7: return Forest

Fig. 2. Isolation Forest model. In this model, the sample data is recursively divided from the root nodes to the leaf nodes until the stop condition is met (see details in Section 3.1.1). Under this random dividing strategy, outliers usually have short paths and can be isolated easily.

Algorithm 2: iTree(X, e, l)
Inputs: X: input data, e: current tree height, l: height limit
Output: an iTree
1: if e ≥ l or |X| ≤ 1 then
2:   return exNode{Size ← |X|}
3: else
4:   let Q be a list of attributes in X
5:   randomly select an attribute q ∈ Q
6:   randomly select a split point p from max and min values of attribute q in X
7:   Xl ← filter(X, q < p)
8:   Xr ← filter(X, q ≥ p)
9:   return inNode{Left ← iTree(Xl, e + 1, l), Right ← iTree(Xr, e + 1, l), SplitAtt ← q, SplitValue ← p}
10: end if

3.1.2. Abnormal data eliminating

After the data is processed by the IF model, high-density regions and low-density regions are formed. The density regions of the data are represented by outlier scores, and the data with high scores are eliminated. Details of the PathLength function are as follows.

Definition: path length. The path length h(x) of a sample point x is the number of edges traversed by x from the root node to the leaf node of the isolation tree.

Any anomaly detection method requires an anomaly score [22]. The anomaly score of sample points in isolation forests is calculated as follows.

Algorithm 3: PathLength(x, T, e)
Inputs: x: an instance, T: an iTree, e: current path length; to be initialized to zero when first used
Output: path length of x
1: if T is an external node then
2:   return e + C(T.size) {C(·) is defined in Eq. (2)}
3: end if
4: a ← T.splitAtt
5: if x_a < T.splitValue then
6:   return PathLength(x, T.left, e + 1)
7: else {x_a ≥ T.splitValue}
8:   return PathLength(x, T.right, e + 1)
9: end if

The path length of $x_{ij}$ output by the IF model is $h_{ij}$, where $x_{ij}$ is an element of the training data set. The outlier score of $x_{ij}$ is calculated from $h_{ij}$ as follows.

$$S(h_{ij}, u) = 2^{-\frac{E(h_{ij})}{C(u)}} \tag{1}$$

In Eq. (1):

$$C(u) = 2H(u-1) - \frac{2(u-1)}{u} \tag{2}$$

where u is the number of samples and C(u) is the average path length of all data in the training set. H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (Euler's constant). $E(h_{ij})$ is the average path length of data $x_{ij}$ in the n isolation trees.

According to the anomaly score S, each data point is assessed as follows, and outliers are removed from the data set accordingly.

(1) If the value of $S(h_{ij}, u)$ approaches 0.5, it is not clear whether the data is an outlier.

(2) If the value of $S(h_{ij}, u)$ approaches 0, the data is judged to be normal.

(3) If the value of $S(h_{ij}, u)$ approaches 1, the data is judged to be an outlier.
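To show how Algorithms 1 to 3 and the score of Eqs. (1) and (2) fit together, the following is a minimal pure-Python sketch rather than the implementation used in the experiments. The dictionary-based node layout, the early stop when the chosen attribute is constant, the use of the subsampling size ψ as the argument of C(·), and the toy data are all assumptions made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(u):
    """Average path length C(u) as in Eq. (2); taken as 0 when u <= 1."""
    if u <= 1:
        return 0.0
    harmonic = math.log(u - 1) + EULER_GAMMA        # estimate of H(u - 1)
    return 2.0 * harmonic - 2.0 * (u - 1) / u

def build_itree(rows, height, limit, rng):
    """Algorithm 2: split rows on a random attribute q and split value p."""
    if height >= limit or len(rows) <= 1:
        return {"size": len(rows)}                  # external node
    q = rng.randrange(len(rows[0]))
    lo = min(r[q] for r in rows)
    hi = max(r[q] for r in rows)
    if lo == hi:                                    # assumption: stop on a constant attribute
        return {"size": len(rows)}
    p = rng.uniform(lo, hi)
    left = [r for r in rows if r[q] < p]
    right = [r for r in rows if r[q] >= p]
    return {"att": q, "val": p,
            "left": build_itree(left, height + 1, limit, rng),
            "right": build_itree(right, height + 1, limit, rng)}

def path_length(x, node, e=0):
    """Algorithm 3: edges traversed by x, plus C(size) at the external node."""
    if "size" in node:
        return e + c_factor(node["size"])
    child = node["left"] if x[node["att"]] < node["val"] else node["right"]
    return path_length(x, child, e + 1)

def build_iforest(data, n_trees=100, psi=64, seed=0):
    """Algorithm 1: n_trees iTrees, each grown on a subsample of size psi."""
    rng = random.Random(seed)
    psi = min(psi, len(data))
    limit = math.ceil(math.log2(psi))
    forest = [build_itree(rng.sample(data, psi), 0, limit, rng)
              for _ in range(n_trees)]
    return forest, psi

def anomaly_score(x, forest, psi):
    """Eq. (1): S = 2 ** (-E(h(x)) / C(psi))."""
    mean_h = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_h / c_factor(psi))

# Toy data (assumed): a tight two-dimensional cluster plus one distant point.
data_rng = random.Random(1)
data = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
data.append([8.0, 8.0])
forest, psi = build_iforest(data)
score_outlier = anomaly_score([8.0, 8.0], forest, psi)
score_inlier = anomaly_score([0.0, 0.0], forest, psi)
print(score_outlier, score_inlier)
```

The distant point is separated after only a few random splits, so its average path length is short and its score is higher than that of a point in the middle of the cluster, which matches rules (1) to (3) above.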

3.2. Missing data compensation algorithm

3.2.1. Random forest regression

To compensate for the data that is missing after outliers are eliminated, this paper proposes an improved random forest regression algorithm. Random forest (RF) regression is an ensemble learning method built from regression trees. It is resistant to over-fitting, robust to noise, and fast to train [23–25].

Fig. 3 shows the flow of the RF regression model. The RF regression model is a combined model composed of a regression trees, where a denotes the number of trees, as shown in Fig. 4. The value predicted by a tree is the average of the target variable over the samples in the leaf node. The specific steps are as follows:

Step 1: A subsample matrix of the same size as the training matrix T is drawn from T (by sampling with replacement) and used as the training sample of a regression tree.

Step 2: If the feature dimension of each sample is M, a constant m ≪ M is specified. Each time the regression tree splits, m features are randomly selected from the M features, and the optimal one among these m features is used for the split.

Step 3: Each tree grows as far as it can, without pruning, and stops growing only when it reaches its height limit.

Fig. 3. The flow of random forest algorithm (In this algorithm, multiple random trees are trained independently to generate multiple models. Multi-model prediction results are generated at last. This method outperforms the prediction results of a single model).

Step 4: Repeat the above steps until all a regression trees have been constructed and trained.

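The prediction of a regression tree is the mean value of the samples in the leaf node reached after partitioning. The output of the random forest regression model is the average of the predictions of the a regression trees:

$$\hat{y}(x) = \frac{1}{a} \sum_{i=1}^{a} \hat{y}_i(x)$$

where $\hat{y}_i(x)$ is the prediction of the i-th regression tree for input x.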
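Steps 1 to 4 and the averaging of tree outputs can be sketched with scikit-learn's `DecisionTreeRegressor` as the base learner. This is an illustrative sketch under assumptions: the helper names `fit_forest` and `predict_forest`, the choice m = 2, and the synthetic regression problem are not from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=50, m_features=2, seed=0):
    """Grow unpruned trees on bootstrap samples, trying m random features per split."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):                        # Step 4: repeat for every tree
        idx = rng.integers(0, n, size=n)            # Step 1: bootstrap sample of size n
        tree = DecisionTreeRegressor(
            max_features=m_features,                # Step 2: m candidate features per split
            max_depth=None,                         # Step 3: grow fully, no pruning
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Forest output: the mean of the individual tree predictions."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

# Synthetic regression problem (assumed for illustration).
data_rng = np.random.default_rng(1)
X = data_rng.uniform(-1, 1, size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + 0.05 * data_rng.normal(size=300)
trees = fit_forest(X[:250], y[:250])
pred = predict_forest(trees, X[250:])
rmse = float(np.sqrt(np.mean((pred - y[250:]) ** 2)))
print(rmse)
```

In practice `sklearn.ensemble.RandomForestRegressor` wraps the same idea; the explicit loop is shown only to make the correspondence with the steps visible.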

The RF regression algorithm can compensate for missing data in several variables within the same procedure. Following the idea of ensemble learning, RF learns the relationship between the labels from the feature matrix. If the data of a single feature is missing, the missing values are predicted and compensated according to this relationship. If multiple features are missing at the same time, all features are traversed and the feature with the fewest missing values is compensated first. While one feature is being filled, the missing values of the other features are replaced with zero. After each compensation, the predicted values are put back into the original feature matrix, and then the next feature is compensated. After several rounds, no missing values remain and the compensation of the data set is complete. However, when the data of all variables is missing at the same moment, the mean of the preceding and following moments can first be used to fill in 1 or 2 variables, and then the RF regression algorithm can be used to compensate for the remaining missing data. This ensures maximum data integrity and yields a valuable data set.
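The compensation order described above (fewest gaps first, other gaps temporarily set to zero, predictions written back before the next feature) can be sketched as follows. The function `rf_impute`, the use of NaN as the missing-value marker, and the synthetic three-variable data are assumptions for illustration; the case where every variable is missing at the same moment is not handled here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(data, n_estimators=50, seed=0):
    """Fill NaN entries column by column, starting with the column that has the fewest gaps."""
    filled = np.array(data, dtype=float)
    missing = np.isnan(filled)
    order = np.argsort(missing.sum(axis=0))            # fewest missing values first
    for col in order:
        rows = missing[:, col]
        if not rows.any():
            continue
        others = np.delete(filled, col, axis=1)
        others = np.nan_to_num(others, nan=0.0)        # gaps in the other features become 0
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        model.fit(others[~rows], filled[~rows, col])   # train where this column is observed
        filled[rows, col] = model.predict(others[rows])  # write predictions back
    return filled

# Synthetic correlated variables (assumed) with a few gaps.
rng = np.random.default_rng(0)
t = np.linspace(0, 6, 300)
full = np.column_stack([np.sin(t), np.cos(t), np.sin(t) + np.cos(t)])
full += 0.01 * rng.normal(size=full.shape)
gappy = full.copy()
gappy[20:25, 0] = np.nan       # a continuous gap in the first variable
gappy[100, 1] = np.nan         # an isolated gap in the second variable
gappy[200:203, 2] = np.nan     # a short gap in the third variable
result = rf_impute(gappy)
max_err = float(np.max(np.abs(result - full)))
print(max_err)
```

Because predictions are written back into `filled` before the next column is processed, later columns are trained on already compensated values, which mirrors the round-by-round procedure in the text.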

An important advantage of random forest is that there is no need to cross-validate the algorithm or use an independent test set to obtain an unbiased estimate of the error. The error can be estimated internally, during model generation. The convergence of the random forest model also ensures that over-fitting does not occur as trees are added.
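The internal error estimate mentioned above is usually obtained from the out-of-bag (OOB) samples, that is, the training points left out of each tree's bootstrap sample. A minimal sketch with scikit-learn's `oob_score` option follows; the synthetic data and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.normal(size=400)

# oob_score=True evaluates each sample using only trees that did not see it.
model = RandomForestRegressor(n_estimators=200, oob_score=True,
                              bootstrap=True, random_state=0)
model.fit(X, y)
oob_r2 = model.oob_score_      # R^2 computed from out-of-bag predictions
print(round(oob_r2, 3))
```

No separate validation split is needed for this estimate, although a held-out test set remains a useful independent check.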

The random forest model can be described as the following classifier:

$$\{h(X, \theta_k),\ k = 1, 2, \ldots\}$$

The input is the prediction vector X and the output is Y. Writing $h_k(X) = h(X, \theta_k)$, the margin function is defined as follows.

$$mg(X, Y) = av_k\, I(h_k(X) = Y) - \max_{j \neq Y} av_k\, I(h_k(X) = j)$$

where I(·) is the indicator function and $av_k$ denotes the average over the trees. The margin function represents the difference between the average number of votes for classifying input vector X into the correct category Y and the maximum average number of votes for any wrong category. The higher the margin, the more reliable the classification, and the stronger the generalization and prediction ability of the model.

The generalization error of the combined classifier is defined as follows.

$$PE^* = P_{X,Y}(mg(X, Y) < 0)$$

In Ref. [26], Breiman proves by the law of large numbers that the generalization error PE* satisfies a convergence theorem as the number of decision trees increases. When the number of decision trees in the random forest model tends to positive infinity, the generalization error of the model satisfies the following convergence relation.

$$PE^* \xrightarrow{k \to \infty} P_{X,Y}\left(P_\theta(h(X, \theta) = Y) - \max_{j \neq Y} P_\theta(h(X, \theta) = j) < 0\right) \tag{6}$$

Fig. 4. Random Forest regression model. In this model, a random trees are trained and integrated by the idea of ensemble learning. Random Forest integrates the regression results of all random trees and solves a single prediction problem by combining models.

Fig. 5. Genetic algorithm flow (Based on the theory of evolution, genetic algorithm generates the next generation of solutions through chromosome selection, crossover, mutation, and other operations. Through generation after generation of natural selection, the optimal solution is gradually obtained).

Fig. 6. Industrial equipment photos: (a) converter; (b) desulphurization tower. These are equipment photos taken after our field visit to the cooperative enterprise and data collection.

Table 1 Production data of acid production with flue gas

where k is the number of decision trees and θ is a random vector following a probability distribution. Eq. (6) shows that as the number of decision trees increases without bound, the generalization error of the random forest converges and therefore has an upper bound. In other words, when the number of decision trees in the random forest model is large enough, the generalization error of the model approaches a fixed value. Therefore, the random forest can also achieve a good prediction effect on new sample sets, and adding more trees does not cause over-fitting.

3.2.2. Improved random forest regression

In the traditional random forest regression algorithm, parameters are set from manual experience or left at default values. To compensate for the missing data better, the RF regression algorithm is improved. A genetic algorithm (GA) is used to optimize the important hyperparameters of the RF algorithm and find the best combination of parameters in the search space, which yields an improved random forest (IRF) algorithm with stronger adaptability. In the RF algorithm, the parameter values directly affect prediction accuracy, so finding the optimal combination of the important parameters improves the prediction accuracy of the model.

GA was first proposed by John Holland in the United States. It simulates natural selection in Darwin's theory of biological evolution and the evolutionary process in genetics [27], and searches for an optimal solution by simulating the natural evolution process [28]. GA starts from a set of strings encoding candidate solutions to the problem rather than from a single solution [29]. This helps it avoid falling into local extrema and is beneficial to global optimization [30].

GA is an evolutionary algorithm that seeks the optimal solution by imitating the mechanisms of selection and heredity in nature. In the long evolutionary process, lower organisms evolved into higher organisms through replication, hybridization, mutation, competition, selection, etc. This process is the theoretical foundation of GA. Numerical methods find the optimal solution by iterative operation. Their solution easily falls into a local extreme point, which causes an "infinite loop" in which the iteration can make no further progress. GA can overcome this shortcoming. It is a global optimization algorithm with good global search ability and can explore the solution space quickly. At the same time, GA searches from a population, so it has inherent parallelism and can be computed in a distributed manner. Therefore, GA solves quickly and is robust. For these reasons, GA is used to improve the traditional RF algorithm by optimizing the hyperparameters of RF, which makes it possible to find feasible solutions in the solution space quickly. Fig. 5 shows the flow of the genetic algorithm. The specific steps are as follows.

Fig. 7. Variable data curves: (a) fan outlet oxygen concentration; (b) fan outlet pressure; (c) fan inlet flue-gas temperature; (d) power wave scrubber inlet gas pressure. Four variables are selected to do the data cleaning experiment. The variation trend is not stable, indicating that the data contains large noise and the data quality is very low.

Table 2 Outlier elimination results

Step 1: Determine the value range of the fitness function, the number of iterations, the mutation rate, the chromosome coding length, and other parameters.

Step 2: Initialize the population. The individuals are encoded and the first-generation population is randomly generated.

Step 3: The $R^2$ of the RF regression model is used as the fitness function, and the fitness of each individual in the population is calculated. Selection, crossover, mutation, and other operations are then carried out to obtain the next-generation population.

Step 4: Check the termination condition, that is, whether the number of iterations has reached the maximum value. If not, return to Step 3 and continue the calculation; otherwise, stop and output the optimal combination of tuning parameters.
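The four steps can be sketched as a small genetic algorithm wrapped around scikit-learn's `RandomForestRegressor`. This is a simplified illustration and not the authors' implementation: the integer encoding, the search bounds, truncation selection, one-point crossover, the tiny population and generation counts, the validation split used for fitness, and the synthetic data are all assumptions.

```python
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Hypothetical inclusive search ranges for the tuned hyperparameters (Step 1).
BOUNDS = {
    "n_estimators": (10, 60),
    "max_depth": (2, 12),
    "min_samples_leaf": (1, 8),
    "min_samples_split": (2, 10),
}
KEYS = list(BOUNDS)

def random_individual(rng):
    """One chromosome: an integer value for each hyperparameter."""
    return [rng.randint(*BOUNDS[k]) for k in KEYS]

def fitness(ind, X_tr, y_tr, X_val, y_val):
    """Step 3: R^2 of an RF regressor built from the decoded individual."""
    model = RandomForestRegressor(random_state=0, **dict(zip(KEYS, ind)))
    model.fit(X_tr, y_tr)
    return r2_score(y_val, model.predict(X_val))

def evolve(X_tr, y_tr, X_val, y_val, pop_size=6, generations=3,
           mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]     # Step 2
    best, best_fit = None, -float("inf")
    for _ in range(generations):                                 # Step 4: iteration budget
        scored = sorted(((fitness(ind, X_tr, y_tr, X_val, y_val), ind)
                         for ind in pop), reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best = scored[0][0], list(scored[0][1])
        parents = [ind for _, ind in scored[: pop_size // 2]]    # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, len(KEYS) - 1)                  # one-point crossover
            child = a[:cut] + b[cut:]
            for g, k in enumerate(KEYS):                         # per-gene mutation
                if rng.random() < mutation_rate:
                    child[g] = rng.randint(*BOUNDS[k])
            children.append(child)
        pop = children
    return dict(zip(KEYS, best)), best_fit

# Synthetic data (assumed): the first 180 rows train, the last 60 score fitness.
data_rng = np.random.default_rng(0)
X = data_rng.uniform(-1, 1, size=(240, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.05 * data_rng.normal(size=240)
best_params, best_r2 = evolve(X[:180], y[:180], X[180:], y[180:])
print(best_params, round(best_r2, 3))
```

A realistic run would use a larger population and more generations; the small values here only keep the sketch quick to execute.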

There are 17 parameters affecting the performance of RF algorithm. Considering the running time and efficiency of the algorithm, this paper optimizes four parameters with the greatest influence.

n_estimators: This is the number of regression trees included in the RF model. The value of this variable has a great influence on the random forest regression algorithm. If the value is too small, the model under-fits and its prediction results are poor. As the value increases, the accuracy of the algorithm tends to improve, but the calculation time grows and efficiency falls.

max_depth: This is the maximum depth of the regression tree. By default, the random forest does not limit this value, and the default setting is None. The default works well when there are few features or the amount of data is small. With a large amount of data and many features, the value of this variable needs to be adjusted to suit the model.

Fig. 8. Missing data distribution of different variables: (a1) fan outlet oxygen concentration data before eliminating abnormal data; (a2) fan outlet oxygen concentration data after eliminating abnormal data; (b1) fan outlet pressure data before eliminating abnormal data; (b2) fan outlet pressure data after eliminating abnormal data; (c1) fan inlet flue-gas temperature data before eliminating abnormal data; (c2) fan inlet flue-gas temperature data after eliminating abnormal data; (d1) power wave scrubber inlet gas pressure data before eliminating abnormal data; (d2) power wave scrubber inlet gas pressure data after eliminating abnormal data. IF algorithm is used to identify and eliminate outliers from the original data, and the missing values are replaced by zero. It is found that IF algorithm can effectively identify the outliers according to the data change trend before and after eliminating the outliers.

Table 3 Genetic algorithm optimization results

min_samples_leaf: This is the minimum number of samples required at a leaf node, and it is an important basis for deciding whether leaf nodes should be pruned. When a leaf node would contain fewer samples than this value, the leaf node is pruned. The default value of min_samples_leaf is 1, which is generally suitable only when the sample size is small. If the amount of sample data is large, the value of min_samples_leaf needs to be adjusted.

min_samples_split: This is the minimum number of samples required to split an internal node, and it is the main basis for deciding whether the regression tree should continue to split. When a node contains fewer samples than min_samples_split, the node is not split further. The default value of min_samples_split is 2, which is generally suitable only for small sample sizes. If the amount of sample data is large, the value of min_samples_split needs to be adjusted.

4. Experiments and Analysis

The data comes from a real factory operation process. Fig. 6 shows photos of the factory equipment, and Table 1 shows part of the operating data. This paper selects the data of four variables as examples. First, the curves of the original data are obtained and the changes of the four variables are analyzed, as shown in Fig. 7.

Fig. 7 shows the variation trend of four variables: fan outlet oxygen concentration, fan outlet pressure, fan inlet flue gas temperature, and power wave scrubber inlet flue gas pressure. These 4 variables are selected for the data cleaning experiment. Of 500 groups of data, the last 100 groups form the test set and the rest form the training set. The variation trend is not stable, indicating that the data contains large noise and the data quality is very low. Therefore, there is abnormal data in the original data set.

Fig. 9. Variables prediction results of IRF, SVM and RF: (a) fan outlet oxygen concentration; (b) fan outlet pressure; (c) fan inlet flue-gas temperature; (d) power wave scrubber inlet gas pressure. After comparison, it is found that the prediction results of IRF are better than the results of SVM and RF. The prediction result of IRF compensation value fits the trend of actual data best.

Because abnormal data is present, outliers must be identified and eliminated. To prove the effectiveness of the IF algorithm in dealing with outliers, variance (Var) is used as a performance indicator to analyze the experimental results, which are compared with those of the quartile method and DBSCAN. The calculation of Var is shown in Eq. (7). However, different working conditions lead to differences between the data, so variance cannot represent the quality of the dataset and can only be used to analyze the outlier processing. Table 2 shows the variance of the original data and of the data after outlier processing. All methods reduce the variance of the dataset, but the IF algorithm gives a smaller variance than the other methods. The dataset processed by the IF algorithm is therefore less dispersed, which makes IF superior to the other methods.

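$$\mathrm{Var} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{7}$$

where $x_i$ is the i-th data point in the dataset, $\bar{x}$ is the average value of the dataset, and n is the number of data points in the dataset.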

The IF algorithm is used to identify and eliminate abnormal data, forming a data set that contains missing values, which are replaced by zero. The result is shown in Fig. 8, which gives the data distribution of the four variables after outliers are eliminated. The fan outlet oxygen concentration data contains 2 groups of intermittent missing values and 5 groups of continuous missing values. The fan outlet pressure data contains 5 groups of intermittent missing values. The fan inlet flue gas temperature data contains 7 groups of intermittent missing values. The power wave scrubber inlet gas pressure data contains 7 groups of intermittent missing values. In addition, the fan outlet oxygen concentration and the fan outlet pressure have missing data at the same moment.
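The elimination step and the variance comparison can be sketched with scikit-learn's `IsolationForest` on a single synthetic variable. The signal, the injected errors, the `contamination` value, and the use of NaN rather than zero to mark removed points are assumptions for illustration and do not reproduce the plant data or the values in Table 2.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for one process variable, with injected gross errors.
rng = np.random.default_rng(0)
signal = rng.normal(loc=10.0, scale=1.0, size=500)
bad_idx = rng.choice(500, size=15, replace=False)
signal[bad_idx] = rng.uniform(30.0, 60.0, size=15)

# contamination is the assumed share of outliers; it sets the score threshold.
detector = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
labels = detector.fit_predict(signal.reshape(-1, 1))   # -1 = outlier, 1 = normal

cleaned = signal.copy()
cleaned[labels == -1] = np.nan                          # removed points become gaps

var_before = float(np.var(signal))
var_after = float(np.nanvar(cleaned))                   # variance of the retained points
print(var_before, var_after)
```

Marking removed points as NaN rather than zero keeps them out of the variance calculation and leaves explicit gaps for the compensation step that follows.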

If data are missing at only a single time point, directly reusing the value from the previous time point is not advisable, because doing so loses some data integrity. To save computing resources, the gap can instead be filled with the mean of the values immediately before and after that time point. The IF algorithm can then be applied again to check whether the compensated data contain outliers.
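A minimal sketch of this mean value compensation is given below. It assumes missing samples are marked with NaN (the experiments above mark them with zero), and the function name `fill_isolated_gaps` is hypothetical.

```python
import math


def fill_isolated_gaps(values):
    """Fill a single missing sample (NaN) with the mean of its two neighbours.

    Runs of two or more consecutive NaNs, and NaNs at either end of the
    series, are left untouched so that a model-based method can handle them.
    """
    filled = list(values)
    for i in range(1, len(values) - 1):
        prev_v, cur_v, next_v = values[i - 1], values[i], values[i + 1]
        if math.isnan(cur_v) and not math.isnan(prev_v) and not math.isnan(next_v):
            filled[i] = (prev_v + next_v) / 2.0
    return filled


nan = float("nan")
series = [1.0, nan, 3.0, nan, nan, 6.0, 7.0]
result = fill_isolated_gaps(series)
print(result)  # the isolated gap at index 1 is filled; the two-sample gap is kept
```

The neighbours are read from the original series, so a filled value is never used to fill another gap.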

Fig. 10. Prediction errors of IRF, SVM and RF for four variables: (a) fan outlet oxygen concentration; (b) fan outlet pressure; (c) fan inlet flue gas temperature; (d) power wave scrubber inlet gas pressure. The prediction errors of the IRF compensation values are smaller than those of SVM and RF, whose errors fluctuate greatly over a wide range.

Table 4 Comparison of experimental results.

After the abnormal data are eliminated, the IRF algorithm is used to compensate for the missing data. GA is used to jointly optimize the RF parameters n_estimators, max_depth, min_samples_leaf and min_samples_split, with the coefficient of determination (R²) as the return value of the objective function. In the IRF model, as the number of GA iterations increases, the R² of the best solution on the test set reaches the desired level. The optimal solution in the search space then gives the optimal parameter combination, whose R² is shown in Table 3. As Table 3 shows, the R² of the parameter combination optimized by GA is greater than that obtained with the default parameters of the traditional random forest, so the model accuracy of IRF is higher than that of RF.
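The joint optimization can be sketched with a small genetic algorithm wrapped around scikit-learn's `RandomForestRegressor`. Everything beyond the four parameter names and the R² objective is an assumption made for illustration: the search ranges in `BOUNDS`, the population size, the uniform crossover, the mutation rate, the elitist selection and the synthetic data are not taken from the paper.

```python
import random

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative search ranges (assumed, not the paper's settings).
BOUNDS = {
    "n_estimators": (10, 60),
    "max_depth": (2, 12),
    "min_samples_leaf": (1, 8),
    "min_samples_split": (2, 10),
}
KEYS = list(BOUNDS)


def random_individual(rng):
    return {k: rng.randint(*BOUNDS[k]) for k in KEYS}


def fitness(ind, data):
    """Objective function: R2 on the held-out test split."""
    X_tr, X_te, y_tr, y_te = data
    model = RandomForestRegressor(random_state=0, **ind).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


def crossover(a, b, rng):
    # Uniform crossover: each gene comes from either parent with equal probability.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in KEYS}


def mutate(ind, rng, rate=0.2):
    # Each gene is resampled within its bounds with probability `rate`.
    return {k: (rng.randint(*BOUNDS[k]) if rng.random() < rate else v)
            for k, v in ind.items()}


def ga_search(data, pop_size=8, generations=4, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    best, best_fit, history = None, float("-inf"), []
    for _ in range(generations):
        scored = sorted(((fitness(ind, data), ind) for ind in population),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best = scored[0]
        history.append(best_fit)
        # Keep the better half unchanged (elitism) and breed the rest.
        parents = [ind for _, ind in scored[: pop_size // 2]]
        children = [mutate(crossover(*rng.sample(parents, 2), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return best, best_fit, history


# Synthetic regression data standing in for the process variables.
rng_np = np.random.default_rng(0)
X = rng_np.uniform(-1, 1, size=(300, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng_np.normal(size=300)
data = train_test_split(X, y, test_size=0.3, random_state=0)

best_params, best_r2, history = ga_search(data)
print(best_params, round(best_r2, 3))
```

Because the better half of each generation is carried over unchanged, the best R² recorded in `history` never decreases from one generation to the next.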

To demonstrate the effectiveness of the proposed algorithm and evaluate it objectively, we carry out data cleaning experiments. In these experiments, we compensate for the data of fan outlet oxygen concentration, fan outlet pressure, fan inlet flue gas temperature and power wave scrubber inlet flue gas pressure. The accuracy of the data cleaning results is analyzed using root mean square error (RMSE), mean absolute error (MAE) and coefficient of determination (R²). The formulas are as follows.

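$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2}$$

where $y_i$ is the real value, $f(x_i)$ is the predicted value, and $n$ is the number of samples. The root mean square error reflects the accuracy of the model prediction.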

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\bigl|y_i - f(x_i)\bigr|$$

where $y_i$ is the real value, $f(x_i)$ is the predicted value, and $n$ is the number of samples. The mean absolute error is often used to judge the error of a regression model.

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}$$

where $y_i$ is the real value, $f(x_i)$ is the predicted value, $n$ is the number of samples, and $\bar{y}$ is the mean of the real values. The coefficient of determination reflects the proportion of the variation in the dependent variable that is explained by the estimated regression equation. The closer R² is to 1, the higher the accuracy of the model.
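The three indexes can be computed directly from their standard definitions, as in the sketch below. The R² here uses the common form 1 − SS_res/SS_tot, which is assumed to match the paper's definition; the sample values are made up solely to exercise the functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(y - f) for y, f in zip(y_true, y_pred)) / n


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed form)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values: three exact predictions and one that is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print(rmse(y_true, y_pred))  # 0.5
print(mae(y_true, y_pred))   # 0.25
print(round(r2(y_true, y_pred), 6))  # 0.8
```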

To demonstrate the ability of IRF to compensate for missing data, its results are compared with those of SVM and RF. Fig. 9 shows the compensation values predicted by IRF, SVM and RF, and Fig. 10 shows the corresponding prediction errors. As shown in Fig. 9, for the fan outlet oxygen concentration data, the RF compensation values fit the trend of the actual data well in the initial stage but fluctuate greatly in the later period, whereas the IRF predictions remain better; both RF and IRF fit better than SVM. For fan outlet pressure, fan inlet flue gas temperature and power wave scrubber inlet flue gas pressure, the IRF fit is better than that of SVM and RF. In particular, for the power wave scrubber inlet flue gas pressure, the IRF compensation values follow the trend of the actual data quite well. As seen in Fig. 10, the prediction errors of the IRF compensation values are all smaller than those of SVM and RF, whose errors fluctuate greatly over a wide range. As a result, IRF offers better stability and accuracy in compensating for missing data.

Compared with RF and SVM, IRF has smaller error fluctuation and a better fitting trend. The RMSE, MAE and R² of the three data cleaning algorithms are listed in Table 4. For all four variables, the RMSE and MAE of IRF are smaller. For example, in the data compensation experiment for fan outlet pressure, the RMSE of IRF is 46.6% lower than that of RF and 32.5% lower than that of SVM; the MAE of IRF is 33.8% lower than that of RF and 35.6% lower than that of SVM; and the R² of IRF is 8.2% higher than that of RF and 8.3% higher than that of SVM. The average compensation effect is therefore better and the prediction accuracy of the model is higher. The R² of IRF is also larger than that of RF and SVM, which indicates that the IRF model has higher accuracy and a better fitting trend. In conclusion, when compensating for missing data during data cleaning, IRF is more accurate than RF and SVM.

The experiments above show that processing the data with the IF model reduces the variance of the dataset, indicating that the model can effectively identify and eliminate abnormal data. The IRF model is then used to compensate for the missing data. Its compensation prediction error fluctuates less, its accuracy is higher, and its fitting trend is better, so the data can be compensated effectively.
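The two-stage procedure summarized above (eliminate outliers, then compensate for the resulting gaps) can be sketched end to end as follows. This is only an illustration under stated assumptions: the data are synthetic, the hypothetical variable `driver` stands in for whatever correlated process variables serve as model inputs, gaps are marked with NaN, and a default `RandomForestRegressor` is used in place of the GA-tuned IRF model.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor

rng = np.random.default_rng(1)
n = 400

# Synthetic pair of correlated variables; `driver` is a hypothetical input variable.
driver = rng.uniform(0.0, 10.0, n)
truth = 2.0 * driver + rng.normal(0.0, 0.3, n)

# Corrupt a few samples of the target variable with large spikes.
target = truth.copy()
target[rng.choice(n, size=8, replace=False)] += 30.0

# Stage 1: flag outliers in the target and turn them into missing values.
flags = IsolationForest(contamination=0.02, random_state=0).fit_predict(
    target.reshape(-1, 1))
missing = flags == -1
target_with_gaps = np.where(missing, np.nan, target)

# Stage 2: train a random forest on the rows that survived and predict the gaps.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(driver[~missing].reshape(-1, 1), target[~missing])
compensated = target_with_gaps.copy()
compensated[missing] = model.predict(driver[missing].reshape(-1, 1))

err_before = float(np.mean(np.abs(target - truth)))
err_after = float(np.mean(np.abs(compensated - truth)))
print(f"mean abs error before: {err_before:.3f}, after: {err_after:.3f}")
```

The comparison against `truth` is possible only because the data are synthetic; with plant data, the evaluation would rely on held-out samples and the RMSE, MAE and R² indexes.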

5. Conclusions

To solve the problem of abnormal data in the process of acid production with flue gas, this paper proposes a data cleaning method based on an improved random forest. First, the IF algorithm is used to identify outliers. The outliers are then eliminated to obtain a dataset containing missing values. Finally, the IRF algorithm is used to compensate for the missing data. Actual operation data are used to verify the IRF algorithm, and the following conclusions are obtained.

(1) The IF algorithm can eliminate outliers from the data of the acid production with flue gas process.

(2) GA can find the optimal parameter combination of the RF model in the search space.

(3) When compensating for missing data, the IRF algorithm has a better cleaning effect than the other algorithms, which proves the validity of the algorithm.

By eliminating the outliers in the data of acid production with flue gas and compensating for the resulting missing values, data representing the normal operation process can be obtained. Based on this reliable dataset, a more accurate mathematical model can be established to control the inlet temperature of the converter stably and to further improve the conversion rate of SO₂, increasing the yield of sulfuric acid. This will not only recycle more SO₂ and protect the environment, but also improve economic benefits.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This study is supported by the National Natural Science Foundation of China (61873006) and the Beijing Natural Science Foundation (4204087, 4212040).
