Prediction of methane storage in covalent organic frameworks using big-data-mining approach

2022-01-06 01:42:30 Huan Zhang, Peisong Yang, Duli Yu, Kunfeng Wang, Qingyuan Yang

Chinese Journal of Chemical Engineering, 2021, Issue 11

Huan Zhang, Peisong Yang, Duli Yu, Kunfeng Wang*, Qingyuan Yang*

1 State Key Laboratory of Organic-Inorganic Composites, Beijing University of Chemical Technology, Beijing 100029, China

2 Beijing Advanced Innovation Center for Soft Matter Science and Engineering, Beijing University of Chemical Technology, Beijing 100029, China

3 College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China

Keywords: Covalent organic framework; Monte Carlo simulation; Methane; Machine learning; Model

ABSTRACT A combination of computational materials screening and machine learning (ML) techniques has become a popular approach for studying materials for applications of interest. In this work, we began with high-throughput molecular simulations to calculate the methane storage (6.5 MPa) and deliverable (6.5-0.58 MPa) capacities of 404,460 covalent organic frameworks (COFs) at 298 K. Then, the full data set with 23 features was randomly split into training and test sets in a ratio of 20:80, which were used to evaluate the prediction abilities of several ML algorithms, including gradient boosting decision tree (GBDT), neural network (NN), support vector machine (SVM), random forest (RF) and decision tree (DT). The results indicate that the RF model has the highest prediction accuracy; it was further employed to reduce the dimension of the feature space and to quantitatively analyze the relative importance of each feature. The binary classification predictors built using the features with the highest influence weights successfully identify top-performing candidates from the test set containing 323,168 COFs with an accuracy exceeding 96%. The deliverable capacities of the identified COFs were found to outperform those reported so far for various adsorbents. The findings may provide useful guidance for the design and synthesis of new high-performance materials for methane storage applications.

1. Introduction

As a relatively clean and abundant energy resource, natural gas has received much attention as a bridging fuel in the transition toward a more sustainable energy economy. Driven by the rapid growth of energy demand in the transportation and industrial sectors, the U.S. Energy Information Administration (EIA) released its Annual Energy Outlook 2020, which projects natural gas production and consumption through the year 2050 [1]. For onboard applications of natural gas, a long-standing challenge lies in its much lower volumetric energy density at standard temperature and pressure (STP: 273.15 K, 1.0 atm) compared to gasoline [2,3]. Thus, natural gas, with methane as its main component, must be densified to attain a commercially appealing driving range between refills. Currently, compressed natural gas (CNG) is a commonly adopted approach, but it suffers from safety concerns, expensive infrastructure and high cost due to the extreme operating conditions (~25 MPa at ambient temperature) [4,5], which greatly reduce the economic viability of this method for widespread application.

To promote the popularization of natural-gas-fueled vehicles, an attractive strategy is to use adsorbed natural gas (ANG) systems, which utilize porous adsorbents to provide high-density natural gas at reduced storage pressure. In 2012, the U.S. Department of Energy (DOE) established an ambitious program outlining onboard storage targets: practical adsorbents at 6.5 MPa and 298 K should achieve at least 263 cm³ CH4 (STP)·cm⁻³ adsorbent and 0.5 g CH4·g⁻¹ adsorbent, where the former volumetric value is equivalent to the CNG uptake of an empty tank at 25 MPa [2], much higher than the old DOE target (180 cm³·cm⁻³ at 3.5 MPa) [6]. Note that the volumetric storage target becomes 350 cm³·cm⁻³ if a 25% packing loss is taken into account, and the storage pressure is considered practically relevant because it is easily reached with inexpensive two-stage compressors [7]. Subsequently, the Advanced Research Projects Agency-Energy (ARPA-E) of the U.S. DOE translated the above target into the deliverable capacity (also referred to as the working capacity), which is defined as the difference between the adsorbed amounts of CH4 at the storage (6.5 MPa) and depletion (0.58 MPa) pressures, where the latter is the minimum engine inlet pressure [2,5,8]. Generally speaking, volumetric performance plays a greater role than gravimetric performance for vehicular applications, but a high gravimetric value is also important to avoid massive fuel tanks. Extensive studies have been devoted to traditional adsorbents such as carbon materials [9,10] and zeolites [11,12], but their performances are far below the DOE target. Additionally, increasing efforts are being devoted to synthesizing porous metal-organic frameworks (MOFs) with better CH4 storage properties, as evidenced by some excellent review papers [13-16]. Significant breakthroughs have been made by experimentalists since 2013, with some MOFs found to exhibit very high deliverable capacities, such as Cu-BTC (185 cm³·cm⁻³) [7], Co(bdp) (197 cm³·cm⁻³) [17] and UTSA-110a (190 cm³·cm⁻³) [18]; however, their relatively low porosities limit their gravimetric performances (0.150, 0.186 and 0.226 g·g⁻¹, respectively). To the best of our knowledge, the experimental MOFs reported so far with the highest gravimetric deliverable capacities are NU-1501-Al (0.347 g·g⁻¹) [19] and Al-soc-MOF-1 (0.351 g·g⁻¹) [20], while their volumetric deliverable capacities are relatively low (138 and 167 cm³·cm⁻³, respectively). Thus, a further important step is to rationally balance the volumetric and gravimetric CH4 storage capacities in a single material.

Covalent organic frameworks (COFs) represent one of the most inspiring classes of state-of-the-art nanoporous crystalline materials and have attracted great interest over recent decades [21-24]. Unlike MOFs, covalently linked COF structures are composed entirely of light elements and are generally much more stable [25], showing great promise in a variety of application fields, including gas separation [26-29] and storage [30-33]. To date, the number of COF structures reported experimentally is very limited (~520) compared to MOFs. Since the chemical diversity of monomers for COF synthesis is theoretically very rich, materials genomics-guided high-throughput screening can strongly complement experimentation by rapidly gauging the performance of existing [34-37] and/or hypothetical [8,38,39] materials. In recent years, great progress has been made in the computational discovery of novel high-performance MOFs in conjunction with machine learning (ML) techniques driven by big-data analysis [40-45], while related studies on COFs are scarcely reported in the literature [46].

Motivated by the aforementioned facts, a combination of molecular simulation and ML methods was used to screen the performance of over 472,500 COFs for CH4 storage at ambient temperature. Five ML algorithms, namely gradient boosting decision tree (GBDT), neural network (NN), support vector machine (SVM), random forest (RF) and decision tree (DT), were considered in this work to compare how different models predict the data set. The RF algorithm was then used to extract the key descriptors that dominate COF storage properties, from which a DT model was constructed to develop binary classification predictors for the COF properties required to achieve high CH4 storage capacity under different conditions. The DT model successfully differentiated the low-performance region from the high-performance region in the COF search space. Finally, 10 COFs with the best methane storage performance were identified.

2. Computational Methodologies

2.1. Database of COF structures

The COF structures examined in the current study were taken from two sources. One is the CoRE-COF database, which covers almost all the COF materials synthesized so far. In 2017, we published the first version of this database with 187 structures [47]. By continuously tracking the related literature, here we considered a significantly updated version (called CoRE-COF 2020) that contains 517 experimental structures, which are publicly available at https://mofsgenomics.github.io/CoRE-COF-Database. The other source is a huge database built in our previous study [48], which consists of 471,990 structures computationally assembled using a materials genomics-based method called quasi-reactive assembly algorithms (QReaxAA), on the basis of 130 genetic structural units (GSUs) that are mainly partitioned from existing COF structures. The structural features of all the examined COFs were characterized using the Zeo++ package [49]. For the accessible surface area and void fraction, the spherical radius of the probe molecule was set to 0.184 nm (the kinetic radius of N2) and 0.00 nm, respectively. Structures with a pore limiting diameter (PLD) smaller than the kinetic diameter of the CH4 molecule (0.38 nm) or with zero nitrogen-accessible surface area were excluded, leading to a total of 404,460 COF structures (denoted as hCOFs) in the refined database.

2.2. Molecular simulations

Grand canonical Monte Carlo (GCMC) simulations were performed to assess the CH4 storage performance of the COFs at 298 K and two pressures (6.5 and 0.58 MPa), using our in-house code HT-CADSS. A spherical Lennard-Jones (LJ) model was used to represent the CH4 molecule, with potential parameters taken from the TraPPE force field [50]. The LJ parameters for the COF atoms were taken from the DREIDING force field [51], as shown in Table S1 (Supplementary material). All the LJ cross-interaction parameters were determined by the Lorentz-Berthelot combination rules. Periodic boundary conditions were applied in all three dimensions, and a cutoff distance of 1.40 nm was used to calculate the LJ interactions. Each GCMC simulation consisted of 5 × 10⁶ steps each for system equilibration and property sampling. The zero-coverage heat of adsorption and Henry coefficient of CH4 in each COF were calculated using the revised Widom's test particle method [52]. The validations of the adopted force fields are shown in Fig. S1. Three widely used evaluation metrics were applied to assess materials performance: the total storage capacity (Nads) at 6.5 MPa, the deliverable capacity (ΔN) between 6.5 and 0.58 MPa, and the percent regenerability (R), as defined by:
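In the standard form used in the methane-storage literature, which is assumed here and is consistent with the verbal definitions above (the pressure arguments are notation chosen for this sketch), the latter two quantities read:

```latex
\Delta N = N_{\mathrm{ads}}(6.5\ \mathrm{MPa}) - N_{\mathrm{ads}}(0.58\ \mathrm{MPa}),
\qquad
R = \frac{\Delta N}{N_{\mathrm{ads}}(6.5\ \mathrm{MPa})} \times 100\%
```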

2.3. Machine learning algorithms and features

In order to analyze the CH4 storage performance of COFs from an ML perspective, five ML algorithms, namely GBDT, NN, SVM, RF and DT, were considered in this work. NN is a computational model that emulates information processing in the biological nervous system, in which the network consists of an input layer, one or more hidden layers and an output layer [53]. SVM is a supervised binary classification algorithm that seeks the hyperplane with the largest margin between categories in feature space [54]. DT is a tree-structured classification model in which each internal node denotes a test on a feature, each branch represents an outcome of the test, and each leaf node represents a category [55]. GBDT and RF are extensions of the DT algorithm that improve generalization by combining multiple decision trees. The difference is that the trees making up an RF can be classification trees or regression trees, which are generated in parallel, and majority voting is adopted for the output [56]; GBDT is composed only of regression trees, which can only be generated sequentially, and all results are accumulated, with or without weights, to give the final result [57]. We also tried Gaussian process regression. In a trial with 8000 data points, the modeling process of this algorithm took too long and gave unsatisfactory predictions. Generally speaking, Gaussian process regression is only suitable for regression problems with small samples and loses its validity in a high-dimensional feature space. Since the amount of data and the number of dimensions used in the current work are very large, this regression algorithm was not considered in our analysis.

In order to make the machine learning models more accurate, we used 23 features covering structural variables, chemical variables and specific element contents of the COFs (Table S2, Supplementary material). The structural variables include free volume, density, volumetric surface area, void fraction and so on. The zero-coverage heat of adsorption and the Henry coefficient were introduced as chemical variables. In addition, a text-mining algorithm was used to capture the type and number of atoms in each COF structure, from which the atom number density was taken as a further kind of variable; it was calculated as the ratio of the number of atoms of a specific element to the total number of atoms in each structure.
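The atom number density described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' text-mining code: the function name and the example atom list are hypothetical, and the input is assumed to be a list of element symbols already parsed from a structure file.

```python
from collections import Counter


def atom_number_density(elements):
    """Return {element: fraction of all atoms} for one structure.

    `elements` is a list of element symbols, one entry per atom.
    """
    counts = Counter(elements)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("structure contains no atoms")
    return {element: n / total for element, n in counts.items()}


# Hypothetical atom list, not a structure from the database.
example_atoms = ["C"] * 6 + ["H"] * 4 + ["O"] * 2
fractions = atom_number_density(example_atoms)
```

Each fraction then serves as one feature column (for example, the Si fraction discussed later in the text).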

In the current study, the data set was randomly split into a training set and a test set in a ratio of 20:80. We adopted 10-fold cross-validation to improve the overall performance of the models [58]. The entire training set was randomly divided into 10 equal parts, of which 9 parts were used for training the algorithms and the remaining one for model validation. The whole procedure was repeated 10 times to choose the optimal model. The accuracy of model fitting is described by the R² value and the root-mean-square error (RMSE). RMSE is the square root of the mean squared difference between the predicted values and the true values. This metric is commonly used to statistically evaluate the quality of a prediction model regressed by machine-learning techniques. However, because RMSE depends on the scale of the target quantity, it is difficult to compare the performance of regression models across different targets using RMSE alone, which requires the use of R². R² denotes the coefficient of determination, which compares the error of model fitting with the spread of the original output data around its mean; its value typically ranges from 0 to 1. When R² = 1, the predicted values and true values in the sample are exactly the same, indicating a perfect fit by the model. When R² tends to 0, the predictions of the model are no better than using the mean directly. Therefore, a combination of R² and RMSE can effectively evaluate the accuracy of the model. The expressions of the two quantities are given by:
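In their standard forms, which are assumed here to match the authors' definitions (in particular, Ū is taken to be the mean of the simulated values), the two quantities read:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - U_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(Y_i - U_i\right)^2}{\sum_{i=1}^{N}\left(Y_i - \bar{U}\right)^2}
```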

where N represents the number of COFs, Yi represents the simulated CH4 storage value, and Ui and Ū represent the predicted CH4 storage value and the average CH4 storage value, respectively.
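The data-handling steps and the two metrics can be sketched in plain Python as follows. All function names, the random seed and the interleaved fold assignment are illustrative assumptions rather than the authors' implementation, and the average value in R² is assumed to be the mean of the simulated values.

```python
import math
import random


def split_train_test(n_samples, train_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def cross_validation_folds(train_indices, k=10):
    """Yield (fit_indices, validation_indices) pairs for k-fold validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


def rmse(y_true, y_pred):
    """Root-mean-square error between simulated and predicted values."""
    squared = [(y - u) ** 2 for y, u in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination relative to the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - u) ** 2 for y, u in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


train_idx, test_idx = split_train_test(1000)
cv_pairs = list(cross_validation_folds(train_idx))
```

With this definition, predicting the mean for every sample gives R² = 0, and a model worse than the mean gives a negative value.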

In the classification part, we used the receiver operating characteristic (ROC) curve [59] to evaluate the prediction performance of the DT model. The ROC curve plots the true positive rate (TPR) on the vertical axis against the false positive rate (FPR) on the horizontal axis, and evaluation indexes such as accuracy, precision, recall and F1-score are also obtained. The precision (P) is the proportion of correctly predicted positives among all positive predictions, while the recall (R) is the proportion of correctly predicted positives among all true positives, which is the same as TPR.
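In their standard forms, assumed here to be the ones intended, these indexes are:

```latex
P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad
R = \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}, \qquad
F_1 = \frac{2PR}{P+R}, \qquad
\mathrm{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}
```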

where TP is the number of true positive samples, FP is the number of false positive samples, TN is the number of true negative samples, and FN is the number of false negative samples.

3. Results and Discussion

3.1. Comparison of ML models

As the first step in the analysis, we compared the prediction performance of different ML algorithms to choose the best model. We used 23 features as input parameters to evaluate four algorithms: GBDT, NN, SVM and RF. The DT algorithm was not considered here because the GBDT and RF algorithms are both upgraded versions of it: DT uses only one decision tree, while GBDT and RF use multiple trees, which gives the latter two algorithms stronger fitting and generalization capability. Fig. 1 compares the results predicted by the four ML models with the GCMC-simulated values on the test set for the storage capacity at 0.58 MPa and 6.5 MPa and for the 6.5-0.58 MPa deliverable capacity. Note that the closer the predicted data are to the diagonal, the higher the predictive ability of the model. Among all the prediction algorithms evaluated, GBDT and RF consistently show the best results. At 0.58 MPa, R² reaches the optimal value of 0.99, with RMSE values of 1.380 and 1.243, respectively. This indicates that the two models can accurately predict the absolute volumetric storage capacity of CH4 in the test set, which highlights their strong generalization ability. However, there is a large deviation between the predicted and actual values for the NN and SVM models. SVM has the worst predictive ability, with an R² of only 0.442 at 0.58 MPa, and NN achieves only 0.762. These observations can be explained as follows. Both RF and GBDT use an ensemble of learners based on decision trees, which are further coupled with random attribute selection and data set selection. In contrast, to reduce the model complexity, the NN used in this work is a shallow network, whereas neural networks are better suited to more complex problems such as image and language recognition. SVM can give good results when the amount of data is small but is not applicable to a large amount of data. These differences between the algorithms result in a higher accuracy of RF and GBDT than of NN and SVM.

Fig. 1. Comparison of the prediction results from four ML models with GCMC-simulated volumetric CH4 storage capacities of the COFs using 23 features as input for the test set at 298 K. (a) GBDT; (b) NN; (c) SVM; (d) RF. The left column is for 0.58 MPa, the middle column is for 6.5 MPa, and the right column is for deliverable capacity. The red diagonal lines indicate a complete coincidence of ML predictions and GCMC simulations.

It is worth noting that the predictive performance of the ML models at low pressure (0.58 MPa) is less prominent than that at high pressure (6.5 MPa). It can be inferred that the relationship between the features and the storage capacity is more significant under high pressure. The increase of pressure enhances the prediction ability of the models, and all four models produce better R² and RMSE values, with the relative error decreasing. In particular, the SVM model shows the largest enhancement in prediction ability, with R² rising from 0.442 to 0.664; meanwhile, GBDT and RF give the best prediction ability, with R² values at 6.5 MPa exceeding 0.98. Thus, compared with the other algorithms, GBDT and RF adapt well to the data sets. The predictive performance of the four ML models for the volumetric CH4 storage capacity is summarized in Table 1.

Table 1 Evaluation of the predictive performance of four different models for volumetric CH4 storage capacity at 298 K

3.2. The importance of features by the RF model

As can be seen from the above results, both the GBDT and RF models have strong learning ability and good generalization ability. RF and GBDT can be used not only as prediction models but also to evaluate the importance of each feature in predicting the CH4 storage capacity, which reflects the relative weight of each feature during ML training. However, RF is not sensitive to outliers, while GBDT is quite sensitive to them. By comparing the R² values, we found that RF has higher accuracy and a certain anti-noise ability, so we chose the RF model for subsequent analysis. To establish a better RF model without multicollinearity, we deleted five features (PLD, GCD, Dc, Vol, Vc) on the basis of a correlation analysis between the features. In this work, a strong correlation is considered to exist if the correlation coefficient is greater than 0.9; this left 18 features (see Table S3 and the details provided in the Supplementary material).
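A correlation filter with the 0.9 cutoff can be sketched as below. This is an illustrative assumption about the procedure: the text does not state which member of a correlated pair was removed, so the greedy "keep the first one encountered" rule, the function names and the toy values are all hypothetical. Constant columns (zero variance) are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(features, threshold=0.9):
    """Greedily keep features whose |r| with every kept feature is <= threshold.

    `features` maps a feature name to its list of values; insertion order
    decides which member of a strongly correlated pair survives.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy values for illustration only; not taken from the COF database.
toy_features = {
    "VSA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Vol": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly proportional to VSA
    "phi": [0.5, 0.1, 0.4, 0.2, 0.3],
}
kept_features = drop_correlated(toy_features)
```

In this toy case "Vol" is removed because its correlation with "VSA" exceeds 0.9, while "phi" is retained.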

Among the features considered in this work, Q⁰st and KH were classified as chemical variables, while the rest were considered as structural variables. The two classes of variables were then used to evaluate the accuracy of the RF model in two cases (structural variables only; both structural and chemical variables) for predicting the volumetric CH4 storage capacities of the COFs under different pressures. In order to examine the role of atom number density in RF model prediction, these variables were also combined with the above two types of variables for analysis (the third case), as shown in Table 2. Combining structural and chemical variables clearly increases the predictive power of the RF model compared to considering structural variables only, especially at 0.58 MPa, where R² increases from 0.938 to 0.975, indicating the importance of chemical variables under low pressure. At 6.5 MPa, R² reaches 0.963 after chemical variables are added to the structural variables, an increase of only 0.005, indicating that the high-pressure storage capacity is mainly determined by the overall space available within the material framework, while variables related to adsorption affinity have little influence on the model predictions. The results obtained for the deliverable capacity are similar to those obtained at 6.5 MPa. It can also be seen that R² increases for all three examined adsorption properties after adding atom number density, but the effect is not very significant. In the study of Fanourgakis et al. [40], it was found that ML methods using a combination of atom types and structural features as descriptors, instead of building blocks, could provide accurate predictions (with R² reaching 0.96). These results are almost the same as ours, and the subtle differences observed should be attributed to the different COF databases and the different descriptors used.

Table 2 Predictive performance of the RF model for volumetric CH4 storage capacities of COFs using only structural variables (I), both structural variables and chemical variables (II), and the former two types of variables plus atom number densities (III) at 298 K

To further study how the features affect the CH4 storage capacity, we trained optimal RF models using the 18 features and ranked the importance of the individual features. The principle of calculating feature importance with the RF algorithm is to add random noise to each feature in turn and examine the resulting influence on the model accuracy. If the accuracy drops significantly after adding noise, the feature has a great influence on the prediction results and thus is of high importance; otherwise, the feature is of low importance. Fig. 2 shows the feature importance rankings for the storage capacities at 0.58 MPa and 6.5 MPa and for the deliverable capacity between them. As shown in Fig. 2a, Q⁰st accounts for the largest contribution at 0.58 MPa (about 18.8%). This is because only a small amount of CH4 can be adsorbed in COFs at such low pressure, which is not enough to occupy the whole void space of the materials. Under such conditions, the adsorption is mainly driven by the van der Waals forces between the gas molecules and the framework atoms, leading to weaker correlations between the structural variables and the storage capacity. Thus, the predictive ability of the model can be improved by incorporating Q⁰st. The second most important feature is VSA (accounting for 14%). LCD, Vfree and KH also have a remarkable influence on the predictive performance of the RF model. Note that the feature importance of KH is smaller than that of Q⁰st, although a strong correlation exists between them. Both features are indicators of the adsorption affinity of an adsorbent towards specific gas molecules in the low-pressure range, and thus both would be expected to play a very important role in determining the prediction accuracy of the ML model. In the modeling process using the RF algorithm, each threshold of each feature needs to be enumerated exhaustively so as to find the best split feature and the optimal split point that minimize the squared error of the model prediction. If a feature is very important to the performance of the model and other features are closely related to it, the probability of selecting this feature is reduced when the decision tree searches for the best split feature, so that the importance of this feature is diluted. During RF modeling, owing to the randomness of feature selection when a decision tree is split, the probability of Q⁰st being selected as the best split feature is higher than that of KH. Therefore, the feature importance of Q⁰st is higher, while that of KH is weakened. At the same time, since VSA is also an important factor affecting the storage capacity under low pressure, the feature importance of KH is less evident than that of VSA.
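The noise-based importance principle described above can be illustrated with a permutation-style sketch, in which the "noise" is a random shuffle of one feature column and importance is the resulting drop in R². This is a stand-in under stated assumptions, not the authors' RF code: the fixed linear `toy_model`, the synthetic data and all names are hypothetical.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination relative to the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - u) ** 2 for y, u in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in R2 when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # destroy the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data: the target depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [5.0 * a + 1.0 * b for a, b, _ in X]


def toy_model(row):
    """Stand-in for a trained regressor (hypothetical)."""
    return 5.0 * row[0] + 1.0 * row[1]


importances = permutation_importance(toy_model, X, y)
```

The feature the model relies on most shows the largest drop, and a feature the model ignores shows none, which mirrors the ranking logic used for Fig. 2.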

For the storage capacity at 6.5 MPa, Fig. 2b shows that the relative importance ranking of the features is VSA > Si% > LCD >> φ. The largest contribution, from VSA (accounting for ~33%), indicates that it plays a major role in the prediction of the volumetric methane storage capacity, which is generally similar to the conclusions obtained in the literature [60]. This is because one of the main driving forces for adsorption is the pressure that pushes gas molecules into the channels of the adsorbent. With increasing pressure, more CH4 molecules are adsorbed in the COFs, and the preferential adsorption sites have already been occupied, so that the filling effect becomes the main factor influencing adsorption. From the viewpoint of materials science, a larger LCD is usually accompanied by a larger VSA. Thus, to accommodate more guest molecules at high pressure, COFs need to have a larger pore size and a higher specific surface area.

Fig. 2. The relative importance of the features obtained from the optimally trained RF models for CH4. (a) storage capacity at 0.58 MPa, (b) storage capacity at 6.5 MPa, and (c) 6.5-0.58 MPa deliverable capacity.

For the 6.5-0.58 MPa deliverable capacity (Fig. 2c), the relative feature importance of VSA remains the highest, accounting for 36%. The trend in the importance of the other features is basically similar to that observed at 6.5 MPa, indicating that the effects of these parameters on the storage capacity at high pressure and on the deliverable capacity are essentially the same. It is worth noting that the number density of silicon atoms is ranked as the second most important feature by the optimally trained RF model. In the entire database, the COFs containing Si generally exhibit good performance: more than 81% of them have a deliverable capacity greater than 120 cm³·cm⁻³, indicating that silicon content is highly effective in discriminating the adsorption performance of COFs. This means that selecting the silicon concentration as the split feature in the RF modeling process moves the model regression in the direction of reducing the prediction error. Consequently, this variable helps to improve the prediction accuracy of the RF model. In contrast, the proportions of other elements such as C, H and O are relatively high in both high-performance and low-performance materials, so the adsorption performance of the materials cannot be distinguished by the proportions of C, H and O alone. Therefore, their contributions are very low and can almost be ignored.

3.3. Decision tree classifiers for the CH4 storage capacities of COFs

Since the RF and GBDT models are composed of multiple decision trees, the classification path of an individual tree cannot be displayed visually, which makes them inconvenient for classification analysis. In contrast, a DT classification model can achieve this purpose and provide easy-to-follow rules for designing COFs with desired properties. We used the results obtained from RF to extract effective rules for DT model construction and training. Before building a classification model, an appropriate threshold must be chosen to avoid the problem of imbalanced classes; when the cost of experimentally or theoretically verifying candidates is not high, a lower cutoff value can be selected so that almost all potential high-performance candidates are identified. In this work, we selected the threshold based on the top 20% of the high-performance materials in the database [61]. The thresholds for the storage capacity at 6.5 MPa and for the deliverable capacity are 165 cm³·cm⁻³ and 139 cm³·cm⁻³, respectively. The instances in the data set are labeled as 1 or 0 according to whether their storage capacities are greater or less than the thresholds. For clarity, we show the nodes of the first three layers of the DT model, and the division paths are shown in Fig. 3. Two paths are marked with different colors, green and red, representing the selection criteria for low-performance and high-performance materials, respectively. At 6.5 MPa, the red path gives the following rule: when VSA ≥ 1515.58 m²·cm⁻³ and φ ≤ 0.876, COFs are classified as high-performance. This rule selects 12,971 COFs from the training set, of which 12,798 are truly high-performance materials (98.67% of those selected). The green path gives the screening rule for low-performance materials: when VSA ≤ 1291.78 m²·cm⁻³ and Q⁰st ≤ 25.228 kJ·mol⁻¹, a COF is classified as a low-performance material. This rule selects 62,411 COFs, of which 62,095 are truly low-performance materials (99.49% of those selected). This shows the effectiveness of the DT model. Similar screening rules were formulated for the deliverable capacity: when VSA ≤ 1206.32 m²·cm⁻³, COFs have lower deliverable capacity; when VSA ≥ 1327.62 m²·cm⁻³ and 0.707 ≤ φ ≤ 0.885, COFs have higher deliverable capacity. The specific predictions of the DT model on the training set and test set are shown in Table 3. As can be seen from the table, the accuracy on the training set reaches 99.35% and 98.90%, respectively, and the prediction ability for low-performance materials is as high as 99%. The results demonstrate that this model can effectively distinguish high-performance materials from low-performance materials and filter out low-performance COFs.
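The highlighted paths can be written as simple threshold rules. The sketch below encodes only the two paths quoted in the text for each property, not the full tree of Fig. 3, so anything outside these paths is returned as "undetermined". The function and argument names are hypothetical, and the quantity compared with 25.228 kJ·mol⁻¹ is assumed to be the zero-coverage heat of adsorption, as suggested by its unit.

```python
def classify_storage_6p5_mpa(vsa, phi, q_st):
    """Apply the two quoted rules for storage capacity at 6.5 MPa.

    vsa  : volumetric surface area in m2/cm3
    phi  : void fraction (dimensionless)
    q_st : zero-coverage heat of adsorption in kJ/mol (assumed quantity)
    """
    if vsa >= 1515.58 and phi <= 0.876:
        return "high"
    if vsa <= 1291.78 and q_st <= 25.228:
        return "low"
    return "undetermined"  # covered by other branches of the full tree


def classify_deliverable(vsa, phi):
    """Apply the two quoted rules for the 6.5-0.58 MPa deliverable capacity."""
    if vsa <= 1206.32:
        return "low"
    if vsa >= 1327.62 and 0.707 <= phi <= 0.885:
        return "high"
    return "undetermined"


# Hypothetical descriptor values, chosen only to exercise each branch.
label_a = classify_storage_6p5_mpa(vsa=1600.0, phi=0.85, q_st=12.0)
label_b = classify_storage_6p5_mpa(vsa=1200.0, phi=0.90, q_st=10.0)
label_c = classify_deliverable(vsa=1400.0, phi=0.80)
```

Rules of this form are cheap to evaluate and can pre-filter a candidate list before any molecular simulation is run.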

Fig. 3. Visualization of the path of the DT algorithm based on basic features, where the results of the first three layers of the model are shown. (a) 6.5 MPa for storage capacity; (b) 6.5-0.58 MPa for deliverable capacity.

Table 3 Predictive performance of the decision-tree model on the training and test datasets

In order to further analyze the generalization performance of the DT model and its prediction results on the test set, we applied the ROC curve, which gives the relationship between TPR and FPR, to evaluate the performance of the classification model. The further the curve bows toward the upper left corner, away from the random-guess baseline (black dotted line), the higher the prediction accuracy of the DT model. Fig. 4a shows that the area under the ROC curve is 0.99 at 6.5 MPa, where the precision is 96.99%, the recall is 91.40% and the F1-score is 94.11%. There are a total of 323,168 COFs in the test set, among which 259,048 instances are classified as low-performance, of which 253,194 are truly low-performance. A further 64,120 materials are classified as high-performance, among which 62,189 are truly high-performance materials. For the deliverable capacity, the area under the ROC curve is 0.99 and the F1-score is 95.23%. In this case, 252,763 and 63,992 instances are correctly classified as low-performance and high-performance materials, respectively, leading to a total accuracy of 98.02%. The results show a good performance of the DT model and demonstrate that COFs can be screened and classified according to their methane adsorption performance.
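The quoted test-set counts at 6.5 MPa can be turned into the evaluation indexes with a short script, which offers a way to cross-check the reported F1-score. The sketch assumes that "high-performance" is the positive class and uses the standard metric definitions; the function and variable names are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F1-score and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts quoted in the text for the 6.5 MPa storage-capacity classifier.
predicted_high, correct_high = 64_120, 62_189
predicted_low, correct_low = 259_048, 253_194

tp = correct_high                   # truly high, predicted high
fp = predicted_high - correct_high  # truly low, predicted high
tn = correct_low                    # truly low, predicted low
fn = predicted_low - correct_low    # truly high, predicted low

metrics = classification_metrics(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {100.0 * value:.2f}%")
```

The four counts sum to the 323,168 structures of the test set, which is a quick consistency check on the quoted numbers.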

3.4. Sorbent performance ranking

To the best of our knowledge, the best-performing COF for methane storage reported so far in the literature is COF-102 [62], which exhibits a volumetric storage capacity of 194 cm³·cm⁻³ at 6.5 MPa [30]. Therefore, in the following discussion, we focused on the COFs with a deliverable capacity greater than 190 cm³·cm⁻³, as these materials are likely to compete with current well-known materials. In this work, we emphasized the materials with regenerability (R) > 85% and ranked them in terms of volumetric deliverable capacity. The GCMC-simulated volumetric storage capacities of the COFs at 6.5 MPa range from 20.43 to 272.41 cm³·cm⁻³, and the deliverable capacities range from 7.50 to 227.81 cm³·cm⁻³. Based on the selected cutoff values (Nads at 6.5 MPa = 220 cm³·cm⁻³ and ΔN = 190 cm³·cm⁻³), the COFs are classified as "efficient" or "inefficient" materials depending on whether their CH4 storage properties are above or below the thresholds. In a previous study, Mercado et al. [39] found that the predicted maximum deliverable capacity of methane is 216 cm³·cm⁻³ at 6.5 MPa in the COF database built by them, and that 300 COFs have a deliverable capacity higher than 190 cm³·cm⁻³, among which 10% of the structures have a deliverable capacity of more than 200 cm³·cm⁻³. As shown in Fig. 5a, a number of COFs reach this threshold. Among a total of 651 COFs, there are 73 materials above the red line (R greater than 85%). The 10 COFs shown in Fig. 5b represent the most promising COFs with both high R (> 85%) and high deliverable capacity, all of which have 3D structures. These COFs are named COF-A to COF-J, and their adsorption properties are given in Table 4. COF-A is the material with the highest deliverable capacity (227.8 cm³·cm⁻³), superior to the best CH4 storage material reported so far. The microscopic adsorption behaviors of CH4 in COF-A are shown in Fig. S2. COF-G has the highest R value (88.5%), along with a deliverable capacity of 204.9 cm³·cm⁻³.
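The selection and ranking step can be sketched as a filter followed by a sort. The record layout, the function names and the example values below are hypothetical placeholders (they are not entries of Table 4), and regenerability is assumed to be the deliverable capacity divided by the storage capacity at 6.5 MPa.

```python
def regenerability(n_storage, n_depletion):
    """Percent regenerability from uptakes at 6.5 MPa and 0.58 MPa (assumed form)."""
    return 100.0 * (n_storage - n_depletion) / n_storage


def select_candidates(records, min_delta_n=190.0, min_regen=85.0, top=10):
    """Keep records above both cutoffs and rank them by deliverable capacity."""
    hits = [
        r for r in records
        if r["delta_n"] > min_delta_n and r["regen"] > min_regen
    ]
    return sorted(hits, key=lambda r: r["delta_n"], reverse=True)[:top]


# Hypothetical uptakes in cm3(STP)/cm3 at (6.5 MPa, 0.58 MPa).
raw = {"X1": (240.0, 30.0), "X2": (230.0, 45.0), "X3": (225.0, 28.0), "X4": (150.0, 10.0)}
records = [
    {"name": name, "delta_n": hi - lo, "regen": regenerability(hi, lo)}
    for name, (hi, lo) in raw.items()
]
ranked = select_candidates(records)
ranked_names = [r["name"] for r in ranked]
```

Here "X2" is rejected for its low regenerability and "X4" for its low deliverable capacity, leaving "X1" ahead of "X3".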

Fig. 4. Receiver Operating Characteristic (ROC) curves of the DT model on the test set for CH4: (a) storage capacity at 6.5 MPa and (b) 6.5-0.58 MPa deliverable capacity.

Fig. 5. (a) Regenerability versus deliverable capacity for CH4 adsorption in the COFs. The red dashed line is used to identify high-performance COFs with a regenerability (R) of no less than 85%; (b) Regenerability versus deliverable capacity for the COFs with R > 85% and deliverable capacity > 190 cm3·cm-3, where COF-A to COF-J are the identified top 10 materials.

Fig. 6a shows the relationship between the 6.5-0.58 MPa volumetric deliverable capacity and the total storage capacity at 6.5 MPa simulated for the entire COF database. In addition, we compared the storage performance of the identified top 10 COFs with that of some existing porous materials, as shown in Fig. 7. Clearly, materials with a high total storage capacity do not necessarily have a high deliverable capacity. Among the selected materials, MOFs perform better, with Cu-BTC and MAF-38 [63] achieving the DOE's CH4 storage target, whereas porous carbon materials have poor storage capacities at 6.5 MPa (Fig. 7a). The structural parameters and storage properties of these materials are listed in Table S4. Compared with the experimental data for existing COFs, the COFs identified from computational screening show higher deliverable capacities, and a small fraction of them exceed 200 cm3·cm-3, indicating their great potential for methane storage applications. Generally, an ideal material for onboard usage should have a good balance between high gravimetric and volumetric storage capacities. From this viewpoint, a further important step is to rationally balance the volumetric and gravimetric CH4 storage capacities in a single material. The screening results shown in Fig. 6b indicate that there is a trade-off between volumetric and gravimetric deliverable capacities. We compared the gravimetric storage capacities of the existing materials and the screened COFs, as shown in Fig. 7b. As far as we know, few existing materials can meet the new gravimetric target of 0.50 g·g-1, and most fall far below 0.3 g·g-1. Owing to their low density and high accessible surface area, COFs show an advantage in gravimetric storage capacity. Among the screened COFs, there are several materials that can almost meet the new DOE volumetric target, all of which also have high volumetric methane storage capacity. COF-D is the most promising material because of its high gravimetric storage capacity of 0.5 g·g-1 at 298 K and 6.5 MPa. These observations suggest that the identified top-performing materials exhibit the best balance between gravimetric and volumetric storage capacities compared to other nanoporous materials.
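The trade-off between volumetric and gravimetric capacities follows directly from the unit conversion between them, which the short Python sketch below illustrates. It assumes that the volumetric values are gas volumes at STP (273.15 K, 1 atm, ideal-gas molar volume 22414 cm3·mol-1) per cm3 of framework, which is the common convention but is not stated explicitly in this section. The function name `volumetric_to_gravimetric` and the example uptake and densities are hypothetical.

```python
M_CH4 = 16.043         # molar mass of methane, g/mol
V_MOLAR_STP = 22414.0  # ideal-gas molar volume at 273.15 K and 1 atm, cm3/mol


def volumetric_to_gravimetric(n_vol: float, density: float) -> float:
    """Convert cm3(STP) CH4 per cm3 framework to g CH4 per g framework.

    n_vol   -- volumetric capacity, cm3(STP)/cm3 (STP basis is an assumption)
    density -- framework (crystal) density, g/cm3
    """
    if density <= 0:
        raise ValueError("density must be positive")
    grams_ch4_per_cm3 = n_vol * M_CH4 / V_MOLAR_STP
    return grams_ch4_per_cm3 / density


# Hypothetical example: the same volumetric uptake in a light and a dense framework.
light = volumetric_to_gravimetric(200.0, density=0.3)
dense = volumetric_to_gravimetric(200.0, density=0.6)
print(f"light framework: {light:.3f} g/g, dense framework: {dense:.3f} g/g")
```

For a fixed volumetric capacity, the gravimetric capacity scales inversely with framework density, which is consistent with the advantage of low-density COFs noted above.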

Fig. 7. Comparison of the CH4 storage properties of the best materials identified in this work with those of high-performance materials reported in the literature. (a) Volumetric deliverable capacity versus total storage capacity; (b) Gravimetric deliverable capacity versus total storage capacity. Conditions: 6.5 MPa for total storage capacity and 6.5-0.58 MPa for deliverable capacity.

Table 4 Properties of the top 10 materials identified with the best CH4 adsorption performance

Fig. 6. High-throughput screening of the entire COF database for methane storage at 298 K. (a) Volumetric deliverable capacity versus total storage capacity; (b) Volumetric versus gravimetric deliverable capacities. hCOFs and CoRE-COFs represent the generated and experimental structures, respectively.

4. Conclusions

We used molecular simulation and several ML techniques to perform high-throughput screening of the methane storage performance of COFs in a large database. The comparison of four ML models with 23 features as input showed that the RF algorithm has the best prediction ability, with R2 reaching 0.993 and 0.992 at 0.58 and 6.5 MPa, respectively. By using the RF algorithm to reduce the dimensionality of the features, the quantitative analysis of relative importance showed that […] is the main factor affecting the adsorption capacity at 0.58 MPa. At high pressure, the main driving force of adsorption is pressure, and VSA carries the largest weight in the model. A simple binary DT model showed that an optimal combination of […], φ and VSA can be used as a rule of thumb for methane storage under high pressure. Moreover, the DT successfully yielded a classifier with an accuracy of over 96%. Applying this classifier to material prescreening will greatly reduce the computational cost, making it feasible to search for promising candidates in an unwieldy space. Through screening, 10 COFs with the best storage performance were identified, among which the best one achieves the current maximum CH4 deliverable capacity of 227 cm3·cm-3. Furthermore, the results showed that, by combining ML algorithms with GCMC simulation, structural features and the zero-coverage heat of adsorption can be taken into account in material analysis, maintaining high computational speed while greatly improving prediction accuracy. The results obtained in this work may be conducive to the screening of COFs, paving the way for faster and more reliable predictions.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was financially supported by the National Natural Science Foundation of China (22078004), the Fundamental Research Funds for the Central Universities (buctrc201727) and the Big Science Project from BUCT (XK180301).

Supplementary Material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cjche.2021.03.002.
