
Machine learning for molecular thermodynamics

2021-05-18

Jiaqi Ding, Nan Xu, Manh Tien Nguyen, Qi Qiao, Yao Shi 3, Yi He 4,*, Qing Shao *

1 College of Chemical and Biological Engineering,Zhejiang University,Hangzhou 310027,China

2 Chemical and Materials Engineering Department,University of Kentucky,Lexington,KY 40506,USA

3 Key Laboratory of Biomass Chemical Engineering of Ministry of Education,Zhejiang University,Hangzhou 310027,China

4 Department of Chemical Engineering,University of Washington,Seattle,WA 98195,USA

Keywords: Machine learning; Thermodynamic properties; Molecular engineering; Molecular simulation; Force field

ABSTRACT Thermodynamic properties of complex systems play an essential role in developing chemical engineering processes. It remains a challenge to predict the thermodynamic properties of complex systems over a wide range and to describe the behavior of ions and molecules in such systems. Machine learning emerges as a powerful tool to resolve this issue because it can describe complex relationships beyond the capacity of traditional mathematical functions. This minireview summarizes some fundamental concepts of machine learning methods and illustrates their applications in three aspects of molecular thermodynamics using several examples. The first aspect is applying machine learning methods to predict the thermodynamic properties of a broad spectrum of systems based on known data. The second aspect is integrating machine learning and molecular simulations to accelerate the discovery of materials. The third aspect is developing machine learning force fields that can eliminate the barrier between quantum mechanics and all-atom molecular dynamics simulations. The applications in these three aspects illustrate the potential of machine learning in the molecular thermodynamics of chemical engineering. We also discuss the perspective of broader applications of machine learning in chemical engineering.

1. Introduction

Machine learning has emerged as a versatile tool to predict and investigate the molecular thermodynamics of complex systems. One task of molecular thermodynamics is to predict the thermodynamic properties of complex systems with high accuracy. Traditional theoretical or semi-theoretical models have been successful in predicting the thermodynamic properties of various systems [1–11]. However, it remains a challenge to develop robust models that can predict the thermodynamic properties of complex systems at the molecular level within the traditional theoretical framework. Machine learning paves an avenue to develop such models. Machine learning refers to computer algorithms that can learn from experience [12,13]. Machine learning methods are well known for their extraordinary performance in self-driving vehicles and in chess games against human players [14–16]. Because machine learning models can describe complicated relationships between variables, we have witnessed growing applications of these models in the materials and chemical engineering fields, ranging from drug design [17–22] and structure exploration [23–27] to the prediction of molecular thermodynamics [28–34].

Machine learning can help research on molecular thermodynamics in at least three aspects (Fig. 1). First, it can help develop robust thermodynamic models for complex systems. Obtaining accurate thermodynamic properties of a system is essential for developing related processes. Indeed, numerous theoretical or semi-theoretical models have been developed to calculate the thermodynamic properties of various systems. These models usually consist of mathematical functions built with hypothesis-driven strategies, and the function parameters are determined from theory or experiments [8,10,35]. Many theory-based methods are very successful in predicting thermodynamic properties such as heat capacity or activity. However, developing these methods mostly relies on the intuition of researchers or a "trial-and-error" procedure. Machine learning methods provide a new path for predicting the thermodynamic properties of complex systems. Many machine learning models, such as neural networks and decision trees, can quantify the relationship between the targeted thermodynamic properties of a system and selected molecular descriptors [36–38]. Massive experimental and simulation databases provide the opportunity to develop such models [39–42].

Fig. 1. Three contributions of machine learning to molecular thermodynamics: (1) predicting thermodynamic properties based on given molecular descriptors; (2) designing materials by integrating with large-scale molecular simulations; and (3) developing many-body force fields that enable the simulation of complex systems.

Second, machine learning can predict the behavior of complex systems when combined with classical molecular simulations. Molecular simulations play an essential role in revealing the mechanisms governing the thermodynamics of complex systems. The simulations produce trajectories of the coordinates and velocities of atoms or particles that can be used to analyze their thermodynamic and kinetic properties [43–45]. This analysis is usually conducted with a "hypothesis-driven" strategy. Machine learning methods could help harvest more knowledge from the trajectories through a "data-driven" strategy. High-throughput molecular simulations contribute to this "data-driven" strategy by producing large amounts of data [13,46–50]. These data far exceed the scale of classical experimental methods, and the scale and diversity of the databases affect the performance of the machine learning model.

Third, machine learning can provide "many-body" force fields for molecular simulations, enabling a fast and accurate description of the systems [51–55]. Empirical force fields like OPLS-AA, CHARMM, and Amber often use pre-set functions with fitted parameters to describe the pairwise interactions of a multi-atom system [56–59]. Classical MD simulations with empirical force fields are computationally friendly but lack the ability to describe complex interactions involving bond breaking and formation. In contrast, quantum mechanical simulations can describe complicated systems, but their computational cost is quite intensive. Machine learning force fields have emerged recently to achieve a balance between computational cost and accuracy. Their unique feature originates from the ability of machine learning models to describe the complex interactions of systems. Simulations using machine learning force fields can achieve the accuracy of quantum mechanical calculations while keeping the efficiency of classical MD.

This minireview will introduce several examples of the applications of machine learning in the three aspects mentioned above. The goal of this minireview is to inspire the application of machine learning in molecular thermodynamics and to introduce the necessary knowledge of machine learning. Many comprehensive reviews have summarized the applications of machine learning methods in the development of materials and processes [60–65]. The rest of this article is organized as follows. Section 2 introduces some basic concepts of machine learning. Section 3 illustrates the applications of machine learning in predicting complex relationships based on experimental databases. Section 4 discusses the applications of machine learning based on high-throughput molecular simulations using the examples of peptide design and the prediction of the thermodynamic properties of alkanes. Section 5 discusses the applications of machine learning in describing complex potentials that enable fast and accurate simulations. Finally, we present a perspective on machine learning in molecular thermodynamics.

2. Data, Descriptor, and Algorithm

As stated above, machine learning refers to methods that can predict the output of a system based on selected features (descriptors) using algorithms trained on a set of databases. The development of a machine learning model includes four critical steps: (a) creating a database containing suitable samples, (b) selecting the proper descriptors and algorithm, (c) training the algorithm based on the database and descriptors, and (d) utilizing the trained model to predict the system output based on the selected descriptors [66,67]. Fig. 2 shows the schematic of a typical process of developing a machine learning model. The success of a machine learning model depends on three ingredients: the database, the descriptors, and the algorithms [66,68,69]. Research on these three ingredients is growing rapidly, and many comprehensive reviews have covered the relevant knowledge [62,66,67]. This section will only introduce them briefly.
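The four steps above can be sketched end to end with a toy regression problem. Everything here is illustrative: synthetic data stands in for a real database, and NumPy least squares stands in for the training algorithm.

```python
import numpy as np

# Step (a): a toy "database" of 20 samples, 2 descriptors, and a target
# property (synthetic data for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5   # known ground-truth relation

# Steps (b)/(c): choose descriptors (here, both columns) and train an
# algorithm -- ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Step (d): use the trained model to predict the property of a new sample.
x_new = np.array([0.4, 0.6, 1.0])   # descriptors + intercept term
y_pred = float(x_new @ coef)
```

Because the synthetic target is exactly linear, the fit recovers the generating coefficients, so the prediction for the new sample equals 3(0.4) − 2(0.6) + 0.5 = 0.5.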

2.1. Database

A database refers to a collection of experimental and computational results that can be used to train and validate a machine learning model [66]. Quite a few thermodynamic databases have been published [70,71]. Indeed, these databases are playing an important role in the development of machine learning models. However, they cannot satisfy the growing demands created by the rapid progress in machine learning, so researchers usually need to develop their own databases for specific purposes. Two issues are critical for developing a database: the quality and the scope of the data [72]. The quality of the data determines how well the machine learning model may perform [17,72]. Data cleaning is often needed to prepare a qualified data set by removing incorrect, irrelevant, or duplicated data and by fixing unformatted data [66,73]. The scope and diversity of the data also determine the performance of the machine learning model [17,72,74].
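As an illustration of data cleaning (the records below are hypothetical, not drawn from any actual thermodynamic database), a minimal pass might drop incorrect, incomplete, and duplicated entries:

```python
# Hypothetical raw records: (compound name, temperature / K, density / g·cm^-3).
raw = [
    ("ethanol", 298.15, 0.789),
    ("ethanol", 298.15, 0.789),   # exact duplicate -> drop
    ("methanol", 298.15, None),   # missing value -> drop
    ("hexane", -10.0, 0.655),     # unphysical temperature -> drop
    ("hexane", 298.15, 0.655),
]

def clean(records):
    """Remove incomplete, unphysical, and duplicated entries."""
    seen, cleaned = set(), []
    for name, temp, rho in records:
        if rho is None or temp <= 0:   # incorrect / incomplete entries
            continue
        key = (name, temp, rho)
        if key in seen:                # duplicated entries
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned

dataset = clean(raw)
```

Real pipelines add unit normalization and outlier checks on top of this, but the deduplicate-and-validate core is the same.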

Fig. 2. A general process of developing a machine learning model.

2.2. Descriptor

Descriptors refer to the features that can be used to characterize the system [67]. They are important attributes of the data and are selected as inputs for the machine learning model. A traditional molecular descriptor can describe the system qualitatively or quantitatively. For instance, a qualitative descriptor for a molecule can be whether it is hydrophobic (1: yes, 0: no); a quantitative descriptor can be its molecular weight. Another widely used descriptor is the group contribution descriptor, which represents the frequency of each distinct fragment [75–78]. Thousands of descriptors have been developed to date, and this large number emphasizes the importance of selecting proper ones. Generally, the chosen descriptors should be relevant to the target output and should not contain highly correlated ones (e.g., hydrophilicity and solubility in water) [67,79,80]. Several methods have been developed to select descriptors by eliminating the redundant ones and retaining the most relevant features [81–85]. Some novel methods for constructing descriptors were inspired by the field of natural language processing (NLP) [86–89]. One example is Mol2Vec [90], which considers substructures as "words" and compounds as "sentences" (see the example in Section 3.2). These methods learn the feature vectors of structures from classical molecular representations such as the simplified molecular input line entry system (SMILES) and molecular fingerprints [91,92]. However, the descriptors used to describe bulk systems such as crystalline structures and amorphous systems are quite different from molecular descriptors. Several techniques have been developed to describe the local environment of each atom in the system, including smooth overlap of atomic positions (SOAP) [93–95] and atom-centered symmetry functions (ACSF) [96–98] (see the example in Section 5). SOAP uses orthonormal functions to obtain the local expansion of a Gaussian-smeared atomic density, while ACSF determines the local structure using a series of two- and three-body symmetry functions representing the environment near a specific atom.
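A simple, commonly used screen for redundant descriptors is to drop any candidate whose Pearson correlation with an already kept descriptor exceeds a threshold. A sketch with synthetic columns (the descriptor names and data are illustrative only):

```python
import numpy as np

# Synthetic descriptor table: "aqueous_solubility" is nearly a copy of
# "hydrophilicity", while "mol_weight" is independent.
rng = np.random.default_rng(1)
n = 200
hydrophilicity = rng.normal(size=n)
aqueous_solubility = hydrophilicity + 0.05 * rng.normal(size=n)
mol_weight = rng.normal(size=n)

X = np.column_stack([hydrophilicity, aqueous_solubility, mol_weight])
names = ["hydrophilicity", "aqueous_solubility", "mol_weight"]

def drop_correlated(X, names, threshold=0.9):
    """Greedily keep a column only if its |Pearson r| with every
    previously kept column is below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

selected = drop_correlated(X, names)
```

The nearly duplicated solubility column is rejected while the independent one survives; more sophisticated filters (mutual information, variance inflation) follow the same pattern.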

2.3. Algorithm

Algorithms refer to mathematical models that can be used to predict a variable of a system based on other variables [99]. Algorithms connect the descriptors and outputs either qualitatively or quantitatively; such a connection helps understand the systems and predict new data [99]. Selecting an appropriate algorithm is another key to building a successful machine learning model. Algorithms are probably one of the most active areas in machine learning [100], and many algorithms and their derivatives have been proposed recently [101]. These algorithms can be grouped into two main categories: supervised learning and unsupervised learning [17]. Supervised learning refers to learning with labeled data (known inputs and outputs) and training a model that can predict the values of future inputs. For applications in molecular thermodynamics, these output values can be either the category of an unknown compound (a classification task) or the variables of a system (a regression task). Typical methods include neural networks [102], Gaussian process regression (GPR) [103], and support vector machines (SVM) [104]. On the other hand, unsupervised learning methods use unlabeled data to train models that classify input data with no or little human intervention. Common methods include k-means clustering [105], hierarchical clustering [106], and Gaussian mixture models [107]. The features of these two categories have been extensively covered in many textbooks and in the literature [67,108].
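As a minimal example of the unsupervised category, the following is a bare-bones k-means clustering (Lloyd's algorithm) on synthetic two-dimensional data, with a deterministic initialization chosen for reproducibility:

```python
import numpy as np

def kmeans(X, init, n_iter=50):
    """Bare-bones Lloyd's algorithm: assign each point to the nearest
    centroid, then move each centroid to the mean of its points."""
    centroids = init.astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated point clouds stand in for two families of compounds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
labels, centroids = kmeans(X, init=X[[0, -1]])   # one seed point per cloud
```

With no labels provided, the algorithm still recovers the two groups; this is the sense in which unsupervised methods "classify input data with little human intervention."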

3. Machine Learning in Predicting Thermodynamic Properties

The development of chemical processes such as reaction, separation, and purification relies on the thermodynamic properties of the substances involved. Indeed, researchers have obtained massive amounts of data on thermodynamic properties. However, it is nearly impossible to collect data for all substances and their combinations; thus, a predictive model is critical.

Traditional predictive models were usually mathematical functions derived by "hypothesis-driven" methods. Machine learning approaches provide an alternative and efficient way to predict thermodynamic properties, bypassing the mechanism and any prior knowledge of the form of the equation of state. The large body of experimental data has given rise to many wide-ranging databases of thermodynamic properties, such as the National Institute of Standards and Technology (NIST) chemistry webbook [71] and the Design Institute for Physical Properties (DIPPR) database [70]. These databases can be used to generate the initial input of the machine learning process. The combination of machine learning and these databases has been used to investigate properties including critical properties, the enthalpy of phase change, and other physical properties [31,32].

We will illustrate the ability of machine learning using two examples. The first example deploys neural network and support vector machine models to predict the density and viscosity of biofuel compounds. The second example predicts the solvation free energy of organic solutes in generic solvents.

3.1. Liquid density and viscosity

Liquid density and viscosity are critical thermodynamic properties in the petrochemical, aviation fuel, and other fields [109,110]. The Creton group [111] developed machine learning models to predict the density (ρ) and dynamic viscosity (η) of biofuel compounds. They used experimental data from the DIPPR database containing 5634 ρ values and 3547 η values of hydrocarbons and oxygenated compounds at temperatures ranging from 88 to 723 K.

They compared two sets of descriptors: molecular descriptors and functional group count descriptors (FGCD). The molecular descriptors, including fast descriptors, spatial descriptors, DMol3 descriptors, Forcite energetics, and Jurs descriptors, were computed for every compound within the Materials Studio package [112]. These descriptors distinguish chemistries according to their size, bonds, branches, dipole moment, etc. Some of these descriptors, such as molecular weight, are obviously related to the properties of concern, ρ and η. Other features may determine the microscopic behavior of molecules; for instance, hydrocarbons with a lower degree of branching are expected to pack more closely, which affects molecular thermodynamics. The FGCD set contains the counts of 26 different functional groups, including hydrogen, the methyl group, secondary carbon, the methylene group, etc. Together with temperature as an additional descriptor, the most relevant features were selected by the forward selection method [113] for both the molecular descriptors and the FGCD.
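Forward selection can be sketched as a greedy loop that repeatedly adds whichever remaining descriptor most reduces the fitting error. The example below uses ordinary least squares on synthetic data, not the actual DIPPR descriptors:

```python
import numpy as np

def fit_rmse(X, y):
    """OLS fit on the given columns; return the in-sample RMSE."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def forward_selection(X, y, n_select):
    """Greedy forward selection: at each step, add the descriptor that
    lowers the fit RMSE the most."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best = min(remaining, key=lambda j: fit_rmse(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic example: only descriptors 0 and 2 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.01 * rng.normal(size=100)
selected = forward_selection(X, y, n_select=2)
```

The stronger descriptor (index 0, with the larger coefficient) is picked first, then index 2; in practice the stopping point is chosen with a held-out validation set rather than a fixed `n_select`.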

Different machine learning models were compared in their ability to predict liquid density and viscosity, including feedforward neural networks (FNN) and support vector machines (SVM). The FNN is one of the most basic artificial neural networks, consisting of an input layer, several hidden layers, and an output layer [114]. Each layer is composed of many neurons, which are individual computation units. Neurons in these layers receive signals from each neuron in the previous layer, conduct a linear computation followed by a non-linear activation, and deliver the output to the next layer. The neurons have parameters such as weights and biases, which are optimized during the training process. The SVM is another important model in the machine learning field. Combined with the kernel method [115], an SVM can transform linearly inseparable data into linearly separable data in a high-dimensional space.
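The layer-by-layer computation described above (a linear map, then a non-linear activation) reduces to a few lines. The weights below are random, untrained values for illustration only:

```python
import numpy as np

def forward(x, layers):
    """One forward pass through a feedforward network: each layer applies
    a linear map (weights and bias) followed by a tanh activation; the
    final output layer is left linear, as is usual for regression."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:   # activate hidden layers only
            x = np.tanh(x)
    return x

# A tiny 2-3-1 network (2 descriptors, 3 hidden neurons, 1 property).
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(2, 3)), np.zeros(3)),   # input -> hidden
          (rng.normal(size=(3, 1)), np.zeros(1))]   # hidden -> output
y = forward(np.array([[0.5, -0.2]]), layers)
```

Training consists of adjusting the `(W, b)` pairs to minimize the prediction error over the database; the forward pass itself is unchanged.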

They constructed four machine learning models from the combinations of descriptors and algorithms. All the models yielded accurate predictions, with coefficients of determination (R²) higher than 0.98. FNN and SVM had comparable performance for both ρ and η. Compared with the molecular descriptors, the FGCD model had a lower root mean squared error (RMSE) but a higher mean absolute error (MAE) for ρ. However, the molecular descriptors provided better R², RMSE, and MAE than FGCD for η prediction. They then built a consensus model by averaging the predictions of the different models. Fig. 3 lists the statistical parameters of the consensus model and plots the predicted results against the experimental data. The results show that consensus models are more robust than individual ones.
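The statistics used in this comparison (R², RMSE, MAE) and the consensus averaging itself are straightforward to compute. The predictions below are made-up numbers, not the published results:

```python
import numpy as np

def metrics(y_true, y_pred):
    """R², RMSE, and MAE -- the statistics used to compare the models."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    return r2, rmse, mae

# Hypothetical predictions from two individual models on the same test set.
y_true = np.array([0.70, 0.80, 0.90, 1.00])
pred_a = np.array([0.72, 0.78, 0.93, 0.98])   # model A
pred_b = np.array([0.68, 0.83, 0.88, 1.03])   # model B
consensus = (pred_a + pred_b) / 2.0           # simple averaging

r2, rmse, mae = metrics(y_true, consensus)
```

When the individual models' errors are partly uncorrelated, as in this toy case, they cancel in the average, which is why the consensus model tends to be more robust.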

Another study, by Cai et al. [116], focused on the viscosity-versus-temperature characteristics of pure hydrocarbons, using machine learning to predict the parameters of an empirical viscosity equation. They obtained the dynamic viscosity data of 261 pure hydrocarbons from the NIST database. These data were first used to regress an improved Andrade equation [117], which determines η(T) through only two parameters, B and T0. A neural network was then developed to predict B and T0 from 35 chemical descriptors and the molecular weight. They used 15 basic descriptors to represent all possible groups in hydrocarbons (paraffins, naphthenes, and aromatics) and 20 united groups to depict isomers. The machine learning model contains two 36-12-1 neural networks, one each for B and T0. The dynamic viscosity η(T) of a hydrocarbon was then estimated by the Andrade equation with the predicted parameters. Fig. 4 shows the scheme used in this work.

The prediction performance for B and T0 was measured on the test set. The correlation coefficients for B and T0 are 0.987 and 0.994, and the average relative errors are 4.08% and 2.50%, respectively. The predicted viscosities are in good agreement with the experimental data; however, comparing the predicted and experimental viscosities of several groups of homologs, they found that the predicted results are lower at higher temperatures. To verify the ability to distinguish isomers, they used the trained model to predict η(T) for tetramethylpentane and ethyltoluene isomers. As shown in Fig. 5, the Andrade equation with corrected parameters was able to estimate η(T) for the different isomers.

This case demonstrates that machine learning can be applied at different steps of predicting system properties. Researchers can either use machine learning to regress the mapping from system features to properties, or use it to estimate the parameters of empirical equations. The advantage of the latter is that once the parameters are obtained, the equation is convenient to reuse.

3.2. Solvation free energy

Solvation affects many chemical processes, such as phase equilibrium. Solvation free energy is another important thermodynamic property and attracts researchers' interest in the fields of chemical and biological engineering [118–120]. Three steps determine the solvation free energy: (a) the formation of a cavity within the solvent, which is both entropically and enthalpically unfavorable; (b) the separation of solute particles from the solute bulk, which is entropically favorable but enthalpically unfavorable; and (c) the entry of solute particles into the cavities of the solvent, where they interact with the surrounding environment, which is favorable in both entropy and enthalpy. Many efforts have been made to derive the solvation free energy from other thermodynamic properties or to calculate it with quantum mechanics, and some of these methods have made remarkable progress [121–123]. However, solvation theory remains rather obscure, which limits the application of theoretical methods. Machine learning is a possible solution because of its ability to discover information from data.

Several databases have collected solvation parameters, such as ESOL (Estimated Solubility) [41], FreeSolv (the Free Solvation Database) [124], and MNsol (the Minnesota Solvation Database) [125]. The first two provide solubilities and solvation free energies in aqueous solvents, while MNsol contains experimental data for 3037 solvation free energies (or transfer free energies), covering 790 solutes and 92 solvents.

Fig. 3. Statistical parameters of the consensus model and the plot of predicted results against experimental data. Reprinted (adapted) with permission from [111]. Copyright (2020) American Chemical Society.

Fig. 4. Schematic structure of the neural network model used in [116]. Two neural network models were trained to predict B and T0, which were then used to calculate η(T) through the Andrade equation. Reprinted (adapted) with permission from [116]. Copyright (2020) American Chemical Society.

Fig. 5. Comparison of experimental and predicted values for tetramethylpentane and ethyltoluene isomers using the neural network model and the Andrade equation. Reprinted (adapted) with permission from [116]. Copyright (2020) American Chemical Society.

Lim and co-workers developed Delfos (a deep learning model for solvation free energies in generic organic solvents) [126] based on MNsol, demonstrating a machine learning approach that extracts the dominant substructures in the solvation process. First, they eliminated undesired data from MNsol, retaining 2495 solvation energies covering 418 solutes and 91 solvents. Unlike the molecular descriptors mentioned in the previous cases, structural information can also be expressed as a SMILES [92] string and then embedded into a sequence of vectors by a suitable tool, in this case Mol2Vec [90]. Mol2Vec originates from Word2Vec [127] in the field of NLP. This method learns high-dimensional embeddings of substructures from SMILES strings and molecular fingerprints [91]. The positions of molecular structures in the embedded vector space are related to their chemical or physical properties. This approach has been successfully used to express protein sequences and has inspired studies of molecular property prediction [87,128].

Fig. 6 shows the primary architecture of Delfos. Bidirectional recurrent neural networks (BiRNNs) [129] were employed as encoders to extract crucial features from the solute and solvent vectors. Inspired by NLP, an attention layer was used to recognize important substructures in the RNN outputs. The attention mechanism [130] is an algorithm that can interpret the alignment relationship between input and output, and it has been widely used to improve RNN performance in speech recognition and machine translation. Recently, this algorithm has drawn chemists' interest for its potential to produce interpretable results from machine learning studies [131,132]. In this study, the attention layer of Delfos extracts the dominant factors that influence the solvation free energy. The results show that the polarity and hydrophilicity of molecular substructures play important roles during solvation, which is consistent with chemical intuition.
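The core of such an attention layer is a softmax over per-position scores followed by a weighted sum. A minimal sketch, with a made-up encoder output and an untrained scoring vector (Delfos' actual layer is learned jointly with the encoders):

```python
import numpy as np

def attention_pool(H, w):
    """Score each position, softmax the scores into weights, and return
    the weighted sum -- the core of a simple attention layer."""
    scores = H @ w                            # one scalar score per position
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()
    return weights, weights @ H               # weights and pooled context

# Hypothetical encoder output: 4 substructure positions, 3 features each.
H = np.array([[0.1, 0.0, 0.2],
              [0.9, 0.8, 0.7],   # a "dominant" substructure
              [0.2, 0.1, 0.0],
              [0.0, 0.2, 0.1]])
w = np.array([1.0, 1.0, 1.0])    # illustrative, untrained scoring vector
weights, context = attention_pool(H, w)
```

The interpretability comes from `weights`: after training, the positions with the largest weights are read off as the substructures dominating the prediction.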

Fig. 6. The fundamental architecture of Delfos. Each encoder network has one embedding and one recurrent layer, while the predictor has a fully connected MLP layer. The two encoders share an attention layer, which weights the outputs from the recurrent layers. Black arrows indicate the flow of input data. Reprinted with permission from [126]. Published by The Royal Society of Chemistry.

The prediction errors of Delfos were measured through 10-fold cross-validation and cluster cross-validation. When the training set covered all substructures in the test set, the RMSE and MAE were ±0.57 kcal·mol⁻¹ (1 cal = 4.1868 J) and ±0.30 kcal·mol⁻¹, respectively, and the Pearson correlation coefficient was up to 0.96. The prediction error was quite small, even as accurate as some quantum mechanical hydration energy calculations [133]. Compared with COSMO-RS [134], an advanced QM solvation method, Delfos provided similar accuracy for aqueous systems. At the same time, it showed better performance for non-aqueous organic solvents, with an MAE 59% of that of COSMO-RS. Cluster cross-validation was then used to measure the predictive performance with incomplete training sets. The clustering algorithm separates solutes (or solvents) into clusters by chemical family, so some substructures in the test set may never appear in training during cross-validation. When a unique substructure exists only in the test set, the prediction deviation increases dramatically.
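The difference between random k-fold splits and cluster cross-validation can be sketched with a group-aware fold assignment, in which every member of a chemical family lands in the same fold, so a held-out fold can contain substructures never seen in training. The family labels below are illustrative:

```python
# Cluster (group-aware) cross-validation split: whole families are
# assigned to folds, never split across the train/test boundary.
def group_folds(groups, n_folds):
    """Assign whole groups to folds round-robin; return lists of sample
    indices, one list per fold."""
    unique = sorted(set(groups))
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique)}
    folds = [[] for _ in range(n_folds)]
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return folds

# Hypothetical solutes labelled by chemical family.
families = ["alcohol", "alcohol", "alkane", "alkane", "ester", "ester"]
folds = group_folds(families, n_folds=3)
```

Holding out fold 0 means no alcohol is ever seen in training, which is exactly the harder, more realistic test that made Delfos' errors grow for unseen substructures.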

One may draw several insights from the machine-learning-based prediction of thermodynamic properties: (1) experimental chemical databases have significant potential when combined with machine learning; (2) datasets should be thorough and cover as many of the substructures in the test set as possible; (3) although we should not expect data fitting to completely replace the exploration of internal mechanisms, it may help researchers understand the underlying mechanisms while providing more accurate and targeted data.

4. Machine Learning with High-throughput Molecular Simulation

Molecular simulations can provide massive amounts of data for machine learning. Some critical features are hard to obtain from experiments; for instance, there are few diffusion-property datasets because of the experimental difficulty and the nearly countless solute–solvent combinations. High-throughput frameworks enable simulations to produce datasets several orders of magnitude larger than experimental ones. Moreover, with enhanced sampling algorithms (e.g., active learning) embedded in the high-throughput process, datasets derived from simulation can achieve great diversity.

We will show how molecular simulation and high-throughput approaches can generate datasets and be combined with machine learning through two examples. The first concerns peptide design, and the second predicts the thermodynamic properties of alkanes.

4.1. Peptide design

Peptides can self-assemble into ordered structures [135,136], and their assembly can be used to develop new materials with specific functions. The π-conjugated oligopeptides contain an aromatic core and symmetrical oligopeptide wings on both sides. These oligopeptides can aggregate into ordered structures through the specific interactions between the aromatic rings [137–140]. The DXXX-π-XXXD family has a π-core and oligopeptide wings with two terminal aspartic acids (Asp), where X represents one of the 20 natural amino acids. The two exposed carboxyl groups of Asp give the oligopeptide pH-triggered assembly properties: the residues are protonated at low pH, eliminating the electrostatic repulsion between adjacent molecules.

A large number of MD and DFT simulations have been conducted to examine the effects of specific sequences and π-cores on the self-assembly process and the optoelectronic properties. In a study by the Ferguson group [141], the self-assembly of DFAG-OPV3-GAFD oligopeptides, where OPV3 is oligo-phenylenevinylene, was investigated by molecular dynamics simulation. The aggregates tend to form β-sheet-like stacks, driven by the π-stacking of the aromatic cores, hydrophobic interactions, and hydrogen bonds between peptide chains. The photophysical properties of π-conjugated peptides come from the electron delocalization near the π-core. Another DFT study [142] examined similar oligopeptides, suggesting that the spectral absorption comes from the C 2p orbital transition and that large residues on the peptide wings lead to a large twist angle between adjacent π-cores, thus reducing the electron transfer efficiency.

These properties are determined by the π-cores and the amino acid sequences, while the variety of sequences grows exponentially with the chain length. It is almost impossible for traditional experiments or simulations to cover this chemical space. Therefore, high-throughput computing and machine learning can be used to predict properties and direct peptide design.

The Ferguson group trained a machine learning model [143] to predict the self-assembly properties of DXXX-π-XXXD oligopeptides with different π-cores, including naphthalenediimide (NDI) and perylenediimide (PDI). The assembly property was measured by an alignment metric determined by the association distance and alignment distance between two oligopeptides. They calculated the oligomerization energies of 26 oligopeptides using MD simulations and regressed the energy as a function of selected descriptors using a multiple linear regression (MLR) model. The alignment metrics were then obtained by fitting a bivariate Gaussian to the oligomerization energies. They ultimately used this scheme for a high-throughput screen of all 9826 sequences of the DXXX-π-XXXD family.
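The regress-then-screen idea, fitting a cheap MLR surrogate on a small set of simulated energies and then ranking a much larger candidate pool, can be sketched as follows. The descriptors and energies are synthetic; only the set sizes (26 training peptides, 9826 candidates) echo the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Simulated" training data: 26 peptides with 8 descriptors each, and an
# oligomerization energy generated from a hidden linear rule plus noise.
X_train = rng.uniform(size=(26, 8))
true_w = rng.normal(size=8)
E_train = X_train @ true_w + 0.05 * rng.normal(size=26)

# MLR fit (least squares with an intercept) -- the cheap surrogate model.
A = np.hstack([X_train, np.ones((26, 1))])
coef, *_ = np.linalg.lstsq(A, E_train, rcond=None)

# High-throughput screen: evaluate the surrogate over the full pool and
# rank candidates by predicted energy.
X_pool = rng.uniform(size=(9826, 8))
E_pred = np.hstack([X_pool, np.ones((9826, 1))]) @ coef
ranking = np.argsort(E_pred)   # candidate indices, lowest energy first
```

The surrogate evaluation costs microseconds per sequence, which is what makes screening the whole 9826-member family feasible when each MD energy takes hours.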

The selection of descriptors included three steps. First, more than 5000 descriptor candidates were generated by the PaDEL software package [144]. Then, 247 candidates were selected based on the following criteria: (1) descriptors should be stable and not change dramatically with the transformation of the peptide configuration, (2) descriptors should be informative, and (3) redundant descriptors should be removed. Finally, eight descriptors were selected after stepwise forward selection.

The high-throughput screening results indicated that, compared with the NDI core, the larger PDI core increased the oligomerization free energy by 15 kBT. Larger residues, especially phenylalanine (Phe), could increase the oligomerization free energy by 2.5 kBT. Even though the prediction of the alignment metric is only qualitative, it still identified a promising candidate, DAVG-PDI-GVAD, with good self-assembly performance that had not been studied before. This work is less accurate for peptides with polar groups because of the small size of the training set and the insufficient number of residues covered; however, it shows the ability of MD-based machine learning to predict the properties of complex systems.

A more recent work [145] combined active learning [146] and coarse-grained molecular dynamics (CGMD) simulations to predict peptide self-assembly properties. This work studied the DXXX-OPV3-XXXD oligopeptide family, in which OPV3 is oligo-phenylenevinylene, screening a chemical space of 1331 members. Fig. 7 shows the workflow of active learning and CGMD simulation. First, CGMD simulations were conducted for an initial sample of 90 peptides to obtain their fitness function fi, determined by the mean number of intermolecular π-core–π-core contacts in the terminal aggregates. Then, a variational autoencoder (VAE) [147] was used to generate a latent-space embedding of the DXXX-OPV3-XXXD family. In short, the VAE accepts the input (Ai, Ti), the adjacency matrix and the composition vector; learns and constructs a Gaussian-distributed latent-space embedding of lower dimension; and regenerates the tuple from the latent space. In the third step, Gaussian process regression (GPR) was used to train a surrogate model relating the fitness function to the peptide coordinates in the latent space. Finally, active learning was implemented to improve the accuracy of the GPR model. In each active learning cycle, the CGMD simulation results of 4 extra cases were added to the model after every prediction, and the updated model was then used to make a new prediction. The selection of the extra peptides was guided by the previous model and Bayesian optimization. The researchers completed 25 additional cycles to obtain a stable GPR model, with a total sample of 186 peptides and a 5-fold cross-validation R² of 0.78.
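The surrogate-plus-active-learning loop can be sketched in miniature: a minimal Gaussian process regressor with an RBF kernel, a cheap one-dimensional function standing in for the CGMD fitness evaluations, and uncertainty-based selection of the next candidate. All choices here (kernel, length scale, one query per cycle, acquisition rule) are illustrative, not those of [145]:

```python
import numpy as np

def gpr(X_train, y_train, X_query, length=0.5, noise=1e-6):
    """Minimal GP regression with an RBF kernel; returns the posterior
    mean and variance at the query points."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-0.5 * d2 / length ** 2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = k(X_query, X_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, var

# A cheap 1-D function stands in for the expensive CGMD fitness runs.
f = lambda x: np.sin(3.0 * x[:, 0])
pool = np.linspace(0.0, 2.0, 50)[:, None]   # candidate latent coordinates
idx = [0, 49]                               # small initial sample

for _ in range(6):                          # active-learning cycles
    mean, var = gpr(pool[idx], f(pool[idx]), pool)
    new = int(np.argmax(var))               # query the most uncertain point
    if new not in idx:
        idx.append(new)                     # "run CGMD" on the new candidate

mean, var = gpr(pool[idx], f(pool[idx]), pool)   # final surrogate
```

Each cycle spends the simulation budget where the surrogate is least certain, which is why a handful of evaluations can pin down the whole pool far better than the same number of randomly chosen runs.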

Fig. 7. The active-learning workflow of [145]. The cycle consists of four parts: CGMD simulations, latent-space embedding, GPR training, and active learning. Reprinted (adapted) with permission from [145]. Copyright (2020) American Chemical Society.

The GPR model ranked the 1331 amino acid sequences, revealing the key factors of self-assembly performance. The trends showed that amino acids with large aromatic residues, such as phenylalanine, tryptophan, and tyrosine, are more likely to hinder the π-core stacking, especially when these residues are located closest to the core. The highly ranked peptides contain abundant small hydrophobic residues. Contrary to previous understanding, the active-learning model predicted that the presence of methionine at the position nearest to the π-core might improve self-assembly ability, even though thioether groups had been thought to disfavor hydrophobic interaction.

Using unsupervised spectral clustering, the researchers compared the average amino acid composition between peptides with different performance levels. The initially labeled 186 peptide sequences were first divided into three clusters, good, intermediate, and poor assemblers, using agglomerative hierarchical clustering. The remaining 1145 peptides were then assigned to these three clusters by nearest-neighbor assignment according to their positions in the VAE latent space. The average composition was analyzed over all 1331 peptides. The bulky aromatic residues tryptophan and phenylalanine were found to be disfavored at all positions in the sequence, and tyrosine was moderately disfavored. The smaller hydrophobic residues, including glycine, isoleucine, and leucine, were enriched in good assemblers, while aspartic acid, glutamic acid, and methionine residues do not have an obvious effect on self-assembly performance. These results can be further interpreted as steric hindrance effects or promotion/hindrance of core stacking, guiding future research.
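The nearest-neighbor assignment step can be sketched in a few lines: each unlabeled latent point inherits the cluster of its closest labeled point. The latent coordinates below are synthetic Gaussian blobs standing in for the VAE embedding, and the counts (186 labeled, 1145 unlabeled) simply mirror the sizes reported above.

```python
import numpy as np

def assign_by_nearest_neighbor(Z_labeled, labels, Z_unlabeled):
    """Give each unlabeled latent point the cluster of its nearest labeled point."""
    d2 = ((Z_unlabeled[:, None, :] - Z_labeled[None, :, :]) ** 2).sum(-1)
    return labels[np.argmin(d2, axis=1)]

rng = np.random.default_rng(1)
# Toy stand-in for the VAE latent coordinates of the 186 labeled peptides,
# already grouped into good / intermediate / poor assemblers (0 / 1 / 2).
Z_labeled = rng.normal(size=(186, 2)) + np.repeat([[0, 0], [4, 0], [0, 4]], 62, axis=0)
labels = np.repeat([0, 1, 2], 62)

Z_rest = rng.normal(size=(1145, 2)) * 2          # the remaining unlabeled peptides
cluster = assign_by_nearest_neighbor(Z_labeled, labels, Z_rest)
```

With the assignments in hand, averaging the composition vectors within each cluster reproduces the kind of per-cluster analysis described above.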

4.2.Thermodynamic properties of alkanes

Alkanes are among the most important petroleum-based materials in chemical engineering, and obtaining their thermodynamic properties is necessary for chemical engineering design and the discovery of new compounds. Sun and his group [46] developed a high-throughput force field simulation (HT-FFS) procedure. Combined with machine learning and neural networks, HT-FFS is used to calculate and predict the thermodynamic properties of alkanes. They calculated several selected properties of 876 common alkanes at 49,044 state points using molecular simulations (Fig. 8). The simulations were verified by comparison with existing experimental data from the NIST standard reference database [71]. The structural descriptors were calculated with the OpenBabel package [148]. Together with the temperature and pressure representing the thermodynamic state point, the descriptor list has 25 initial descriptors. The number of descriptors was then reduced using a recursive feature elimination (RFE) procedure with a linear support vector machine (SVM) regressor. The key descriptors may differ between properties. For example, the numbers of tertiary and quaternary carbons are the most sensitive features for Cp, corresponding to the influence of the dispersion energy caused by these atoms, while the number of methyl groups attached to quaternary carbons is the most decisive factor for density.
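The RFE idea is simple to sketch: refit a linear model, drop the feature with the smallest (standardized) coefficient magnitude, and repeat. The toy below uses ordinary least squares in place of the linear SVM regressor used in Ref. [46], and the 25-descriptor data set is synthetic, with only two informative features planted at indices 4 and 7.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Recursive feature elimination with an ordinary least-squares model:
    repeatedly refit and drop the feature with the smallest |coefficient|."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        Xs = X[:, keep]
        Xs = (Xs - Xs.mean(0)) / Xs.std(0)       # scale so coefficients compare
        coef, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
        keep.pop(int(np.argmin(np.abs(coef))))   # eliminate the weakest descriptor
    return keep

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 25))                   # 25 initial descriptors (toy)
y = 3 * X[:, 4] - 2 * X[:, 7] + 0.1 * rng.normal(size=300)  # only 2 matter
print(rfe_linear(X, y, n_keep=2))                # → [4, 7]
```

The same loop structure applies with any regressor that exposes per-feature weights; scikit-learn's RFE wraps exactly this pattern.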

The Sun group then developed an FNN model (Fig. 9) to predict selected thermodynamic properties, including the density (ρ), intermolecular energy (Ei), and isobaric heat capacity (Cp). The training of the model was based on the HT-FFS results, which were divided into three parts: the training set (70%), the validation set (20%), and the test set (10%).
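A minimal FNN of the kind sketched in Fig. 9 fits in a few lines of NumPy. The single hidden layer, the 27 toy inputs (standing in for 25 descriptors plus T and p), and the synthetic target below are illustrative assumptions, not the architecture or data of Ref. [46].

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 27))                   # 25 descriptors + T and p (toy)
y = np.tanh(X[:, 0]) + 0.3 * X[:, 1]             # toy "property" to regress

# One hidden tanh layer, linear output, trained by full-batch gradient
# descent on the (half) mean squared error.
W1 = rng.normal(size=(27, 16)) * 0.3
b1 = np.zeros(16)
W2 = rng.normal(size=16) * 0.3
b2 = 0.0

lr = 0.05
for step in range(2000):
    H = np.tanh(X @ W1 + b1)                     # hidden activations
    err = H @ W2 + b2 - y                        # prediction error
    gW2 = H.T @ err / len(X)                     # backpropagation by hand
    gb2 = err.mean()
    dH = np.outer(err, W2) * (1 - H**2)
    gW1 = X.T @ dH / len(X)
    gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

rmse = np.sqrt(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2))
```

In practice a deep-learning framework would replace the hand-written backpropagation, but the descriptor-in, property-out structure is the same.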

Fig. 8. The workflow of HT-FFS used in [46]. Reprinted (adapted) with permission from [46]. Copyright (2020) American Chemical Society.

Fig. 9. Schematic representation of an FNN with N-1 hidden layers. Reprinted (adapted) with permission from [46]. Copyright (2020) American Chemical Society.

The prediction performance was measured by the deviation between the predicted values and the HT-FFS results. For all three thermodynamic properties, ρ, Ei, and Cp, the FNN prediction is in excellent agreement with the simulation. The RMSEs on the test set are 3.1 × 10-3 g·cm-3 for ρ, 0.30 kJ·mol-1 for Ei, and 2.8 J·mol-1·K-1 for Cp, and the maximum absolute relative error (MARE) is less than 0.75% for all properties. Owing to the high-throughput procedure, much more data could be obtained to train the machine learning model in this case than in experimental studies. Since the size of the dataset influences the accuracy of the model, high-throughput methods and procedures are significant for data preparation. This case shows the potential of big-data-based machine learning for predicting the properties of compounds, even though the FNN is a relatively simple neural network. More importantly, similar workflows could be used for other systems and features.

5.Machine Learning Force Fields

A machine learning force field can investigate the thermodynamics of complex systems with the accuracy of quantum mechanical calculations and the efficiency of classical MD. These new force fields have been used to investigate the behavior of systems ranging from simple water [149,150] to complex organic and inorganic materials [151–154]. In the training process, quantum mechanical calculations are performed to generate a data set that contains a batch of atomic coordinates and corresponding properties such as the system energy. Descriptors are deployed to extract the local structural interrelations of multiple atoms from a many-body perspective for the machine learning force field [155]. Machine learning algorithms then fit the potential energy surface using the structural descriptors as input. Therefore, the machine learning force field can describe the complex interactions of a system accurately at a cost comparable to that of classical MD simulations. We will introduce the applications of machine learning force fields using two examples: (a) an all-atom machine learning force field for Li-Si alloys and (b) a coarse-grained machine learning force field for water molecules.

5.1.All-atom machine learning force field

Xu et al. [155] developed a machine learning force field for crystalline and amorphous Li-Si alloys with Li/Si ratios ranging from 0 to 4.2. MD simulations driven by this force field predict the volume change during the initial lithiation, consistent with experiments. The authors also reported that the force field could predict the bulk densities, radial distribution functions, and the diffusivity of Li in amorphous Li-Si systems both rapidly and accurately. The accuracy approaches that of quantum mechanical calculations, while the speed is 20 times faster than quantum mechanical MD simulations.

Five typical alloys, Li1Si64, Li1Si3, Li1Si1, Li13Si4, and Li54Si1, were used as the starting structures in the construction of the training data sets. The different Li/Si ratios in the data sets enable the machine learning force field to describe Li-Si alloys of various Li concentrations. A melting-quenching algorithm was used to generate amorphous samples of these compositions so that both crystalline and amorphous phases could be described.

Behler's atom-centered symmetry functions (ACSFs) [96,98] were used as descriptors to extract the local environment Di of each atom i. Di is then passed to a neural network to determine the atomic energy Ei of atom i, and the potential energy of the system is obtained by summing Ei over all atoms. The parameters of the network are optimized via supervised regression using the potential energies in the data set as the target. Fig. 10a shows a configuration of the optimized Li1Si64, and Fig. 10b depicts the scheme used to encode the local environment of the Li atom in Li1Si64.
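The descriptor-to-energy pipeline can be sketched as follows. The radial function follows the G2 form of Behler's ACSFs (a Gaussian of the pair distance damped by a cosine cutoff), but the configuration, η values, and network weights below are toy assumptions, not those of Ref. [155].

```python
import numpy as np

def g2_descriptor(positions, i, etas, r_s=0.0, r_cut=6.0):
    """Radial G2-type symmetry functions for atom i: a smooth, rotation- and
    permutation-invariant fingerprint of its local environment."""
    rij = np.linalg.norm(positions - positions[i], axis=1)
    rij = rij[(rij > 1e-8) & (rij < r_cut)]       # neighbours inside the cutoff
    fc = 0.5 * (np.cos(np.pi * rij / r_cut) + 1)  # cosine cutoff function
    return np.array([np.sum(np.exp(-eta * (rij - r_s) ** 2) * fc)
                     for eta in etas])

def atomic_energy(D, W, b, w_out):
    """Tiny per-atom network: one tanh hidden layer mapping descriptor -> Ei."""
    return np.tanh(D @ W + b) @ w_out

rng = np.random.default_rng(4)
pos = rng.uniform(0, 8, size=(32, 3))             # toy atomic configuration
etas = np.array([0.05, 0.5, 2.0])                 # descriptor width parameters

W, b, w_out = rng.normal(size=(3, 8)), np.zeros(8), rng.normal(size=8)
E_total = sum(atomic_energy(g2_descriptor(pos, i, etas), W, b, w_out)
              for i in range(32))                 # system energy = sum of Ei
```

Because every atom reuses the same network weights, the model is transferable across system sizes: adding atoms just adds terms to the sum.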

The new machine learning force fields are promising for molecular simulation in two respects. First, they lower the threshold for developing qualified force fields for complex systems and accelerate their development significantly. Second, they provide an avenue to investigate long-time-scale processes, such as reactions, at an accurate level. However, some issues remain to be solved during the development of machine learning force fields. One typical issue is how to build a data set with a reasonable data distribution; He and coauthors [155] proposed several methods to address it.

5.2.Machine learning force field for water molecules

A machine learning force field is also capable of describing free energy changes accurately. Capturing a long-time phase-change process is challenging for current all-atom force fields in an MD simulation. Molinero et al. [156,157] reported a coarse-grained simulation of water crystallization using the mW monatomic force field. With no explicit hydrogen involved, the CG model can accelerate the ice nucleation process by several orders of magnitude. The Landau free energy is used to represent the phase properties of the phase-changing model; however, expressing the dependence of the Landau free energy on the CG variables requires ad hoc human effort. Building on the all-atom machine learning force field, Zhang et al. [150] proposed a coarse-grained machine learning force field that obtains the Landau free energy directly from the coordinates of the CG particles.

They developed a coarse-grained machine learning force field for water in which each water molecule is treated as a CG particle centered at the position of its oxygen atom. The authors found that the oxygen correlation functions produced by the coarse-grained machine learning force field agreed well with those calculated by the all-atom machine learning force field. The coarse-grained force field runs about 7.5 times faster than the all-atom force field while retaining the accuracy of the structural description.

The data sets were constructed from MD simulations driven by both quantum mechanics and the all-atom machine learning force field for water. The descriptors are similar to those in the all-atom force field, but they are computed for the CG particles rather than for individual atoms. The local environment Di of each particle i is input into a neural network, and the sum of the outputs of the neural networks gives the total CG potential U. This total CG potential is the Landau free energy and is not directly available from the data sets. Therefore, the loss function of the neural networks was constructed through a force-matching algorithm [150]: the gradient of U with respect to the position of each CG particle is compared with the mean force experienced by the corresponding CG particle in a conditioned ensemble of the data set, where the samples in the ensemble share the same reduced CG variable set. The huge improvement in speed indicates that the coarse-grained machine learning force field has the ability to investigate the dynamic properties of large molecules such as polymers, and the explicit treatment of the Landau free energy enables it to capture phase-change processes in MD simulations (see Fig. 11).
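The force-matching construction above can be sketched schematically: parameterize a CG potential U, obtain forces as -∇U, and penalize the deviation from reference mean forces. The Gaussian pair potential and finite-difference gradient below are illustrative stand-ins for the neural-network free energy surface and analytic gradients of Ref. [150].

```python
import numpy as np

def cg_potential(R, sigma=1.0, eps=1.0):
    """Toy CG potential: a sum of pairwise Gaussian repulsions, standing in
    for the neural-network Landau free energy surface U."""
    d = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    iu = np.triu_indices(len(R), k=1)             # each pair counted once
    return eps * np.sum(np.exp(-d[iu] ** 2 / sigma**2))

def forces_from_potential(R, h=1e-5):
    """Forces as -dU/dR by central finite differences."""
    F = np.zeros_like(R)
    for i in range(R.shape[0]):
        for k in range(3):
            Rp, Rm = R.copy(), R.copy()
            Rp[i, k] += h; Rm[i, k] -= h
            F[i, k] = -(cg_potential(Rp) - cg_potential(Rm)) / (2 * h)
    return F

def force_matching_loss(R, F_ref):
    """Mean squared deviation between model forces and reference mean forces."""
    return np.mean((forces_from_potential(R) - F_ref) ** 2)

rng = np.random.default_rng(5)
R = rng.uniform(0, 3, size=(8, 3))                # CG particle coordinates
F_ref = forces_from_potential(R)                  # pretend reference forces
loss = force_matching_loss(R, F_ref)              # → 0 for a perfect model
```

Training would minimize this loss over the potential's parameters across the conditioned ensembles; here the "reference" is generated from the same potential, so the loss vanishes by construction.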

Fig. 10. (a) A snapshot of Li1Si64 used for the construction of the initial dataset. The Li atom is shown as a purple ball, while Si atoms are shown as blue balls. (b) A scheme of calculating the atomic energy of Li from its local environment using a neural network. Reprinted (adapted) with permission from [155]. Copyright (2020) American Chemical Society.

Fig. 11. (a) Schematic plot of extracting descriptors for CG particle i. O atoms are shown as red balls, while H atoms are shown as white balls. Purple balls, centered at the positions of the O atoms, denote CG particles of water molecules. (b) A scheme of calculating the Landau free energy of particle i from its descriptor using a neural network. Reprinted from [150], with the permission of AIP Publishing.

6.Perspective

The above examples have shown the promising role of machine learning in predicting and investigating the molecular thermodynamics of chemical engineering. Material design also benefits from the combination of machine learning and high-throughput simulations, and the machine learning force field enables the computation of thermodynamics in complex systems. However, the application of machine learning in chemical engineering remains at a primitive stage, mainly because of the characteristics of the datasets in this field. Obtaining the exact properties of compounds can be expensive, so unlike models in NLP or computer vision, prediction models in chemistry and materials science are often based on small datasets (thousands of data points or even fewer). The challenges include the extrapolation performance and the transferability of the models. Thus, establishing databases with high data diversity and reliability is of great significance for future machine learning applications, and we expect more applications that combine high-throughput simulation and machine learning. Another issue is the interpretability of machine learning models; for example, the weights of different nodes in deep neural networks are difficult to relate directly to the determinants of molecular properties. Emerging algorithms in computer science can provide a reference for chemical engineering. In recent years, concepts such as active learning, convolutional neural networks, and attention mechanisms have been adopted by chemists. These advanced methods provide new opportunities for broader applications of machine learning in molecular thermodynamics.

Machine learning potentials are another important future application, since they combine the advantages of quantum mechanics and all-atom molecular dynamics. Current machine learning potentials still face some challenges, such as the limited range of element types and their sensitivity to input configurations. Nevertheless, many machine learning potentials have been developed and successfully applied to different systems, and they are becoming a powerful supplement to computational chemistry.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Jiaqi Ding, Nan Xu, Dr. Yao Shi and Dr. Yi He acknowledge financial support from the National Natural Science Foundation of China (21676245 and 51933009), the National Key Research and Development Program of China (2017YFB0702502), and the Leading Innovative and Entrepreneur Team Introduction Program of Zhejiang (2019R01006). Manh Tien Nguyen, Dr. Qi Qiao and Dr. Qing Shao thank the financial support provided by the Startup Funds of the University of Kentucky.
