This preprint can also be viewed on ChemRxiv.
Knowing the pKa of a molecule is key to understanding its structure and reactivity. In medicinal chemistry, pKa values are used to predict a variety of important pharmacological properties, including solubility, hERG potassium channel blocking, phospholipidosis risk, CYP inhibition, and blood–brain-barrier penetration.1–6 Efficient and accurate pKa estimates play a pivotal role in the design of compounds with the desired physicochemical and pharmacological properties, and accordingly pKa prediction has been recognized as a problem of "immense interest"7 in computational chemistry for decades.8,9
The most common approach to pKa prediction is to develop an empirical quantitative structure–property relationship (QSPR) based on experimentally measured data and use this model to predict pKa values for new compounds.10 QSPR approaches are typically fast and display excellent accuracy for compounds like those in the training data, making them an appealing choice for high-throughput pKa prediction.11 However, these methods struggle to extrapolate to unseen compounds and are generally unable to capture factors that arise from specific configurations of molecules in three-dimensional space, like conformational or stereoelectronic effects.12
A contrasting approach is to directly predict pKa values using electronic structure theory, since microscopic pKa values are directly proportional to the ∆G of the underlying acid dissociation reactions. This strategy possesses certain advantages over QSPR approaches: conformational and stereochemical effects are naturally incorporated, and new regions of chemical space can be modeled without new training data. In recent pKa prediction challenges, approaches based on electronic structure theory have consistently been among the best-performing methods: in SAMPL6, submission xvxzd achieved a mean absolute error (MAE) of 0.58 pKa units and a root-mean-squared error (RMSE) of 0.68,13 while submission EC_RISM achieved an MAE of 0.53 and an RMSE of 0.73 in SAMPL7.14
Unfortunately, many electronic structure theory-based prediction approaches are too computationally costly for routine usage: in their paper describing the xvxzd submission to SAMPL6, Pracht and Grimme write that "a macroscopic pKa value for a molecule could be calculated within approximately one day on a 28 CPU node,"15 which corresponds to roughly $25–30 with current cloud computing prices.16 While semiempirical methods like PM617and GFN2-xTB18 have been used for rapid pKa prediction, the resulting workflows display lower accuracy and often require judicious choice of reference compound or significant reparameterization. This is unsurprising: calculation of acid dissociation constants is "demanding and arduous,"19 and both a sophisticated electronic structure method and a sizable basis set are required to consistently get accurate results even in the gas phase.20
Machine-learned interatomic potentials have recently emerged as an efficient alternative to conventional electronic structure theory.21 We envisioned that AIMNet2, which displays accuracy comparable to routine density-functional theory methods for main-group thermochemistry,22 might allow for rapid computation of pKa values while preserving the advantages of electronic structure-based approaches. Here, we report the successful implementation of this strategy into a workflow ("Rowan pKa"), which uses AIMNet2 to calculate microscopic pKa values with good accuracy and minimal empiricism.
Our workflow begins by iteratively adding or removing hydrogens to the molecule of interest using RDKit and quickly estimating the proton affinity for the conjugate acid/conjugate base pair. If (de)protonation is predicted to yield a pKa value close to the desired range, conformers are generated using the ETKDG algorithm23,24 and optimized using the MMFF forcefield. After removing redundant geometries, low-energy conformers are then optimized using the GFN2-xTB semiempirical method,25 single-point energies are calculated using the AIMNet2 model trained on ωB97M-D3/def2-TZVPP (henceforth abbreviated as AN2-ωB97MD3),22 and the lowest N undergo full optimization and frequency calculations at the AN2-ωB97MD3 level of theory to generate a free energy for each conformer, including all standard entropy- and symmetry-based corrections.26 (N is set to 10 by default but can be varied by the end user; vide infra.) Since AIMNet2 does not take solvation into account, ∆Gsolv is computed from a single-point GFN2-xTB calculation with the CPCM-X implicit water model.27
Following Pracht and Grimme, we combine energies from each conformer to generate a single hybrid ∆G for the conjugate acid/conjugate base pair and apply a quadratic free-energy relationship to convert this into pKa values.18 Initial application of this strategy provided disappointing results, with considerable systematic error observed between different classes of acids. Functional group-dependent systematic error is common in ab initio pKa prediction, and many different corrections have been employed: Schrödinger’s Jaguar pKa contains a different linear free-energy relationship for roughly a hundred different functional groups,28 while Jensen and co-workers employed different reference molecules for each class of compound under study.17 To minimize the number of empirical parameters employed in our workflow, we instead add an element- and valence-specific constant to ∆G for each conjugate acid/conjugate base pair. (The correction for divalent oxygen, i.e. deprotonation of ROH, is defined as zero to eliminate extra degrees of freedom.) These constants are determined by Levenberg–Marquardt least-squares minimization, not by comparison to any individual reference compound, and can be thought of as a correction for systematic errors in both thermochemistry and solvation. Similar corrections are often used to account for dispersion in mean-field electronic structure theory methods without substantially compromising the generality of the method.29
Workflow development and parameter optimization was conducted on a slightly modified version of the TR224 dataset, also used by Pracht and Grimme.18 The final Rowan pKa workflow contains 7 tunable parameters, shown in Table 1.
Table 1. Tunable Parameters for Rowan pKa | |
---|---|
Coefficients for Free-Energy Relationship | |
C0 | -123.242550 |
C1 | 0.50677935 mol/kcal |
C2 | -1.7401e-04 mol2/kcal2 |
Element-/Valence-Specific Constants (kcal/mol) | |
N3 | 3.91398 |
N4 | 5.13054 |
C4 | 6.03872 |
O2 | 0.00 (defined) |
S2 | -5.14438 |
Quantifying the accuracy of property prediction under realistic conditions is challenging: evaluation over a large dataset, while superficially appealing, often leads to exaggerated correlation coefficients that are not reproduced in smaller, less diverse populations.30 Instead, we follow Rich and Birnbaum in using "assay-stratified metrics" to evaluate the predictive ability of our pKa workflow.31 All datasets were scored using four metrics: mean absolute error (MAE), root mean square error (RMSE), the square of the Pearson correlation coefficient (r2), and Kendall’s tau (τ). Low values for MAE and RMSE reflect a high degree of absolute accuracy, while high values for r2 and τ demonstrate relative accuracy within a given dataset (arguably more predictive of performance under real-world usage).
We began by evaluating Rowan pKa on datasets from the SAMPL6 and SAMPL7 blind pKa prediction challenges run several years ago. The SAMPL6 dataset consists of 24 molecules designed to resemble fragments of kinase inhibitors, many of which contain multiple potential sites of (de)protonation.13 We computed microscopic pKa values starting from each molecule’s neutral state and compared these values to the relevant macroscopic pKa values, excluding doubly ionized states. Rowan pKa performed well, with an MAE of 0.94 and an RMSE of 1.21, comparable to the performance of commercial pKa prediction tools like ChemAxon’s Chemicalize and Schrödinger’s Jaguar and Epik.32 The high values obtained for r2 and τ also illustrate the predictive utility of Rowan pKa, although the large range of pKa values in the SAMPL6 dataset makes obtaining a high value for r2 relatively easy.
We next evaluated Rowan pKa on three datasets selected to showcase changes in amine basicity. In 2010, Müller and co-workers reported that introducing spiro-oxetane substituents in saturated N-heterocycles could attenuate the basicity of amines.33 We modeled their dataset using Rowan pKa, with the change that the bulky N-piperonyl group (added as a UV-absorbing handle for experimental convenience) was replaced with N-methyl in silico. Overall, good agreement between experimental and predicted pKa was observed, although the effect of β-oxetane substituents was systematically underestimated.
We next examined the ability of Rowan to recapitulate pKa values from the medicinal chemistry literature. In 1972, Miller, Doukas, and Seydel reported that N-aryl sulfonamides served as potent inhibitors of folate synthesis and found that the inhibitory activity could be correlated with the pKa of the sulfonamide N–H.36 The Rowan pKa workflow recapitulates the experimental trend in pKa values and thus also the observed trend in potency, illustrating the utility of pKa calculations in this system.
We next examined the ability of Rowan pKa to predict the pKa of macrocyclic peptides. Macrocyclic peptides inhabit different conformations from their linear congeners and frequently possess markedly different pKa values, with significant consequences for solubility and bioavailability. In 2018, Yudin and co-workers reported that the pKa of reduced amide bonds adjacent to oxadiazole rings was lowered in macrocycles, which they attributed to intramolecular hydrogen bonding that stabilizes the neutral amine.40 Rowan pKa was able to recapitulate the observed trend in pKa values (Figure 10), which would be considerably more challenging for QSPR-type methods that do not explicitly consider conformational effects.
Finally, we explored the suitability of Rowan pKa as a high-throughput method for pKa prediction. We selected the first 100 rings from Peter Ertl’s recently published library of medicinal chemistry-relevant ring systems as a test set (15.5 atoms/molecule on average) and investigated whether the accuracy of the conformation search/refinement process could be further reduced to increase throughput without significantly hurting the workflow’s accuracy.41 Two modes with reduced conformer searching were eventually developed: the key changes made for each new mode are summarized in Table 2. Modest gains in efficiency are possible, but at the cost of significantly increased error: if multiple low-energy conformers exist, these data suggest that they should be included.42
Table 2. Different Modes for Conformer Searching | |||
---|---|---|---|
Parameter | Careful | Rapid | Reckless |
# initial confs | 250 | 100 | 50 |
initial energy cutoff (kcal/mol) | 15.0 | 10.0 | 5.0 |
RMSD threshold (Ã…) | 0.10 | 0.25 | 0.50 |
# confs. (xTB) | 20 | 10 | 3 |
final energy cutoff (kcal/mol) | 5.0 | 5.0 | 3.0 |
# confs. (AIMNet2) | 10 | 3 | 1 |
final energy cutoff (kcal/mol) | 15.0 | 10.0 | 5.0 |
MAE (relative to "Careful") | 0.00 | 0.38 | 0.69 |
RMSE (relative to "Careful") | 0.00 | 1.17 | 1.66 |
Average Time (s)a | 23.05 | 16.01 | 14.13 |
In this work, we show that relatively aggressive approximations can be applied to "gold standard" approaches like Grimme’s xvxzd workflow while still maintaining useful accuracy. The xvxzd workflow consists of extensive metadynamics-based conformation searching in crest, geometry optimization with PBEh-3c in implicit solvent, single-point energy calculations using the double-hybrid functional DSD-BLYP-D3(BJ)/def2-TZVPD, and subsequent implicit solvent calculations using COSMO-RS; here, we replace crest with RDKIT’s ETKDG conformer generation algorithm, DSD-BLYP-D3(BJ) with AIMNet2, and COSMO-RS with CPCM-X.15 While Rowan pKa is not as fast as QSPR-based approaches or as accurate as high-level electronic structure theory-based approaches like xvxzd, the combination of accuracy, speed, and minimal empiricism achieved by our workflow is nevertheless appealing. We anticipate that the ability to obtain fast and reasonable pKa values for unseen regions of chemical space will be particularly useful in drug discovery. Owing to the lack of functional group-specific corrections, the pKa values computed by Rowan are ideally suited to downstream manipulation or scaling, which can further reduce systematic error where desired.
What limits the accuracy of the Rowan pKa workflow, and what can be done to improve it? At a high level, errors in ab initio microscopic pKa prediction can come from several different sources: (1) insufficient consideration of conformational effects or analysis of the wrong conformations, (2) poor description of the underlying gas-phase thermochemistry, or (3) inaccurate description of solvation. To achieve maximum efficiency, a Pareto-optimal pKa prediction workflow ought to balance errors from all three sources without "overspending" for a high-accuracy answer in only one area. In specific cases, we have been able to track down anomalous predictions made by Rowan pKa to all three underlying potential sources of error—conformational searching, gas phase thermochemistry, and implicit solvent model—indicating that no factor is solely responsible for the observed error. While improvements to conformational generation44 and machine-learned interatomic potentials45 are easy to envision based on current research, improvements to implicit solvation may prove to be more elusive: despite repeated calls for alternatives to "scandalous" implicit solvation,46 no practical and general alternatives have yet been developed. Recent work in coarse-grained machine-learned potentials may point towards a new paradigm for handling solvent effects, but more work is needed to investigate the strengths and limitations of such approaches.48,49
More generally, the approach described in this paper divides the task of predicting pKa into two separate components: (1) a general pre-trained model which maps molecular structures to energies, thus "learning chemistry" at a high level, and (2) an application-specific workflow which translates geometries and energies generated by the pre-trained model into predictions relevant for the task at hand. This partitioning has certain advantages: the general pre-trained model (here AIMNet2) is systematically improvable and can be trained using purely in silico data, in principle allowing arbitrarily high levels of accuracy to be attained, while the application-specific workflow relies largely on conventional computational chemistry techniques and does not require extensive experimental training data like ML-based QSPR methods. As the accuracy of machine-learned interatomic potentials improves, we anticipate that the strategy demonstrated here—using machine-learned interatomic potentials as drop-in replacements for quantum chemistry—will become a general strategy for accelerating computational chemistry.
The authors acknowledge Abba Leffler, Diego Diaz, Eugene Kwan, Joe Gair, Pat Walters, and Philipp Pracht for helpful discussions.