Tracking External Boltz-2 Benchmarks

by Corin Wagen · Jul 1, 2025

Three weeks ago, a team of scientists from MIT and Recursion released Boltz-2, a co-folding model which not only predicts the structure of bound protein–ligand complexes but also "approaches the accuracy of FEP-based methods" for binding-affinity prediction. This is an extraordinary claim, and one which prompted thousands of scientists (including us) to start investigating Boltz-2 for structure-based drug design. (For a more detailed look at how Boltz-2 works and its potential uses, read our full FAQ.)

Over the past few weeks, a variety of scientific teams have disclosed external benchmarks of Boltz-2. This field is moving incredibly fast, so these benchmarks are hard to keep track of: some happen on LinkedIn, while others are on X or various blogs around the Internet. To make it easier for our users to keep track of the latest updates surrounding Boltz-2, we've compiled the most relevant data on this page. Although it's still early—it hasn't even been a month since Boltz-2 was released—the model's strengths and limitations are gradually becoming clear. (Note: we're excluding random posts of single structures here, since most of these lack clear systematic comparisons to experiment.)

This is a living document and will be updated as additional benchmarks are released. This page was last updated on July 1st.

PL-REX Benchmark (Semen Yesylevskyy)

This benchmark, posted on LinkedIn a week ago, evaluates the performance of Boltz-2 against a variety of physics- and ML-based methods on the 2024 PL-REX dataset. This is a "best case" scenario for physics-based methods, since the structure of the protein–ligand complex is known with relatively high confidence for these systems.

Yesylevskyy compared how well each method ranks the relative affinities of different binders, as measured by the Pearson correlation coefficient with experimental values. He found that the SQM 2.20 method (for which the PL-REX dataset was developed) significantly outperformed all other methods, with Boltz-2 coming in second place.

Comparison of a variety of methods on the PL-REX binding-affinity benchmark.

Here's what Yesylevskyy has to say about this:

Boltz-2 scores the second being only 5-7% better than the closest ML competitor ΔvinaRF20 and the closest physics-based competitors GlideSP and Gold ChemPLP. Boltz-2 is still far cry below SQM2.20 and only reaches mean correlation of ~0.42 with experimental values... So, according to this test, Boltz-2 is only an incremental improvement over existing affinity prediction techniques rather than a revolution. Moreover, its inference speed was rather disappointing in our tests being an order of magnitude slower than conventional docking programs such as Vina or Glide.
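
To make the headline metric concrete, here is a minimal Python sketch of a mean per-target Pearson correlation. It assumes that the quoted "mean correlation" is an average of per-target coefficients between predicted and experimental affinities; the function names and the toy numbers are hypothetical and are not taken from PL-REX or from Yesylevskyy's analysis.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_per_target_r(results):
    """Average the per-target Pearson r over a {target: (experimental, predicted)} dict."""
    return mean(pearson_r(exp, pred) for exp, pred in results.values())


# Hypothetical affinities in kcal/mol, for illustration only (not PL-REX data).
toy_results = {
    "target_a": ([-9.1, -8.4, -7.9, -7.0], [-8.2, -8.0, -7.9, -7.5]),
    "target_b": ([-10.2, -9.0, -8.1, -6.5], [-8.0, -8.3, -7.6, -7.8]),
}
print(f"mean Pearson r = {mean_per_target_r(toy_results):.2f}")
```

Averaging per target, rather than pooling all ligands together, keeps a single large target from dominating the score, which is one reason a method can look better or worse depending on how the correlation is aggregated.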

It's worth noting that although SQM 2.20 performs well on this benchmark, a similar semiempirical method was recently shown to perform poorly on the ULVSH virtual screening dataset.

Uni-FEP Benchmark (Xi Chen)

On LinkedIn, Xi Chen and co-workers from Atombeat recently disclosed benchmark results for Boltz-2 on the Uni-FEP dataset. This benchmark set comprises approximately 350 proteins and 5800 ligands.

Chen reports that Boltz-2 gives "consistently strong results — measured by both correlation terms and mean error terms— across 15 protein families," including cases where conformational effects are significant, like GPCRs and kinases. Unfortunately, Boltz-2 significantly lagged FEP in cases where buried water was known to be important, a sign that these effects are not implicitly accounted for by the model:

Chen's Boltz-2 benchmarks with buried water.

Comparison of Boltz-2 to FEP in cases where buried water is important.

Another interesting observation is that Boltz-2 consistently underestimates the spread of binding affinities present in experimental data. In the below two cases, the predicted range of binding affinities is significantly tighter than either the observed experimental values or the predictions from the conventional physics-based FEP workflow:

Chen's Boltz-2 benchmarks showing affinity compression.

Comparison of Boltz-2 to FEP, illustrating the propensity of Boltz-2 to compress affinity values.

Here's what Chen has to say:

One general trend we observed — independent of specific targets — is Boltz-2's tendency to predict binding affinities within a narrow range, typically within 2 kcal/mol. Figures 5a and 5b illustrate examples. We found this behavior on 75 of the 350 targets evaluated. For 21 of those, the experimental binding affinities spanned more than 4 kcal/mol — yet Boltz-2 clustered predictions near the mean, effectively “regressing to the center.”

Similar observations were recently reported by John Parkhill on X.
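
The compression Chen describes is easy to check for on your own data. The sketch below flags a target when the predicted affinities fall within a 2 kcal/mol range and marks it as compressed when the experimental values for the same ligands span more than 4 kcal/mol. The cutoffs mirror the numbers in the quote above, but the function, class, and field names are hypothetical, and this is not Chen's actual analysis code.

```python
from dataclasses import dataclass


@dataclass
class SpreadReport:
    exp_range: float         # spread of experimental affinities (kcal/mol)
    pred_range: float        # spread of predicted affinities (kcal/mol)
    narrow_prediction: bool  # predictions fall within the narrow cutoff
    compressed: bool         # narrow predictions despite a wide experimental spread


def spread_report(experimental, predicted, narrow_cutoff=2.0, wide_cutoff=4.0):
    """Compare the spread of predicted vs. experimental affinities for one target."""
    if not experimental or not predicted:
        raise ValueError("need at least one experimental and one predicted value")
    exp_range = max(experimental) - min(experimental)
    pred_range = max(predicted) - min(predicted)
    narrow = pred_range <= narrow_cutoff
    return SpreadReport(exp_range, pred_range, narrow, narrow and exp_range > wide_cutoff)


# Hypothetical affinities in kcal/mol, for illustration only.
experimental = [-11.5, -10.0, -8.2, -6.9]
predicted = [-9.3, -9.0, -8.6, -8.1]
report = spread_report(experimental, predicted)
print(report)
```

A range-based check like this is sensitive to single outliers; comparing standard deviations instead would be a reasonable alternative for noisy datasets.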

Six Protein–Ligand Systems (Tushar Modi et al.)

Tushar Modi and co-workers at Deep Mirror recently disclosed benchmarks for six protein–ligand systems. Their overall conclusions were that Boltz-2 did well for stable and rigid systems, but struggled with ligand geometries or in cases where conformational flexibility was important:

Boltz-2 often has difficulty when a protein must undergo a big shape change or has multiple mobile domains with little precedent in the training data. If a protein needs to bend into a new shape to accommodate a ligand (like the allosteric changes in PI3K-α or WRN, or the dynamic binding required in cGAS), the unguided model usually fails to predict that rearrangement. These cases often require additional help—such as supplying a template of the alternate conformation or running a refinement step—to obtain the correct pose.

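Note that this conclusion is the exact opposite of what Xi Chen noted above: Chen reported strong results even for targets where conformational effects are significant, like GPCRs and kinases.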

Conclusions

While this field is moving fast, some tentative conclusions can be drawn. Here's our current thinking on Boltz-2:

When used properly, it's likely that Boltz-2 can be a very useful tool in the drug-discovery arsenal; but it's not a solution in isolation, and likely needs to be embedded in a proper virtual-screening workflow to give useful results.

Addendum: Chai-2

Yesterday, Chai-2 was released. Although minimal technical details were disclosed, Chai-2 appears to be a co-folding-based workflow involving a sequence of models and physics-based steps that can be used for zero-shot antibody design. Working with Adaptyv Bio, the Chai-2 authors reported a 50% wet-lab success rate against a panel of 52 diverse protein targets; the full technical report gives more target details.

Visual summary of Chai-2.

Figure 1 from the Chai-2 technical report.

Since Boltz-1 and Chai-1 were virtually clones, it's interesting to reflect on the ways these two projects have evolved. Boltz-2 has focused on small molecules and binding-affinity prediction within a single model, while Chai-2 has expanded into an entire end-to-end pipeline and seems to be focusing on antibody/nanobody design. It will be interesting to see where both projects go next!
