Improving Rowan's Performance on the OpenBind EV-A71 Release

by Corin Wagen · May 20, 2026

Last week, we posted our day-one results from the OpenBind data release; we selected a 76-compound subset, used analogue docking to generate poses from a random template structure, and ran Rowan's RBFE workflow against these poses to predict relative free energy values. (The input structures are available on GitHub.)

Our initial results were mixed. We successfully confirmed the sanity of our end-to-end pipeline: over 50% of our selected poses were within 1 Å of the crystallographic pose, and RBFE calculations starting from both the crystallographic and docked poses were within 1 kcal/mol MAE of the experimental values (ruling out any catastrophic failure scenarios). On the other hand, the results were predictively useless, with essentially no ability to rank-order binders.

An external observer might reasonably ask: why put such bad results on your blog? Doesn't this make your software look terrible? We share these results because we think doing so will be helpful to our customers. While we'd prefer Rowan's FEP to work perfectly out of the box every single time, this unfortunately doesn't happen in real life, and it's common to find cases where protocol tuning is needed to get useful FEP results. Transparently documenting this process should be useful to external observers.

Here's what we wrote in the original post:

"We expect that further protocol tuning or FEP improvements will be able to produce improved results; we haven't done any project-specific parameter tuning here and report these results simply as a zero-shot baseline about what's possible out of the box with FEP."

"It's entirely possible that trivial modifications to our protocol will lead to improved performance. We've worked on this project for fewer than 24 hours, and report these data with the hope that others will do the same and push the field forward. Benchmarks drive scientific innovation, and tough and diverse benchmarks like OpenBind are exactly what's needed to push the free-energy field towards increased real-world predictive accuracy."

In this post, we'll share how we achieved useful FEP performance on a subset of our data: what worked, what didn't work, and what we plan to investigate moving forward.

Simple Protocol Changes

We first tried simply adjusting the RBFE protocol along a few dimensions, keeping the same 76 docked poses (pKD range 5.03–7.94) and the same 138-edge RBFE graph:

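| run | main change | RMSE (kcal/mol) | MAE (kcal/mol) | Pearson | Spearman | Kendall |
|---|---|---|---|---|---|---|
| 1 | (baseline) | 1.064 | 0.810 | 0.071 | 0.065 | 0.039 |
| 2 | AM1BCC charges | 1.108 | 0.863 | -0.071 | -0.104 | -0.073 |
| 3 | AM1BCC + 4 ns windows | 1.071 | 0.839 | -0.011 | -0.015 | -0.009 |
| 4 | AM1BCC + 350 local steps | 0.987 | 0.791 | 0.115 | 0.032 | 0.029 |
| 5 | AM1BCC + no local steps | 0.913 | 0.701 | 0.304 | 0.273 | 0.187 |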
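
For readers who want to compute the same summary columns on their own predictions, here is a minimal plain-Python sketch of the five statistics. It is not Rowan's implementation: the function names are illustrative, the example numbers are made up for demonstration (they are not OpenBind data), predictions are assumed to already be on the same scale as the experimental values, and Kendall's tau is computed as tau-a (no tie correction), which may differ from the variant used for the table.

```python
import math

def rmse(pred, expt):
    """Root-mean-square error between predictions and experiment."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, expt)) / len(pred))

def mae(pred, expt):
    """Mean absolute error between predictions and experiment."""
    return sum(abs(p - e) for p, e in zip(pred, expt)) / len(pred)

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

# Made-up values purely for illustration (kcal/mol).
expt = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 3.0, 2.0, 4.0]
summary = {
    "RMSE": rmse(pred, expt),
    "MAE": mae(pred, expt),
    "Pearson": pearson(pred, expt),
    "Spearman": spearman(pred, expt),
    "Kendall": kendall_tau_a(pred, expt),
}
```

The error metrics (RMSE, MAE) and the ranking metrics (Spearman, Kendall) answer different questions, which is why a run can sit within 1 kcal/mol MAE and still have essentially no rank-ordering ability.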

Pyrrolidine Subset Analysis

At this point, it seemed unlikely that further protocol changes were going to magically give us great performance on the full 76-compound set. We looked through the compounds to see if some smaller subset of the data was well-described at the "rigorous" level of theory, which would let us zoom in and better understand where these errors were coming from.

We identified a 32-compound subset sharing a common pyrrolidine scaffold, which seemed to perform better at the rigorous level of theory (Spearman rho of 0.67, Kendall tau of 0.51). To double-check our poses, we also redocked all compounds to x7026b (shown below), a relatively unsubstituted member of the series.

Figure: Compound x7026b, a simple member of the pyrrolidine series.

The resulting poses overlaid very well, with only the substituents on the pyrrolidine really varying.

Figure: The selected pose for all 32 pyrrolidine binders, overlaid.

We ran eight more RBFE runs focusing on the 32 pyrrolidine structures.

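| run | pose source | settings | RMSE (kcal/mol) | MAE (kcal/mol) | Pearson | Spearman | Kendall | Runtime (h) |
|---|---|---|---|---|---|---|---|---|
| 6 | docked | default/NAGL | 1.119 | 0.870 | 0.281 | 0.228 | 0.157 | 4.38 |
| 7 | docked | default/AM1-BCC | 1.032 | 0.817 | 0.415 | 0.350 | 0.270 | 4.92 |
| 8 | docked | 350 local (2 nm)/AM1-BCC | 1.067 | 0.860 | 0.344 | 0.274 | 0.206 | 8.93 |
| 9 | docked | rigorous/AM1-BCC | 1.062 | 0.838 | 0.348 | 0.361 | 0.278 | 18.72 |
| 10 | redocked | rigorous/AM1-BCC | 1.204 | 0.930 | 0.212 | 0.087 | 0.071 | 18.39 |
| 11 | crystal | default/NAGL | 1.268 | 0.981 | -0.006 | -0.010 | -0.012 | 4.63 |
| 12 | crystal | default/AM1-BCC | 1.197 | 0.946 | 0.087 | 0.038 | 0.046 | 5.21 |
| 13 | crystal | rigorous/AM1-BCC | 1.026 | 0.746 | 0.426 | 0.464 | 0.399 | 20.06 |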

Overall, the best run was the rigorous run starting from crystallographic poses (run 13), although other protocols achieved non-zero accuracy (better than in the full 76-compound set above). Here's the scatter plot for the best run—not amazing, but there's now some predictive power.

Figure: Parity plot of the results from RBFE Run #13.

Pose Error

The above data show that the crystallographic poses were consistently better than the docked poses (and different docking runs gave very different results). We wanted to understand this more: since prospective FEP usage can rarely benefit from extensive crystallographic support, getting a robust pose-preparation pipeline is critical.

To start, we used Rowan's strain workflow to check the strain of both sets of poses (the full set, not just the pyrrolidines). Here's a visual summary of the strain of the docked poses—even without doing any particularly advanced analysis, you can see that there's a substantial number of compounds with strain above 5 kcal/mol, and even a few with strain over 10 kcal/mol.

Figure: Strain for docked poses.

In contrast, only a single crystallographic pose has a strain above 5 kcal/mol:

Figure: Strain for crystallographic poses.

This suggests that the crystallographic poses don't just happen to be better for FEP. Instead, they're better because they're systematically lower in energy and generally more physical. The bad news is that our analogue-docking workflow isn't working perfectly, but the good news is that the quality of our FEP runs will improve if we can improve the pose-preparation pipeline.

We zoomed in on a few cases with particularly large changes. The predicted binding affinity of compound x7161a changed a lot between docked and crystallographic poses (0.8 kcal/mol or so, depending on protocol), which could be ascribed to a change in the conformation of the pendant methoxymethyl group:

Figure: Comparison of the docked and crystallographic poses for x7161a.

Compound x7247a had the largest change between docked and crystallographic poses (3.66 Å RMSD), essentially flipping the pyrrolidine and moving a pyrazole substituent from one face to the other:

Figure: Comparison of the docked and crystallographic poses for x7247a.
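
As a rough illustration of what a pose RMSD like the 3.66 Å above measures, here is a minimal sketch of an in-place RMSD between two poses of the same ligand. The post does not say exactly how the RMSD was computed, so this makes several assumptions: both poses sit in the same protein reference frame (no re-alignment), the atoms are listed in the same order, and no symmetry correction is applied. The coordinates are toy values, not taken from the OpenBind structures.

```python
import math

def in_place_rmsd(coords_a, coords_b):
    """RMSD between two poses given as lists of (x, y, z) tuples.

    Assumes identical atom ordering and a shared reference frame,
    so no superposition is performed before comparing.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy three-atom example (coordinates in Å), for illustration only.
docked = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
crystal = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 2.0, 0.0)]
toy_rmsd = in_place_rmsd(docked, crystal)
```

Skipping the alignment step is deliberate here: for judging docked poses against a crystal structure, a ligand that has the right shape but sits in the wrong place should still count as wrong.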

Overall, the data support the idea that pose preparation is a significant source of error for these results, although even crystallographic poses don't lead to stellar RBFE performance. It's worth noting that the big issue here isn't pose generation, it's selection—we're able to generate a large ensemble of poses, we just don't yet have a good way to tell which one to use for FEP calculations.

Economic Impact

Stepping back: is this worth it in real drug-discovery programs? The best FEP run ranked compounds with an accuracy of 0.464 (Spearman \(\rho\)) and cost about $550 in GPU time, or about $17 per compound. The best accuracy obtained without crystallographic poses was 0.361 (Spearman \(\rho\)), which more closely mimics how FEP would be used in the real world.

Quantifying economic value is tough, but the extremes are obvious: an FEP protocol that ranks compounds with \(\rho = 0.00\) is clearly useless and a waste of money, while a protocol that ranks things perfectly (\(\rho = 1.00\)) would be very useful.

To try and model the cases in the middle, we built a toy simulation model that takes unknown "true" rankings and generates imperfect computational rankings with the same Spearman correlation observed in our RBFE benchmark. (Implementation details below for the curious.) We then ask: if we synthesize the top 10 compounds from the imperfect ranking, how many of the true top-10 compounds will actually be in that set? (We'll call this enrichment factor \(h\).)

Here's what the model predicts. At \(\rho = 0.0\), \(h = 1.0\) because we're just picking randomly (10 picks from a pool of 100 virtual compounds recover one true top-10 compound on average), while at \(\rho = 1.0\) we're getting all 10. At the more realistic value of \(\rho = 0.36\), we predict that \(h = 2.54\). We're getting some enrichment, but most "good virtual compounds" still aren't that good.

Figure: Dependence of \(h(\rho)\) on \(\rho\).

Now to try and figure out if this is worth it. If \(h_\text{RBFE}\) is the number of good compounds selected using RBFE and \(h_\text{base}\) is the number of good compounds selected without RBFE, then the net dollar value per cycle is:

\[ C_\text{L} \cdot k \cdot \left(\dfrac{h_\text{RBFE}}{h_\text{base}} - 1\right) - N \cdot C_\text{R} \]

where \(C_\text{L}\) is the all-in synthesis and assay cost per wet-lab compound, \(N\) is the number of virtual compounds screened, \(C_\text{R}\) is the cost of running an RBFE calculation, and \(k\) is the number of compounds actually synthesized and tested.

I'll choose some simple numbers here:

All together, versus random selection, we find that FEP creates about $28,700 of net value per 100-compound design cycle, or roughly $2,900 per compound actually synthesized in the 10-compound experimental tranche.

(The maximum value in this model is $178,000, corresponding to essentially getting 90 compounds or $180,000 worth of screening for only $2000 of in silico screening.)
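
The arithmetic behind these figures can be sketched in a few lines of Python. The input values below are assumptions back-calculated from numbers quoted in the text rather than taken from an explicit parameter list: a 100-compound design cycle, a 10-compound experimental tranche, $20 per RBFE candidate, and $2,000 per wet-lab compound ($180,000 for 90 compounds). The function and variable names are illustrative.

```python
def net_value_per_cycle(h_rbfe, h_base, c_lab, k, n_virtual, c_rbfe):
    """Net dollar value per cycle: C_L * k * (h_RBFE / h_base - 1) - N * C_R."""
    return c_lab * k * (h_rbfe / h_base - 1.0) - n_virtual * c_rbfe

# Assumed inputs, inferred from figures quoted in the post.
C_LAB = 2_000.0   # $ per synthesized and assayed compound ($180,000 / 90)
K = 10            # compounds actually synthesized per cycle
N_VIRTUAL = 100   # virtual compounds screened per cycle
C_RBFE = 20.0     # $ per RBFE calculation

H_RBFE = 2.54     # recovery at rho = 0.36, as reported in the text
H_RANDOM = 1.0    # recovery when picking at random

# Value against random selection, using the rounded h = 2.54.
value_vs_random = net_value_per_cycle(H_RBFE, H_RANDOM, C_LAB, K, N_VIRTUAL, C_RBFE)

# Upper bound: a perfect ranking recovers all 10 true top-10 compounds.
max_value = net_value_per_cycle(10.0, H_RANDOM, C_LAB, K, N_VIRTUAL, C_RBFE)

# Baseline recovery at which the net value drops to zero.
breakeven_h_base = H_RBFE / (1.0 + N_VIRTUAL * C_RBFE / (C_LAB * K))

# Per-compound RBFE cost at which the net value drops to zero (vs random).
breakeven_c_rbfe = C_LAB * K * (H_RBFE / H_RANDOM - 1.0) / N_VIRTUAL
```

With the rounded \(h = 2.54\), this sketch gives $28,800 against random selection and a break-even RBFE cost of $308, slightly above the $28,700 and $307 quoted in the text; the gap is consistent with the unrounded \(h\) being a little below 2.54. The $178,000 maximum and the break-even baseline of about 2.31 are reproduced directly.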

Of course, this example is a bit rigged, and "drug discovery without FEP" does not usually imply "pick compounds at random." Medicinal chemists, docking, property filters, and lower-cost predictive models already provide some prioritization signal. The key question is how good that existing triage process is. In this toy model, the \(\rho = 0.361\) FEP protocol remains value-positive at the assumptions above as long as the baseline process recovers fewer than about 2.31 of the true top-10 compounds per cycle, corresponding to a rank correlation of roughly \(\rho \approx 0.32\). Against a weaker \(\rho = 0.20\) baseline, FEP is worth about $6,900 per cycle; against a \(\rho = 0.30\) baseline, it is still slightly positive at about $800 per cycle.

Figure: The expected value of FEP based on \(\rho\).

The takeaway is not that FEP is always worth running: every company, program, and target is different, and nobody should use the above simplistic model as "proof" that FEP will or won't work for them. Rather, this model demonstrates that even moderately accurate FEP can be economically valuable when wet-lab slots are expensive and the existing prioritization stack is meaningfully weaker than the FEP model. At $20 per candidate, the computational cost is small enough that even modest improvements in experimental hit recovery can be worth it. (In fact, in the naïve model, per-compound RBFE costs of up to $307 are still expected-value positive.)

Conclusions

First, we hope that this post illustrates what running FEP in the real world can be like—it's more complicated than the oft-cited standard benchmarks might make it appear! If your initial FEP runs don't give the accuracy that you're hoping for, there are a lot of knobs to tune to try and recover useful performance (even apart from any improvements that we make here at Rowan). Different poses, different simulation settings, and different subsets of compounds can all lead to major differences in the utility of RBFE.

Second, we hope that this post is encouraging. Even if your FEP run doesn't give perfect rank correlation, running RBFE at scale can still be quite useful, provided that the extra expense of the software license and the skills needed to run RBFE don't erase these gains. That's part of why we're committed to making Rowan's FEP offering as fast, cheap, and easy to use as possible: making FEP cheaper, faster, and better will let more scientists and companies benefit from increased predictive power.

Third, this exercise helps us decide what we're going to work on next at Rowan. We would like Rowan FEP to work as well as possible, and we're planning to address some of the issues that this OpenBind dataset has highlighted:

If you're in early-stage drug discovery and you're interested in building or testing any of these features in conjunction with our team, please reach out! We love working alongside drug-discovery scientists to validate and stress-test new scientific functionality.

Appendix: Generating Realistic Imperfect Rankings

The model needs a way to generate an imperfect computational ranking with a specified Spearman rank correlation \(\rho\) to the unknown "true" experimental ranking.

A convenient way to do this is:

  1. Generate a latent "true experimental score" \(X\) for each compound.
  2. Generate a latent "computational score" \(Y\) that is correlated with \(X\).
  3. Rank compounds by \(X\) to define the true experimental ordering, and by \(Y\) to define the computational ordering.

We assume the latent scores are jointly normal:

\[ X \sim \mathcal{N}(0, 1) \]

\[ Y = rX + \sqrt{1 - r^2} \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1) \]

where \(r\) controls how strongly the computational score tracks the true score.

Because our benchmark accuracy is reported as Spearman rank correlation rather than ordinary Pearson correlation, we choose \(r\) so that the resulting rankings have the desired Spearman \(\rho\). For jointly normal variables, population Spearman correlation and latent Pearson correlation are related by \(\rho = (6 / \pi) \arcsin(r / 2)\), so we set \(r = 2 \sin(\pi \rho / 6)\).
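
As a quick worked check of this conversion, plugging in the best docked-pose accuracy from the benchmark (\(\rho = 0.361\)) gives a latent Pearson correlation only slightly larger than the target Spearman value:

```latex
\[
r = 2 \sin\!\left(\frac{\pi \rho}{6}\right)
  = 2 \sin\!\left(\frac{\pi \cdot 0.361}{6}\right)
  \approx 2 \sin(0.1890)
  \approx 0.376
\]
```

The endpoints behave as expected: \(\rho = 0\) gives \(r = 0\), and \(\rho = 1\) gives \(r = 2 \sin(\pi / 6) = 1\).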

Once those paired rankings are generated, the rest is simple:

This gives the expected recovery value \(h(\rho)\).
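
The procedure can be sketched as a short Monte Carlo simulation. This is one plausible reading of the recipe, not the code used for the post: the pool size of 100 and the top-10 cutoff are inferred from the 100-compound cycle and 10-compound tranche discussed earlier, the number of trials and the seed are arbitrary, and the function name `simulate_h` is hypothetical.

```python
import math
import random

def simulate_h(rho, n_compounds=100, k=10, n_trials=2000, seed=0):
    """Estimate h(rho): the mean number of true top-k compounds that
    appear in the top k of an imperfect ranking with Spearman ~ rho."""
    rng = random.Random(seed)
    # Convert the target Spearman rho to the latent Pearson r.
    r = 2.0 * math.sin(math.pi * rho / 6.0)
    noise = math.sqrt(max(0.0, 1.0 - r * r))
    total_hits = 0
    for _ in range(n_trials):
        # Latent "true" scores X and correlated computational scores Y.
        x = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
        y = [r * xi + noise * rng.gauss(0.0, 1.0) for xi in x]
        # Rank by each score (higher is treated as better here).
        by_x = sorted(range(n_compounds), key=x.__getitem__, reverse=True)
        by_y = sorted(range(n_compounds), key=y.__getitem__, reverse=True)
        true_top = set(by_x[:k])
        # Count how many of the selected compounds are truly top-k.
        total_hits += sum(1 for i in by_y[:k] if i in true_top)
    return total_hits / n_trials
```

Calling `simulate_h(0.0)` should return a value close to 1, `simulate_h(1.0)` returns 10, and `simulate_h(0.361)` should land in the neighborhood of the \(h \approx 2.5\) reported above, up to sampling noise from the finite number of trials.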

Thanks to GPT 5.5 for helping me work through the statistics here.
