Fast and Efficient Butina Splitting of Chemical Data with Chalcedon

by Eli Mann · May 26, 2026

The Council of Chalcedon.

Chalcedon is named for the ecumenical Council of Chalcedon held in 451 AD, which condemned miaphysitism and resulted in the first major schism in the Christian Church.

Last summer, we worked with former Rowan intern Vedanth on ML-based 3D molecular conformer ensemble generation. We wanted to use a Butina split based on Tanimoto distance between Morgan fingerprints to minimize data leakage between the train, validation, and test sets, following the recommendations in "Some Thoughts on Splitting Chemical Datasets" by Pat Walters.

In practice, however, we found that existing open-source implementations required a machine with multiple TB of memory to Butina-cluster the ≈300K-sample GEOM dataset by Bombarelli et al.! We knew that a more efficient method would be useful, so we invited the community to develop one in our call for open source projects. Since no one picked up this task, we wrote our own implementation.

Here we present Chalcedon, a fast, memory-efficient, and minimal-dependency Python package for Butina clustering and splitting chemical data. This implementation has linear memory scaling and is an order of magnitude faster than existing implementations.

The goal of this work is to make Butina splitting practical for routine use in machine-learning projects in chemistry by enabling fast execution on consumer hardware. We hope to expand Chalcedon with additional 2D and 3D clustering and splitting algorithms as needed.

How it works

We use Morgan fingerprints, a vector encoding of molecular topology and local atomic features, to represent molecules. When performing similarity and clustering operations on molecules, we are doing so on their fingerprint representations. While Morgan fingerprints are standard for Butina clustering, this method may be used with other kinds of chemical encodings as well.

Tanimoto similarity

Tanimoto similarity (or Jaccard index) measures feature overlap based only on co-occurrence of present features, ignoring shared absences. This is well suited to capturing signal in the sparse, high-dimensional vectors common to many molecular encodings.

Tanimoto similarity for binary vectors can be formulated:

T(A,B)=ABAB=ABA+BABT(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

or in English, the Tanimoto similarity of sets AA and BB is the ratio of binary features present in both AA and BB to the features present in AA or BB. The more AA and BB's encodings overlap, the greater the similarity score. Tanimoto distance is the complement of Tanimoto similarity: d(A,B)=1T(A,B)d(A,B) = 1-T(A,B).

Butina clustering

The Butina clustering algorithm is a greedy centroid-based clustering algorithm and can be understood as follows:

  1. Compute the pairwise distance between all datapoints.
  2. Select a distance threshold that defines whether 2 datapoints are neighbors (0.65 by default in our case, meaning molecules must be at least 35% similar to be neighbors).
  3. Sort datapoints by number of neighbors in descending order.
  4. Walk the sorted list: each still-unassigned datapoint becomes a cluster centroid and claims its still-unassigned neighbors as a cluster. Continue until every datapoint is assigned.

This algorithm is deterministic, easily interpretable, and determines cluster number based on the data distribution and similarity threshold.

Greedy cluster splitting

Once Butina clusters are assigned, the final task is to split the data such that train, validation, and test sets are chemically different. This better evaluates generalizability and minimizes data leakage. To assign clusters to dataset splits, we use the longest-processing-time (LPT) algorithm introduced by R. L. Graham for scheduling jobs while multiprocessing. Rather than balancing jobs by processing time, we use the same concept to balance splits by dataset size, greedily assigning the largest clusters first.

The algorithm is as follows:

  1. Determine the desired size for each subset based on a specified ratio (e.g. train = 80%, validation = 10%, test = 10%).
  2. Sort clusters by size in descending order.
  3. Starting with the largest cluster, assign each cluster to whichever subset is furthest below its target size.

This assignment is simple, allows for pre-determined split sizes (within 1 cluster of error), works for any number of clusters of arbitrary sizes, and balances clusters across splits to maintain intra-subset diversity.

Why it's fast and memory-efficient

Our implementation of Tanimoto similarity and Butina clustering is written entirely in NumPy. Performance comes from BLAS (basic linear algebra subprograms)-accelerated matrix multiplication and algorithmic choices to control memory layout.

Tanimoto similarity

Float32 precision. We use float32 precision throughout similarity calculations as sgemm (single-precision general matrix multiply) has ≈2x faster throughput and a smaller memory footprint than dgemm (double-precision general matrix multiply) in BLAS. For Tanimoto-similarity calculations on binary or count fingerprints, float32 still leaves plenty of precision headroom for realistic inputs, and further decreasing precision would not improve performance as BLAS implementations are optimized for float32 and float64 precisions.

Algebraic rearrangement of the cutoff comparison. We algebraically rearrange the Tanimoto-distance and predefined cutoff comparison to improve precision and avoid the creation of intermediary NumPy array allocations. The naive comparison function can be written:

1ABABcutoff1 - \frac{|A \cap B|}{|A \cup B|} \le \text{cutoff}

which in Python is expressed:

mask = 1 - (dot / union) <= cutoff

This expression evaluates left to right, creating 2 intermediate NumPy arrays under the hood:

tmp1 = dot / union
tmp2 = 1 - tmp1
mask = tmp2 <= cutoff

If we rearrange the expression to the algebraically equivalent

AB(1cutoff)AB|A \cap B| \ge (1 - \text{cutoff}) \cdot |A \cup B|

we can do the cutoff comparison before floating-point array operations to cut the ULP (unit in the last place, or unit of least precision) in half, and write results in-place to avoid the overhead of creating intermediary arrays:

# mask: pre-allocated bool array of shape (n,)
threshold = np.float32(1 - cutoff)
union *= threshold
np.greater_equal(dot, union, out=mask)

Precomputed per-row norms. We precompute the self dot-product A2|A|^2 for every fingerprint AA, so when we repeatedly evaluate ABA \cup B for every molecule pair, both norms are read from the cache rather than recomputed.

AB=A2+B2AB|A \cup B| = |A|^2 + |B|^2 - A \cdot B

Butina clustering

Chunked neighbor computation. In its standard form, the Butina clustering algorithm is memory intensive because it requires computing and persisting the full n×nn \times n pairwise distance matrix for cluster assignment. At nn = 50k, that's 2.5 billion pair distances, or about 100 GB worth of Python floats.

We avoid this quadratic memory scaling by reformulating the algorithm into three phases:

  1. Compute the pairwise similarities in batches, persist the number of neighbors for each molecule, and discard the actual similarity values.
  2. Sort molecules by descending neighbor count.
  3. Walk the sorted molecules in batches, recompute the similarity row for each molecule against the still-unassigned set, and claim its still-unassigned neighbors as a new cluster.

Peak working memory is bounded by one batch's workspace plus the O(n)O(n) counts.

Upper-triangle matrix traversal. Since the Tanimoto similarity matrix is symmetric, computing the full matrix is redundant. We only walk the upper triangle of the pairwise matrix, recursively halving diagonal blocks into off-diagonal sub-blocks so every matrix multiplication is a full rectangle (inspired by SciPy's ssyrk) and each pair is computed only once.

Shared workspace buffers. We preallocate array buffers, sized to fit the largest block we compute, to eliminate the overhead of repeated dynamic array allocation and keep memory flat and predictable across iterations.

How it performs

We compare four methods: the RDKit reference implementation, Chalcedon (chunked float32), Chalcedon (chunked float64), and Chalcedon (full matrix float32), which precomputes the entire distance matrix instead of doing chunked neighbor computation. All clustering algorithms were run on an AMD Ryzen 9 7950X3D CPU with 128 GB of DDR5 RAM. Peak RSS (resident set size) is the maximum physical memory the process held during the run.

Scaling

As a benchmark dataset, we randomly selected subsets from the GEOM dataset ranging from 10k to 100k samples. We run each Butina clustering algorithm with cutoff = 0.65 and ensure that the resulting clusters are identical across all methods.

Wall time and peak memory scaling of each method.

Wall time (s):

nChalcedon (chunked, float32)Chalcedon (chunked, float64)Chalcedon (full matrix, float32)RDKit
1000.0060.0100.0010.005
1,0000.0400.0370.0120.073
10,0000.3570.6280.2815.160
25,0001.6402.9101.49032.150
50,0005.75010.1805.740172.820
100,00021.14037.72022.490OOM

Peak RSS (GB):

nChalcedon (chunked, float32)Chalcedon (chunked, float64)Chalcedon (full matrix, float32)RDKit
1000.080.080.070.08
1,0000.110.140.090.12
10,0000.330.550.354.50
25,0000.671.181.0227.70
50,0001.232.253.16110.40
100,0002.354.3911.17OOM

At small dataset sizes, the Chalcedon (full matrix) implementation is fastest and all four methods show similar memory consumption. As datasets scale, the RDKit implementation becomes significantly slower and consumes more memory than the Chalcedon methods. RDKit and Chalcedon (full matrix) scale quadratically in memory, while the Chalcedon (chunked) methods scale roughly linearly. The float64 implementation roughly doubles the float32 implementation in both wall time and peak RSS.

Cutoff sweep

We ran each method using cutoffs spanning 0.3–0.9 on our 25k subset and found discrepancies in the resulting clusters at some cutoffs. These discrepancies are due to precision differences and the algebraic rearrangement of the cutoff comparison, as floating-point arithmetic can yield different results despite mathematical equivalence. While the methods are not bitwise identical at every cutoff, the discrepancies are very small where they exist (≤ 2.5% of cluster count).

Number of clusters:

cutoffChalcedon (chunked and full matrix, float32)Chalcedon (chunked, float64)RDKit
0.3022,84122,84122,883
0.4019,37719,31019,310
0.702,9833,0592,991

BitBIRCH-Lean comparison

As a challenge, we benchmarked Chalcedon's exact Butina (O(n2)O(n^2) time) against BitBIRCH-Lean, an approximate hierarchical clustering algorithm based on the BIRCH algorithm (≈O(n)O(n) time). Unsurprisingly, BitBIRCH-Lean pulls away on both wall time and peak memory as dataset sizes increase, but Chalcedon remains comparable and a practical choice at these sizes while producing exact Butina clusters.

Chalcedon vs. BitBIRCH-Lean.

Wall time (s)

nChalcedonBitBIRCH-Lean
1000.0060.005
1,0000.0400.038
10,0000.3570.482
25,0001.6401.250
50,0005.7502.560
100,00021.1405.420

Peak RSS (GB)

nChalcedonBitBIRCH-Lean
1000.080.09
1,0000.110.09
10,0000.330.13
25,0000.670.19
50,0001.230.30
100,0002.350.52

As they are fundamentally different clustering algorithms, the two methods produce different partitions even at the same nominal threshold (BitBIRCH-Lean's similarity threshold set to 1 − cutoff to match Chalcedon's distance cutoff).

Number of clusters:

nChalcedon (chunked, float32)BitBIRCH-Lean
1009598
1,000678866
10,0003,1077,382
25,0005,05816,318
50,0007,06729,140
100,0009,59051,232

How to use Chalcedon

Install from PyPI: uv pip install chalcedon.

For our recommended use case, the algorithm can be run directly from SMILES, with Morgan fingerprints (radius 2, 2048 bits, float32) computed internally.

import chalcedon

smiles = [
    "CCO",
    "c1ccccc1",
    # ...
]

splits = chalcedon.butina_split(
    smiles,
    fractions={"train": 0.8, "val": 0.1, "test": 0.1},
    cutoff=0.65,
    dtype="float32",  # or np.float32
)

train_smiles = splits["train"]
val_smiles = splits["val"]
test_smiles = splits["test"]

For full customizability, drop down to butina_cluster with your own fingerprints or descriptors. We recommend float64 for non-binary descriptors.

import numpy as np
import chalcedon

descriptors = my_descriptor_generator(molecules)  # numpy.ndarray of shape (n, d)

cluster_ids = chalcedon.butina_cluster(descriptors, cutoff=0.45, dtype="float64")
splits = chalcedon.greedy_cluster_split(
    cluster_ids,
    fractions={"train": 0.8, "val": 0.1, "test": 0.1},
)

train_indices = splits["train"]  # numpy.ndarray of indices into `descriptors`

Source code for the Chalcedon package and for reproducing our experiments can be found on GitHub. If there are other clustering algorithms you think we should include, or if you find bugs, please reach out to us at contact@rowansci.com or leave an issue on the GitHub repo.

Banner background image

Start running calculations in minutes!

Our platform lets you submit, view, analyze, and share calculations using cutting-edge methods trusted by hundreds of leading scientists. We give every new user 500 free credits to start, plus more every week. Making an account and running your first calculation takes only seconds: start using Rowan today!

Start computing →

What to read next

Fast and Efficient Butina Splitting of Chemical Data with Chalcedon

Fast and Efficient Butina Splitting of Chemical Data with Chalcedon

A fast, memory-efficient, minimal-dependency Python package for Butina clustering and splitting chemical data.
May 27, 2026 · Eli Mann
Improving Rowan's Performance on the OpenBind EV-A71 Release

Improving Rowan's Performance on the OpenBind EV-A71 Release

How we recovered useful RBFE accuracy on a challenging real-world dataset.
May 20, 2026 · Corin Wagen
New Protein Visualizations

New Protein Visualizations

distilling insight from complexity; two-dimensional protein–ligand interaction diagrams; protein blob surfaces; space-filling molecule representations
May 19, 2026 · Ari Wagen
Notes on Rowan Engineering; Or How to Vibe-Refactor a Codebase

Notes on Rowan Engineering; Or How to Vibe-Refactor a Codebase

stuck in Rowan's dependency slough of despond; fleeing the complexity of microservices & partial refactors; multiplying packages to reduce complexity; using agents to vibe-refactor our whole codebase
May 13, 2026 · Jonathon Vandezande
Testing Rowan on the OpenBind EV-A71 Release

Testing Rowan on the OpenBind EV-A71 Release

How Rowan's analogue-docking and RBFE workflows fare on this dataset.
May 6, 2026 · Corin Wagen
Benchmarking Membrane-Permeability Predictors

Benchmarking Membrane-Permeability Predictors

Testing GNN-MTL and PyPermm on datasets of small molecules, macrocycles, and PROTACs
Apr 28, 2026 · Ari Wagen
Smarter Analogue Docking, Pocket Detection, and g-xTB Analytical Gradients

Smarter Analogue Docking, Pocket Detection, and g-xTB Analytical Gradients

more robust MCS detection; conformer sampling with torsional Monte Carlo; better alignment and RBFE results; a new pocket-detection workflow; analytical gradients now available for g-xTB
Apr 23, 2026 · Zachary Fried, Corin Wagen, Ari Wagen, and Jonathon Vandezande
g-xTB pKa and Website Redesign

g-xTB pKa and Website Redesign

the flaws with Rowan's AIMNet2-based pKa method; our new g-xTB-based approach; benchmarking and availability; a logo and new website for Rowan
Apr 15, 2026 · Corin Wagen and Ari Wagen
Easter Updates to Rowan

Easter Updates to Rowan

webhooks, draft workflows, and usage estimates for Rowan's Python API; tautomers in non-aqueous solvents; COSMO-based descriptors; overage-based billing; an FEP speed test; welcome Zach
Apr 9, 2026 · Eli Mann, Ari Wagen, Spencer Schneider, Jonathon Vandezande, and Corin Wagen
How Fast Can FEP Run?

How Fast Can FEP Run?

Pushing the speed limit for RBFE calculations run through TMD.
Apr 8, 2026 · Corin Wagen