Field Note No. 01  ·  Structure-Centric ML
arXiv: 2603.13339  ·  A. Elmahdi, PhD
For thirty years, density-based clustering has measured distance.

It should have been measuring structure.

Structure-Centric Machine Learning is a new paradigm for unsupervised data analysis. It resolves three problems that DBSCAN (1996) and HDBSCAN (2013) never could: clustering in native high dimensions without reduction, retaining every data point instead of discarding noise, and transferring tuned parameters across dataset scales without re-tuning.

[Figure: side-by-side comparison — HDBSCAN finds 64 clusters and discards 30% of the data; AdaGraph finds 7 clusters with zero data loss.]

One algorithm found seven.
The other found sixty-four — and discarded a third of the data.

The same dataset. The same sentence-BERT embeddings. The same UMAP pipeline. The only variable was the clustering algorithm. HDBSCAN — the industry standard for a decade — produced 64 fragments and threw 30.4% of the data into a noise pile. AdaGraph, built on the structure-centric paradigm, produced 7 clean clusters with zero data discarded. ARI: 0.751 vs 0.464.
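The ARI figures quoted throughout are Adjusted Rand Index scores: chance-corrected agreement between predicted clusters and ground-truth labels, where 1.0 is exact recovery and 0 is chance level. A minimal self-contained sketch of the metric (scikit-learn's adjusted_rand_score computes the same quantity; the toy labelings below are illustrative):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Chance-corrected pairwise agreement between two labelings."""
    n = len(truth)
    contingency = Counter(zip(truth, pred))   # joint label counts
    row = Counter(truth)                      # ground-truth class sizes
    col = Counter(pred)                       # predicted cluster sizes
    index = sum(comb(c, 2) for c in contingency.values())
    sum_row = sum(comb(c, 2) for c in row.values())
    sum_col = sum(comb(c, 2) for c in col.values())
    expected = sum_row * sum_col / comb(n, 2)
    max_index = (sum_row + sum_col) / 2
    return (index - expected) / (max_index - expected)

# Cluster IDs are arbitrary: perfect recovery under relabeling scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

Note that when an algorithm discards points as noise, whether those points are scored under a catch-all label or excluded from the comparison changes the resulting ARI.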

+24–62%
ARI improvement on text
Across four real-world datasets, AdaHD averages 0.50 ARI vs 0.35 for HDBSCAN.
0%
Data discarded as noise
Every point gets assigned to a cluster. No silent signal loss.
5,000D
Native dimensionality
Graph-SCOPE holds at τ ≈ 0.95 where Silhouette collapses to 0.46.
3
Independent validation
Text, cancer genomics, materials science. Same structural advantages.

Distance fails. Structure doesn't.

In high dimensions, every point is roughly equidistant from every other point. This is the textbook "curse of dimensionality," and it's the reason every modern clustering pipeline ends with the same preprocessing step: reduce the dimensions first, then cluster. That preprocessing is where the signal dies.
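The concentration effect is easy to reproduce with a few lines of NumPy (a minimal sketch; the sample size and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def relative_contrast(dim, n=1000):
    """(max - min) / min of distances from one query point to a uniform
    random cloud. As dim grows this ratio collapses toward zero: nearest
    and farthest neighbors become nearly indistinguishable."""
    X = rng.random((n, dim))
    q = rng.random(dim)
    d = np.linalg.norm(X - q, axis=1)
    return (d.max() - d.min()) / d.min()

for dim in (2, 100, 5000):
    print(f"dim={dim:>5}  relative contrast={relative_contrast(dim):.3f}")
```

At 2 dimensions the nearest neighbor is far closer than the farthest; at 5,000 the spread nearly vanishes, which is what starves any distance-threshold criterion of signal.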

Structure-centric clustering changes what is being measured. Cluster identity is encoded not through point-to-point geometry but through the topological organization of points within their native feature space — using a kNN graph as the substrate. The result is a family of algorithms whose parameters are scale-invariant, whose validity metrics survive into the thousands of dimensions, and whose deployment workflow scales linearly from a 1,000-point sample to a 500,000-point dataset.
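To make the graph-substrate idea concrete, here is a generic mutual-kNN sketch in NumPy (this is not the published AdaGraph algorithm, and the two-blob data is synthetic): points are joined only when each is among the other's k nearest neighbors, and clusters emerge as connected components of that graph.

```python
import numpy as np

def mutual_knn_components(X, k=5):
    """Cluster by graph structure: connect mutual k-nearest neighbors,
    then label each connected component. Illustrative only."""
    n = len(X)
    # Pairwise squared distances (fine for small n); exclude self-matches.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]        # each point's k nearest
    knn = np.zeros((n, n), dtype=bool)
    knn[np.arange(n)[:, None], nn] = True
    adj = knn & knn.T                         # keep only mutual edges
    # Connected components via depth-first search.
    labels = np.full(n, -1)
    comp = 0
    for s in range(n):
        if labels[s] != -1:
            continue
        stack, labels[s] = [s], comp
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    return labels

# Two well-separated Gaussian blobs in 50 native dimensions.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.5, size=(40, 50))
B = rng.normal(8.0, 0.5, size=(40, 50))
labels = mutual_knn_components(np.vstack([A, B]), k=5)
print(len(set(labels.tolist())), "components")
```

The only parameter is k, a neighborhood rank rather than a distance, which is why graph-based criteria of this kind are insensitive to the absolute scale of the feature space.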

Choose the stack that fits your data.

STACK · LOW-D

SCOPE → AdaBox → SLCD

The complete low-dimensional pipeline. SCOPE diagnoses clustering quality across five components. AdaBox produces structure-centric clusters with scale-invariant parameters. SLCD transfers parameters from sample to full deployment. Strongest wherever dimensionality reduction (UMAP, PCA, t-SNE) precedes clustering — including the BERTopic pipeline that social listening and trend detection platforms run in production today.

Examine the stack →
STACK · HIGH-D

Graph-SCOPE → AdaGraph → DA-Sampler → SLCD

The native high-dimensional pipeline. Graph-SCOPE evaluates clustering quality at any dimensionality without distance assumptions. AdaGraph clusters in 100–5,000+ dimensions without any reduction. The Density-Aware Sampler preserves density structure. SLCD deploys to half a million points. Strongest in domains where information lives in the native feature space: gene expression, materials properties, multi-modal sensor data.

Examine the stack →
The high-dimensional stack is superior in high-dimensional applications. The low-dimensional stack is superior in low-dimensional applications. Structure-centric is superior in both.
— A. Elmahdi · arXiv: 2603.13339

Three domains. Same advantage.

01 · GENOMICS

Cancer module discovery

On lung adenocarcinoma (246 patients) and hepatocellular carcinoma (431 patients), AdaGraph found gene modules that WGCNA — the field's gold standard with 18,000+ citations — discarded entirely as background. 24 wins, 3 ties, 13 losses against four established competitors.

See the genomics study →
02 · TEXT & NLP

Topic modeling at production scale

Across 20-Newsgroups subsets (n=5,581 to 17,901) and AG-News (n=7,600), AdaHD achieves 0.5015 average ARI with 0% data loss versus 0.3516 for HDBSCAN with 23–36% noise rejection. SCOPE quality score: 0.76 vs 0.30.

See the text benchmark →
03 · MATERIALS SCIENCE

Superconductor family discovery

On the SuperCon database (21,263 materials, 81 dimensions), AdaGraph identified 18 superconductor families aligned with the known physical classification — BCS, iron-based, cuprate — with no T꜀ supervision. K-Means+Silhouette returned k=2.

See the materials study →

Patents filed. arXiv submitted. Ready for partnership.

5
Patent applications
AdaBox, SCOPE filed Jan 2026. AdaGraph & Graph-SCOPE prepared Apr 2026. Six independent claims.
7
Inventions in the family
The paradigm, AdaBox, AdaGraph, SCOPE, Graph-SCOPE, SLCD, Density-Aware Sampler.
2603.13339
arXiv preprint
Structure-Centric Density-Based Clustering with Scale-Invariant Parameters.

Three commercial verticals are open: social listening & trend detection (text), bioinformatics & drug discovery (genomics), and materials informatics (high-dimensional scientific data). Licensing terms favor reference customers. Open a conversation →