Real-World Benchmark
500,000 chunks · Wikipedia + arXiv + Project Gutenberg · NVIDIA A40 · Independently verifiable — download and run it yourself.
Corpus & Methodology
Mixed real-world corpus across three domains — general knowledge, scientific literature, and long-form prose.
| Source | Domain | Size | Description |
|---|---|---|---|
| Wikipedia (Simple English) | General knowledge | ~100 MB | Encyclopedia articles |
| arXiv papers | Science / ML | ~40 MB | CS & ML abstracts + intros |
| Project Gutenberg | Literature | ~28 MB | Public domain books |
| Total | Mixed | ~168 MB raw | 642,939 paragraphs → 500,000 chunks |
Chunking: 400 words per chunk, 50-word overlap. Embedding: BAAI/bge-m3 (1024-dim) and BGE-base (768-dim, optionally PCA-reduced to 256) on an NVIDIA A40. Recall: 1,000 queries scored against exact cosine top-k ground truth.
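For reference, a minimal sketch of the chunking scheme above (400-word windows, 50-word overlap); the real pipeline's paragraph boundaries and tokenisation may differ.

```python
def chunk_words(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows: 400 words, 50-word overlap."""
    words = text.split()
    if not words:
        return []
    step = size - overlap  # advance 350 words per chunk
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```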
Retrieval Accuracy
BGE-M3 · 1024-bit binary fingerprints
| Metric | NodeMind MIH | Ground Truth |
|---|---|---|
| Recall@1 | 0.999 | 1.000 |
| Recall@3 | 0.999 | 1.000 |
| Recall@5 | 1.000 | 1.000 |
| Recall@10 | 1.000 | 1.000 |
| Recall@20 | 1.000 | 1.000 |
| MRR@10 | 0.9992 | 1.000 |
BGE-base · 768-bit and 256-bit (PCA)
| Metric | 768-bit | 256-bit (PCA) |
|---|---|---|
| Recall@1 | 0.999 | 1.000 |
| Recall@5 | 1.000 | 1.000 |
| Recall@10 | 1.000 | 1.000 |
| MRR@10 | 0.9995 | 1.000 |
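For completeness, here is one common way to compute these metrics; a hedged sketch assuming `pred` holds the MIH result indices per query and `gt` the exact cosine ranking (both array names are illustrative, not taken from the released files).

```python
import numpy as np

def recall_at_k(pred: np.ndarray, gt: np.ndarray, k: int) -> float:
    """Fraction of queries whose exact-cosine top-1 item appears in the first k results."""
    return float(np.mean([gt[q, 0] in pred[q, :k] for q in range(len(gt))]))

def mrr_at_10(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean reciprocal rank of the exact-cosine top-1 item within the first 10 results."""
    rr = []
    for q in range(len(gt)):
        hits = np.where(pred[q, :10] == gt[q, 0])[0]
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))
```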
Index Size — 500,000 Chunks
| Index | Size | Bytes / chunk | Compression vs float32 baseline |
|---|---|---|---|
| NodeMind BGE-M3 (1024-bit) | 64 MB | 128 B | 32× smaller |
| Float32 RAG, BGE-M3 (baseline) | 2,048 MB | 4,096 B | 1× (reference) |
| HNSW index (float32 + ~1.5× graph overhead) | 3,072 MB | 6,144 B | 1.5× larger (48× vs NodeMind) |
| NodeMind BGE-base 256-bit (PCA) | 16 MB | 32 B | 96× smaller (vs its 1,536 MB BGE-base float32 baseline) |
Index only — document text stored separately and equally in all systems.
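A quick sanity check of the per-chunk figures (all numbers taken from the table above):

```python
chunks = 500_000
fp_bytes = 1024 // 8    # 1024-bit fingerprint → 128 B per chunk
f32_bytes = 1024 * 4    # 1024-dim float32 embedding → 4,096 B per chunk
print(chunks * fp_bytes / 1e6)   # 64.0   MB (NodeMind BGE-M3)
print(chunks * f32_bytes / 1e6)  # 2048.0 MB (float32 baseline)
print(f32_bytes // fp_bytes)     # 32 → 32× smaller
```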
Download — Verify It Yourself
All files generated from the same 500,000 chunks. Download NodeMind + float32 RAG side by side to verify compression ratios yourself.
Verify compression in Python
```python
# pip install sentence-transformers  (only needed for querying, not verification)
import pickle

with open("nm_bgem3_index.pkl", "rb") as f:
    nm = pickle.load(f)
with open("rag_bgem3_index.pkl", "rb") as f:
    rag = pickle.load(f)

nm_mb = nm["fps"].nbytes / 1e6           # → 64
rag_mb = rag["embeddings"].nbytes / 1e6  # → 2048
ratio = rag["embeddings"].nbytes // nm["fps"].nbytes  # → 32
print(f"NodeMind : {nm_mb:.0f} MB")
print(f"Float32  : {rag_mb:.0f} MB")
print(f"Ratio    : {ratio}×")

# BGE-base 256-bit index (96× vs its own float32 baseline)
with open("nm_bgebase256_index.pkl", "rb") as f:
    nm96 = pickle.load(f)
print(nm96["fps"].nbytes / 1e6)  # → 16; nm96["fps"] shape: (500000, 32)
# float32 baseline for BGE-base: 500,000 × 768 × 4 B = 1,536 MB → 96×
```
Run a query
```python
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

with open("nm_bgem3_index.pkl", "rb") as f:
    nm = pickle.load(f)
fps = nm["fps"]
with open("corpus.pkl", "rb") as f:
    corpus = pickle.load(f)
chunks = corpus["chunks"]

# 256-entry lookup table: popcount of every possible byte value
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int32)

def query_nodemind(text, top_k=5):
    emb = model.encode([text], normalize_embeddings=True)[0]
    q_fp = _binarise(emb, nm)  # binarisation uses index metadata (patent-protected)
    # Hamming distance: XOR against all fingerprints, popcount each byte, sum per row
    dists = POPCOUNT[np.bitwise_xor(fps, q_fp[np.newaxis, :])].sum(axis=1)
    top = np.argsort(dists)[:top_k]
    return [(int(dists[i]), chunks[i][:120]) for i in top]

for dist, text in query_nodemind("What is quantum entanglement?"):
    print(f"  [{dist:4d}] {text}")
```
The `_binarise` function uses the metadata stored in the .pkl file. The exact method is covered by AU 2026901656; the index is self-contained, so you do not need the patent text to query it.
How It Works
1. Embed. Text is chunked and embedded with a sentence model (BGE-M3 or BGE-base), producing one float32 vector per chunk.
2. Binarise. Each embedding is converted to a compact binary fingerprint using pre-computed index metadata. Integer-only; no GPU at query time. The method is patent-protected (AU 2026901656).
3. Index. Binary fingerprints are stored in a Multi-Index Hash (MIH) structure; queries find candidates by Hamming distance using pure integer arithmetic on any CPU (a runnable sketch follows this list). MIH structure: Norouzi et al., CVPR 2012. Novel contribution (AU 2026901657): CTV binarisation plus a portable single-file format.
4. Query. Embed the query, binarise it, run the Hamming search. A single .pkl file: no server, no Docker, no external database.
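For intuition, here is a minimal multi-index hashing lookup in the spirit of Norouzi et al. (CVPR 2012). The `sign_binarise` stand-in is NOT NodeMind's patented CTV method, just a generic placeholder so the sketch runs end to end.

```python
import numpy as np
from collections import defaultdict

POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int32)

def sign_binarise(emb: np.ndarray) -> np.ndarray:
    """Generic stand-in: 1 bit per dimension, packed to uint8. NOT the CTV method."""
    return np.packbits(emb > 0)

class MultiIndexHash:
    """Split each fingerprint into m byte substrings; one hash table per substring."""
    def __init__(self, fps: np.ndarray, m: int = 8):
        self.fps, self.m = fps, m
        self.width = fps.shape[1] // m  # bytes per substring
        self.tables = [defaultdict(list) for _ in range(m)]
        for i, fp in enumerate(fps):
            for t in range(m):
                self.tables[t][fp[t * self.width:(t + 1) * self.width].tobytes()].append(i)

    def query(self, q_fp: np.ndarray, top_k: int = 5):
        # Pigeonhole: any fingerprint within Hamming distance m-1 of the query
        # matches it exactly on at least one substring, so it lands in `cand`.
        cand = set()
        for t in range(self.m):
            key = q_fp[t * self.width:(t + 1) * self.width].tobytes()
            cand.update(self.tables[t].get(key, []))
        idx = np.fromiter(cand, dtype=np.int64) if cand else np.arange(len(self.fps))
        dists = POPCOUNT[np.bitwise_xor(self.fps[idx], q_fp)].sum(axis=1)
        order = np.argsort(dists)[:top_k]
        return [(int(dists[i]), int(idx[i])) for i in order]
```

With 1024-bit codes and m = 8, exact bucket matches alone are guaranteed to surface every neighbour within Hamming distance 7; reaching more distant neighbours requires the bucket-probing expansion described in the paper.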
Honest Caveats
- Self-retrieval benchmark. Queries are perturbed corpus chunks — optimistic for binary methods. End-to-end QA on BEIR / MS MARCO not yet measured.
- HNSW comparison is size-only. Real FAISS HNSW achieves recall@10 ≈ 0.95–0.99 with graph traversal. A direct neutral head-to-head on a held-out set has not been run yet.
- 96× requires a lighter model. BGE-base + PCA-256 gives 96× vs its own float32 baseline. BGE-M3 (stronger, cross-lingual) gives 32× vs float32 and 48× vs HNSW.
- Text-only corpus. Tables, code blocks, and multi-modal documents were not tested.
- 2 GB download for float32 baseline. Budget the bandwidth if you want to verify baseline sizes yourself.
Patents
AU 2026901656 — WHT Integer Codec: integer-only binarisation without learned projection.
AU 2026901657 — NodeMind Centroid MIH: CTV-based binary fingerprinting + MIH search.
Filed IP Australia, May 2026. Built in Coleambally, NSW, Australia.