A high-performance Rust implementation of the PGM-index (Piecewise Geometric Model index).
Includes multiple variants of the index, as well as drop-in replacements for BTreeSet and BTreeMap.
Based on the paper: The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds by Paolo Ferragina and Giorgio Vinciguerra.
- High-Performance: Outperforms traditional B-trees on point queries while using less memory
- Cache Efficient: Adaptive linear / binary search based on epsilon for optimal cache utilization
- Multi-Level Indexing: Recursive index structure with configurable epsilon at each level
- Generic: Works with all integer types (signed and unsigned)
- Dynamic Variant: Support for insertions and deletions with auto-rebuild
- Parallel Construction: Optional parallel index building using Rayon
- No-std support: Enables usage on embedded / WASM
- Query time: O(log n / log ε)
- Space: O(n / ε) segments
- Construction: O(n) time
- Expected outcome: speedup of 2-4x over binary search on large datasets
ε is the error bound, which controls the trade-off between index size and query performance.
Benchmarks run on Apple M1 Pro. Dataset: 1M random u64 keys.
| Data Structure | Time (µs) | Throughput | vs BTreeSet |
|---|---|---|---|
| HashMap | 68.2 | 146.6 M/s | 12.5x faster |
| Binary Search | 266.8 | 37.5 M/s | 3.2x faster |
| PGM (ε=256) | 489.5 | 20.4 M/s | 1.7x faster |
| PGM (ε=64) | 520.3 | 19.2 M/s | 1.6x faster |
| PGM (ε=16) | 656.6 | 15.2 M/s | 1.3x faster |
| BTreeSet | 851.6 | 11.7 M/s | baseline |
| BTreeMap | 906.1 | 11.0 M/s | 0.9x |
| Data Structure | Memory | B/elem | vs BTreeSet |
|---|---|---|---|
| PGM (ε=256) | 7.7 MB | 8.09 | 3.2x smaller |
| PGM (ε=64) | 8.0 MB | 8.38 | 3.1x smaller |
| PGM (ε=16) | 9.1 MB | 9.55 | 2.7x smaller |
| BTreeSet | 25.0 MB | 26.18 | baseline |
| HashMap | 34.0 MB | 35.65 | 1.4x larger |
| BTreeMap | 40.2 MB | 42.18 | 1.6x larger |
| Method | Time (µs) | vs BTreeSet |
|---|---|---|
| PGM + scan | 120.5 | 1.4x faster |
| Binary Search + scan | 129.9 | 1.3x faster |
| BTreeSet range | 174.3 | baseline |
Add to your Cargo.toml:
[dependencies]
pgm-extra = "0.1"Available features:
std: EnablesDynamicindex and other std-only featuressimd: Enables SIMD for linear searchparallel: Enables parallel index constructionserde: Enables serialization/deserialization
For most use cases, use Set or Map which own their data:
use pgm_extra::{Set, Map};
// Set (replacement for BTreeSet, includes same read & write interface)
// (Even though it supports write operations, it's recommended to be used as read-only, due to index rebuild on mutation)
let set: Set<u64> = (0..1_000_000).collect();
assert!(set.contains(&500_000));
// Map (replacement for BTreeMap, includes same read & write interface)
// (Even though it supports write operations, it's recommended to be used as read-only, due to index rebuild on mutation)
let map: Map<u64, &str> = vec![(1, "one"), (2, "two")].into_iter().collect();
assert_eq!(map.get(&1), Some(&"one"));For advanced use cases where you manage data separately:
use pgm_extra::index::{Static, Builder};
let data: Vec<u64> = (0..1_000_000).collect();
// Build index with custom epsilon
let index = Builder::new()
.epsilon(128) // Error bound for data level
.epsilon_recursive(16) // Error bound for recursive levels
.build(&data)
.unwrap();
// Point query - find position of key
let key = 500_000u64;
let pos = index.lower_bound(&data, &key);
assert_eq!(data[pos], key);
// Check existence
assert!(index.contains(&data, &key));
assert!(!index.contains(&data, &1_000_001));
// Get approximate position with bounds
let approx = index.search(&key);
println!("Position ~{} in range [{}, {}]", approx.pos, approx.lo, approx.hi);The default multi-level recursive index. Best for large datasets where query performance is critical.
use pgm_extra::index::Static;
let data: Vec<u64> = (0..100_000).collect();
let index = Static::new(&data, 64, 4).unwrap();
println!("Height: {}", index.height());
println!("Segments: {}", index.segments_count());
println!("Size: {} bytes", index.size_in_bytes());Simple single-level variant. Lower overhead for smaller datasets.
use pgm_extra::index::OneLevel;
let data: Vec<u64> = (0..10_000).collect();
let index = OneLevel::new(&data, 64).unwrap();Dynamic variant where mutations are required. Supports insertions and deletions with automatic rebuilding.
use pgm_extra::index::Dynamic;
let mut index = Dynamic::new(64, 4);
// Insert keys (mutation)
index.extend(0..1000);
assert!(index.contains(&500));
index.remove(&500);
assert!(!index.contains(&500));
// Iterate in sorted order
for key in index.iter() {
println!("{}", key);
}The epsilon parameter controls the trade-off between index size and query performance:
| Epsilon | Segments | Query Speed | Use Case |
|---|---|---|---|
| 4-16 | Many | Fastest | Cache-critical workloads |
| 32-64 | Moderate | Fast | Default, good balance |
| 128-256 | Few | Moderate | Memory-constrained |
| 512+ | Minimal | Slower | Extreme memory savings |
Enable the parallel feature for multi-threaded index building:
use pgm_extra::index::Static;
let data: Vec<u64> = (0..10_000_000).collect();
let index = Static::new_parallel(&data, 64, 4).expect("failed to build index");Run benchmarks with:
just benchThe index correctly handles signed integers by mapping them to unsigned space. As shown in the example below, there are no collisions on absolute values of the keys.
use pgm_extra::index::Static;
let data: Vec<i64> = (-5000..5000).collect();
let index = Static::new(&data, 64, 4).unwrap();
assert!(index.contains(&data, &-100));
assert!(index.contains(&data, &100));
assert!(index.contains(&data, &-5000));
assert!(!index.contains(&data, &5000));Manual SIMD (e.g., using AVX2 or NEON intrinsics) was tested and found to offer no performance benefit over the current implementation. LLVM's ability to auto-vectorize the loop makes it possible to write high-performance code without unsafe blocks or architecture-specific code.
MIT License
Copyright (c) 2025 Fox Studio (Oskar Cieslik)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.