PGM Extra

A high-performance Rust implementation of the PGM-index (Piecewise Geometric Model index). Includes multiple variants of the index, as well as drop-in replacements for BTreeSet and BTreeMap.

Based on the paper: The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds by Paolo Ferragina and Giorgio Vinciguerra.

Features

  • High-Performance: Outperforms traditional B-trees on point queries while using less memory
  • Cache Efficient: Adaptive linear / binary search based on epsilon for optimal cache utilization
  • Multi-Level Indexing: Recursive index structure with configurable epsilon at each level
  • Generic: Works with all integer types (signed and unsigned)
  • Dynamic Variant: Support for insertions and deletions with auto-rebuild
  • Parallel Construction: Optional parallel index building using Rayon
  • No-std support: Enables usage in embedded and WASM environments

Performance

  • Query time: O(log n / log ε)
  • Space: O(n / ε) segments
  • Construction: O(n) time
  • Expected speedup: 2-4x over binary search on sufficiently large datasets

ε is the error bound, which controls the trade-off between index size and query performance.
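As a rough illustration of the space bound (reading O(n / ε) as roughly n / ε segments): with n = 1,000,000 keys and ε = 64, the data level needs on the order of 1,000,000 / 64 ≈ 15,600 segments, while ε = 256 brings that down to roughly 3,900. The price is a wider final search window of about 2ε keys (≈512 instead of ≈128) once a segment has predicted the approximate position.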

Benchmark Results

Benchmarks run on Apple M1 Pro. Dataset: 1M random u64 keys.

Query Performance (10K queries, 1M keys)

Data Structure    Time (µs)   Throughput   vs BTreeSet
HashMap                68.2    146.6 M/s   12.5x faster
Binary Search         266.8     37.5 M/s    3.2x faster
PGM (ε=256)           489.5     20.4 M/s    1.7x faster
PGM (ε=64)            520.3     19.2 M/s    1.6x faster
PGM (ε=16)            656.6     15.2 M/s    1.3x faster
BTreeSet              851.6     11.7 M/s    baseline
BTreeMap              906.1     11.0 M/s    0.9x

Memory Overhead (index only, 1M keys)

Data Structure    Memory    B/elem   vs BTreeSet
PGM (ε=256)       7.7 MB      8.09   3.2x smaller
PGM (ε=64)        8.0 MB      8.38   3.1x smaller
PGM (ε=16)        9.1 MB      9.55   2.7x smaller
BTreeSet         25.0 MB     26.18   baseline
HashMap          34.0 MB     35.65   1.4x larger
BTreeMap         40.2 MB     42.18   1.6x larger

Range Queries (1K queries, scan 100 elements)

Method                  Time (µs)   vs BTreeSet
PGM + scan                  120.5   1.4x faster
Binary Search + scan        129.9   1.3x faster
BTreeSet range              174.3   baseline
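
The "PGM + scan" row follows the usual learned-index pattern for ranges: locate the start of the range with the index, then scan forward through the sorted data. A minimal sketch of that pattern, using the Static and lower_bound APIs documented later in this README:

use pgm_extra::index::Static;

let data: Vec<u64> = (0..1_000_000).collect();
let index = Static::new(&data, 64, 4).unwrap();

// Range query [lo, hi): find the start with the index, then scan sequentially.
let (lo, hi) = (500_000u64, 500_100u64);
let start = index.lower_bound(&data, &lo);
let in_range: Vec<u64> = data[start..].iter().copied().take_while(|&x| x < hi).collect();
assert_eq!(in_range.len(), 100);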

Installation

Add to your Cargo.toml:

[dependencies]
pgm-extra = "0.1"

Available features (enable them in Cargo.toml as shown below):

  • std: Enables Dynamic index and other std-only features
  • simd: Enables SIMD for linear search
  • parallel: Enables parallel index construction
  • serde: Enables serialization/deserialization
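
Features are enabled in the dependency entry as usual; for example, to turn on parallel construction and serde support:

[dependencies]
pgm-extra = { version = "0.1", features = ["parallel", "serde"] }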

Quick Start

For most use cases, use Set or Map which own their data:

use pgm_extra::{Set, Map};

// Set: drop-in replacement for BTreeSet with the same read & write interface.
// Writes are supported but trigger an index rebuild, so prefer read-only use.
let set: Set<u64> = (0..1_000_000).collect();
assert!(set.contains(&500_000));

// Map: drop-in replacement for BTreeMap with the same read & write interface.
// As with Set, mutations rebuild the index, so prefer read-only use.
let map: Map<u64, &str> = vec![(1, "one"), (2, "two")].into_iter().collect();
assert_eq!(map.get(&1), Some(&"one"));
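
Because Set mirrors BTreeSet's read interface, range iteration should look the same; a small sketch, assuming a BTreeSet-style range method (check the crate documentation for the exact signature):

use pgm_extra::Set;

let set: Set<u64> = (0..1_000u64).collect();

// Range iteration, assuming a BTreeSet-like `range` method.
let window: Vec<u64> = set.range(10u64..20).copied().collect();
assert_eq!(window, (10u64..20).collect::<Vec<u64>>());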

Using index types directly

For advanced use cases where you manage data separately:

use pgm_extra::index::{Static, Builder};

let data: Vec<u64> = (0..1_000_000).collect();

// Build index with custom epsilon
let index = Builder::new()
    .epsilon(128)          // Error bound for data level
    .epsilon_recursive(16) // Error bound for recursive levels
    .build(&data)
    .unwrap();

// Point query - find position of key
let key = 500_000u64;
let pos = index.lower_bound(&data, &key);
assert_eq!(data[pos], key);

// Check existence
assert!(index.contains(&data, &key));
assert!(!index.contains(&data, &1_000_001));

// Get approximate position with bounds
let approx = index.search(&key);
println!("Position ~{} in range [{}, {}]", approx.pos, approx.lo, approx.hi);

Index Types

Static

The default multi-level recursive index. Best for large datasets where query performance is critical.

use pgm_extra::index::Static;

let data: Vec<u64> = (0..100_000).collect();
let index = Static::new(&data, 64, 4).unwrap();

println!("Height: {}", index.height());
println!("Segments: {}", index.segments_count());
println!("Size: {} bytes", index.size_in_bytes());

OneLevel

Simple single-level variant. Lower overhead for smaller datasets.

use pgm_extra::index::OneLevel;

let data: Vec<u64> = (0..10_000).collect();
let index = OneLevel::new(&data, 64).unwrap();

Dynamic (requires std feature)

Dynamic variant for workloads that require mutation. Supports insertions and deletions with automatic rebuilding.

use pgm_extra::index::Dynamic;

let mut index = Dynamic::new(64, 4);

// Insert keys (mutation)
index.extend(0..1000);

assert!(index.contains(&500));

index.remove(&500);
assert!(!index.contains(&500));

// Iterate in sorted order
for key in index.iter() {
    println!("{}", key);
}

Epsilon Selection

The epsilon parameter controls the trade-off between index size and query performance:

Epsilon    Segments   Query Speed   Use Case
4-16       Many       Fastest       Cache-critical workloads
32-64      Moderate   Fast          Default, good balance
128-256    Few        Moderate      Memory-constrained
512+       Minimal    Slower        Extreme memory savings
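
A minimal way to explore this trade-off on your own data, using the Static constructor and the size/segment accessors shown earlier (the exact numbers depend on the key distribution):

use pgm_extra::index::Static;

let data: Vec<u64> = (0..1_000_000).collect();

// Compare index size and segment count across a few epsilon values.
for epsilon in [16, 64, 256] {
    let index = Static::new(&data, epsilon, 4).unwrap();
    println!(
        "ε={:<4} segments={:<8} size={} bytes",
        epsilon,
        index.segments_count(),
        index.size_in_bytes()
    );
}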

Parallel Construction

Enable the parallel feature for multi-threaded index building:

use pgm_extra::index::Static;

let data: Vec<u64> = (0..10_000_000).collect();
let index = Static::new_parallel(&data, 64, 4).expect("failed to build index");

Benchmarks

Run benchmarks with:

just bench

Signed Integers

The index handles signed integers by mapping them to an unsigned key space with an order-preserving transformation. As the example below shows, keys with the same absolute value (such as -100 and 100) do not collide.

use pgm_extra::index::Static;

let data: Vec<i64> = (-5000..5000).collect();
let index = Static::new(&data, 64, 4).unwrap();

assert!(index.contains(&data, &-100));
assert!(index.contains(&data, &100));

// -5000..5000 is half-open: -5000 is in the data, 5000 is not.
assert!(index.contains(&data, &-5000));
assert!(!index.contains(&data, &5000));
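
One common way to build such an order-preserving mapping is to flip the sign bit, which sends i64::MIN..=i64::MAX onto 0..=u64::MAX monotonically. This is shown only to illustrate the idea; the crate's internal encoding may differ.

// Illustrative only: flip the sign bit to map i64 to u64 while preserving order.
fn to_ordered_u64(x: i64) -> u64 {
    (x as u64) ^ (1u64 << 63)
}

assert!(to_ordered_u64(-100) < to_ordered_u64(100));
assert_eq!(to_ordered_u64(i64::MIN), 0);
assert_eq!(to_ordered_u64(i64::MAX), u64::MAX);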

FAQ

Why is SIMD not manually implemented?

Manual SIMD (e.g., using AVX2 or NEON intrinsics) was tested and found to offer no performance benefit over the current implementation. LLVM's ability to auto-vectorize the loop makes it possible to write high-performance code without unsafe blocks or architecture-specific code.
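
For illustration, here is the kind of branchless comparison-and-sum loop that LLVM auto-vectorizes well; it is a sketch of the general pattern, not the crate's actual code. Counting how many keys in a sorted window are smaller than the target yields the lower-bound offset without any data-dependent branches.

// Illustrative pattern only: branchless lower bound within a small sorted window.
// LLVM typically auto-vectorizes this comparison-and-sum loop.
fn lower_bound_in_window(window: &[u64], key: u64) -> usize {
    window.iter().map(|&x| (x < key) as usize).sum()
}

let window = [10u64, 20, 30, 40, 50];
assert_eq!(lower_bound_in_window(&window, 35), 3);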

References

  • Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 2020.

License

MIT License

Copyright (c) 2025 Fox Studio (Oskar Cieslik)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
