PGM Extra

A high-performance Rust implementation of the PGM-index (Piecewise Geometric Model index). Includes multiple variants of the index, as well as drop-in replacements for BTreeSet and BTreeMap.

Based on the paper: The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds by Paolo Ferragina and Giorgio Vinciguerra.

Features

  • High-Performance: Outperforms traditional B-trees on point queries while using less memory
  • Cache Efficient: Adaptive linear / binary search based on epsilon for optimal cache utilization
  • Multi-Level Indexing: Recursive index structure with configurable epsilon at each level
  • Generic: Works with all integer types (signed and unsigned)
  • Dynamic Variant: Support for insertions and deletions with auto-rebuild
  • Parallel Construction: Optional parallel index building using Rayon
  • No-std support: Enables usage in embedded and WASM environments

Performance

  • Query time: O(log n / log ε)
  • Space: O(n / ε) segments
  • Construction: O(n) time
  • Expected speedup: 2-4x over binary search on sufficiently large datasets

ε is the error bound, which controls the trade-off between index size and query performance.
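As a rough illustration of the space bound (reading O(n / ε) as roughly n / ε segments): with n = 1,000,000 keys and ε = 64, the data level needs on the order of 1,000,000 / 64 ≈ 15,600 segments, while ε = 256 brings that down to roughly 3,900. The price is a wider final search window of about 2ε keys (≈512 instead of ≈128) once a segment has predicted the approximate position.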

Benchmark Results

Benchmarks run on Apple M1 Pro. Dataset: 1M random u64 keys.

Query Performance (10K queries, 1M keys)

Data Structure    Time (µs)   Throughput   vs BTreeSet
HashMap                68.2    146.6 M/s   12.5x faster
Binary Search         266.8     37.5 M/s    3.2x faster
PGM (ε=256)           489.5     20.4 M/s    1.7x faster
PGM (ε=64)            520.3     19.2 M/s    1.6x faster
PGM (ε=16)            656.6     15.2 M/s    1.3x faster
BTreeSet              851.6     11.7 M/s    baseline
BTreeMap              906.1     11.0 M/s    0.9x

Memory Overhead (index only, 1M keys)

Data Structure    Memory    B/elem   vs BTreeSet
PGM (ε=256)       7.7 MB      8.09   3.2x smaller
PGM (ε=64)        8.0 MB      8.38   3.1x smaller
PGM (ε=16)        9.1 MB      9.55   2.7x smaller
BTreeSet         25.0 MB     26.18   baseline
HashMap          34.0 MB     35.65   1.4x larger
BTreeMap         40.2 MB     42.18   1.6x larger

Range Queries (1K queries, scan 100 elements)

Method                  Time (µs)   vs BTreeSet
PGM + scan                  120.5   1.4x faster
Binary Search + scan        129.9   1.3x faster
BTreeSet range              174.3   baseline
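
The "PGM + scan" row follows the usual learned-index pattern for ranges: locate the start of the range with the index, then scan forward through the sorted data. A minimal sketch of that pattern, using the Static and lower_bound APIs documented later in this README:

use pgm_extra::index::Static;

let data: Vec<u64> = (0..1_000_000).collect();
let index = Static::new(&data, 64, 4).unwrap();

// Range query [lo, hi): find the start with the index, then scan sequentially.
let (lo, hi) = (500_000u64, 500_100u64);
let start = index.lower_bound(&data, &lo);
let in_range: Vec<u64> = data[start..].iter().copied().take_while(|&x| x < hi).collect();
assert_eq!(in_range.len(), 100);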

Installation

Add to your Cargo.toml:

[dependencies]
pgm-extra = "0.1"

Available features (enable them in Cargo.toml as shown below):

  • std: Enables Dynamic index and other std-only features
  • simd: Enables SIMD for linear search
  • parallel: Enables parallel index construction
  • serde: Enables serialization/deserialization
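
Features are enabled in the dependency entry as usual; for example, to turn on parallel construction and serde support:

[dependencies]
pgm-extra = { version = "0.1", features = ["parallel", "serde"] }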

Quick Start

For most use cases, use Set or Map which own their data:

use pgm_extra::{Set, Map};

// Set: drop-in replacement for BTreeSet with the same read & write interface.
// Writes are supported but trigger an index rebuild, so prefer read-only use.
let set: Set<u64> = (0..1_000_000).collect();
assert!(set.contains(&500_000));

// Map: drop-in replacement for BTreeMap with the same read & write interface.
// As with Set, mutations rebuild the index, so prefer read-only use.
let map: Map<u64, &str> = vec![(1, "one"), (2, "two")].into_iter().collect();
assert_eq!(map.get(&1), Some(&"one"));
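
Because Set mirrors BTreeSet's read interface, range iteration should look the same; a small sketch, assuming a BTreeSet-style range method (check the crate documentation for the exact signature):

use pgm_extra::Set;

let set: Set<u64> = (0..1_000u64).collect();

// Range iteration, assuming a BTreeSet-like `range` method.
let window: Vec<u64> = set.range(10u64..20).copied().collect();
assert_eq!(window, (10u64..20).collect::<Vec<u64>>());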

Using index types directly

For advanced use cases where you manage data separately:

use pgm_extra::index::{Static, Builder};

let data: Vec<u64> = (0..1_000_000).collect();

// Build index with custom epsilon
let index = Builder::new()
    .epsilon(128)          // Error bound for data level
    .epsilon_recursive(16) // Error bound for recursive levels
    .build(&data)
    .unwrap();

// Point query - find position of key
let key = 500_000u64;
let pos = index.lower_bound(&data, &key);
assert_eq!(data[pos], key);

// Check existence
assert!(index.contains(&data, &key));
assert!(!index.contains(&data, &1_000_001));

// Get approximate position with bounds
let approx = index.search(&key);
println!("Position ~{} in range [{}, {}]", approx.pos, approx.lo, approx.hi);

Index Types

Static

The default multi-level recursive index. Best for large datasets where query performance is critical.

use pgm_extra::index::Static;

let data: Vec<u64> = (0..100_000).collect();
let index = Static::new(&data, 64, 4).unwrap();

println!("Height: {}", index.height());
println!("Segments: {}", index.segments_count());
println!("Size: {} bytes", index.size_in_bytes());

OneLevel

Simple single-level variant. Lower overhead for smaller datasets.

use pgm_extra::index::OneLevel;

let data: Vec<u64> = (0..10_000).collect();
let index = OneLevel::new(&data, 64).unwrap();

Dynamic (requires std feature)

Dynamic variant for workloads that require mutation. Supports insertions and deletions with automatic rebuilding.

use pgm_extra::index::Dynamic;

let mut index = Dynamic::new(64, 4);

// Insert keys (mutation)
index.extend(0..1000);

assert!(index.contains(&500));

index.remove(&500);
assert!(!index.contains(&500));

// Iterate in sorted order
for key in index.iter() {
    println!("{}", key);
}

Epsilon Selection

The epsilon parameter controls the trade-off between index size and query performance:

Epsilon    Segments   Query Speed   Use Case
4-16       Many       Fastest       Cache-critical workloads
32-64      Moderate   Fast          Default, good balance
128-256    Few        Moderate      Memory-constrained
512+       Minimal    Slower        Extreme memory savings
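
A minimal way to explore this trade-off on your own data, using the Static constructor and the size/segment accessors shown earlier (the exact numbers depend on the key distribution):

use pgm_extra::index::Static;

let data: Vec<u64> = (0..1_000_000).collect();

// Compare index size and segment count across a few epsilon values.
for epsilon in [16, 64, 256] {
    let index = Static::new(&data, epsilon, 4).unwrap();
    println!(
        "ε={:<4} segments={:<8} size={} bytes",
        epsilon,
        index.segments_count(),
        index.size_in_bytes()
    );
}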

Parallel Construction

Enable the parallel feature for multi-threaded index building:

use pgm_extra::index::Static;

let data: Vec<u64> = (0..10_000_000).collect();
let index = Static::new_parallel(&data, 64, 4).expect("failed to build index");

Benchmarks

Run benchmarks with:

just bench

Signed Integers

The index handles signed integers by mapping them to an unsigned key space with an order-preserving transformation. As the example below shows, keys with the same absolute value (such as -100 and 100) do not collide.

use pgm_extra::index::Static;

let data: Vec<i64> = (-5000..5000).collect();
let index = Static::new(&data, 64, 4).unwrap();

assert!(index.contains(&data, &-100));
assert!(index.contains(&data, &100));

// -5000..5000 is half-open: -5000 is in the data, 5000 is not.
assert!(index.contains(&data, &-5000));
assert!(!index.contains(&data, &5000));
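
One common way to build such an order-preserving mapping is to flip the sign bit, which sends i64::MIN..=i64::MAX onto 0..=u64::MAX monotonically. This is shown only to illustrate the idea; the crate's internal encoding may differ.

// Illustrative only: flip the sign bit to map i64 to u64 while preserving order.
fn to_ordered_u64(x: i64) -> u64 {
    (x as u64) ^ (1u64 << 63)
}

assert!(to_ordered_u64(-100) < to_ordered_u64(100));
assert_eq!(to_ordered_u64(i64::MIN), 0);
assert_eq!(to_ordered_u64(i64::MAX), u64::MAX);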

FAQ

Why is SIMD not manually implemented?

Manual SIMD (e.g., using AVX2 or NEON intrinsics) was tested and found to offer no performance benefit over the current implementation. LLVM's ability to auto-vectorize the loop makes it possible to write high-performance code without unsafe blocks or architecture-specific code.
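
For illustration, here is the kind of branchless comparison-and-sum loop that LLVM auto-vectorizes well; it is a sketch of the general pattern, not the crate's actual code. Counting how many keys in a sorted window are smaller than the target yields the lower-bound offset without any data-dependent branches.

// Illustrative pattern only: branchless lower bound within a small sorted window.
// LLVM typically auto-vectorizes this comparison-and-sum loop.
fn lower_bound_in_window(window: &[u64], key: u64) -> usize {
    window.iter().map(|&x| (x < key) as usize).sum()
}

let window = [10u64, 20, 30, 40, 50];
assert_eq!(lower_bound_in_window(&window, 35), 3);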

References

  • Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 2020.

License

MIT License

Copyright (c) 2025 Fox Studio (Oskar Cieslik)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
