
feat: VectorIndex serialization and zero-IO reconstruction #6223

Open
wjones127 wants to merge 25 commits into lance-format:main from wjones127:feat/partition-entry-serde

Conversation

@wjones127
Contributor

@wjones127 wjones127 commented Mar 18, 2026

Depends on #6222.

Add serialization support for IVF vector index state, enabling
reconstruction from cache without re-reading index files.

Changes:

  • partition_serde.rs: Spillable trait with serialize(&self, writer)
    and deserialize(data) for all quantizer types (PQ, SQ, Flat, RabitQ)
  • IvfIndexState / VectorIndexData: serializable snapshot of vector
    index state cached via cacheable_state() on VectorIndex
  • reconstruct_vector_index: rebuilds a live IVFIndex from cached
    IvfIndexState, dispatching on sub-index and quantization types
  • File metadata and sizes are cached during try_new so reconstruction
    avoids both data reads and HEAD requests
  • from_cached constructor on IvfQuantizationStorage to skip global
    buffer reads when reconstructing from cached metadata
  • Zero-IO test verifying reconstruction performs no IO after initial open
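A minimal sketch of what a writer-based `Spillable` round trip could look like. Only the method names `serialize(&self, writer)` and `deserialize(data)` come from the list above; the error type, the `&[u8]` input, and the `ToyCodes` payload are assumptions made for illustration and are not the layout used in partition_serde.rs.

```rust
use std::io::{self, Write};

/// Hypothetical shape of the trait; the real one may use different
/// error and buffer types.
pub trait Spillable: Sized {
    /// Write a self-describing byte representation to `writer`.
    fn serialize(&self, writer: &mut dyn Write) -> io::Result<()>;
    /// Rebuild the value from bytes produced by `serialize`.
    fn deserialize(data: &[u8]) -> io::Result<Self>;
}

/// Toy stand-in for a quantizer payload (not a Lance type).
#[derive(Debug, PartialEq)]
pub struct ToyCodes {
    pub dim: u32,
    pub codes: Vec<u8>,
}

fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

impl Spillable for ToyCodes {
    fn serialize(&self, writer: &mut dyn Write) -> io::Result<()> {
        // Header: 4-byte dim, 8-byte payload length, both little-endian.
        writer.write_all(&self.dim.to_le_bytes())?;
        writer.write_all(&(self.codes.len() as u64).to_le_bytes())?;
        writer.write_all(&self.codes)
    }

    fn deserialize(data: &[u8]) -> io::Result<Self> {
        if data.len() < 12 {
            return Err(invalid("truncated header"));
        }
        let mut dim = [0u8; 4];
        dim.copy_from_slice(&data[0..4]);
        let mut len = [0u8; 8];
        len.copy_from_slice(&data[4..12]);
        let len = u64::from_le_bytes(len);
        // Compare against the remaining bytes so a corrupt length
        // cannot overflow the end offset.
        if len > (data.len() - 12) as u64 {
            return Err(invalid("length exceeds payload"));
        }
        let codes = data[12..12 + len as usize].to_vec();
        Ok(Self { dim: u32::from_le_bytes(dim), codes })
    }
}
```

The toy version copies the payload on read; the PR's IPC-based approach is described as zero-copy, which this sketch does not attempt to reproduce.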

The Session's index cache was hardcoded to use Moka. This adds a
CacheBackend trait so users can provide their own cache implementation
(e.g. Redis-backed, disk-backed, shared across processes).

Two-layer design:
- CacheBackend: object-safe async trait with opaque byte keys. This is
  what plugin authors implement (get, insert, invalidate_prefix, clear,
  num_entries, size_bytes).
- LanceCache: typed wrapper handling key construction (prefix + type
  tag), type-safe get/insert, DeepSizeOf size computation, hit/miss
  stats, and concurrent load deduplication.

MokaCacheBackend is the default, preserving existing behavior. Custom
backends are wired through Session::with_index_cache_backend() or
DatasetBuilder::with_index_cache_backend().
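The two layers can be pictured with a simplified, synchronous sketch. The real `CacheBackend` is described as async, so everything here (`ByteKeyBackend`, `MapBackend`, `TypedCache`, and the exact key layout) is a hypothetical stand-in that only illustrates the split between an opaque byte-key store and a typed wrapper that builds keys from a prefix and a type tag.

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Entry = Arc<dyn Any + Send + Sync>;

/// Backend layer (assumed shape): opaque byte keys, no knowledge of types.
pub trait ByteKeyBackend: Send + Sync {
    fn get(&self, key: &[u8]) -> Option<Entry>;
    fn insert(&self, key: &[u8], value: Entry);
    fn invalidate_prefix(&self, prefix: &[u8]);
    fn num_entries(&self) -> usize;
}

/// In-memory backend used only for this illustration.
#[derive(Default)]
pub struct MapBackend {
    entries: Mutex<HashMap<Vec<u8>, Entry>>,
}

impl ByteKeyBackend for MapBackend {
    fn get(&self, key: &[u8]) -> Option<Entry> {
        self.entries.lock().unwrap().get(key).cloned()
    }
    fn insert(&self, key: &[u8], value: Entry) {
        self.entries.lock().unwrap().insert(key.to_vec(), value);
    }
    fn invalidate_prefix(&self, prefix: &[u8]) {
        self.entries.lock().unwrap().retain(|k, _| !k.starts_with(prefix));
    }
    fn num_entries(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

/// Typed layer: builds `prefix + key + NUL + type tag` and downcasts on read.
pub struct TypedCache {
    backend: Arc<dyn ByteKeyBackend>,
    prefix: String,
}

impl TypedCache {
    pub fn new(backend: Arc<dyn ByteKeyBackend>, prefix: &str) -> Self {
        Self { backend, prefix: prefix.to_string() }
    }

    fn full_key(&self, key: &str, type_tag: &str) -> Vec<u8> {
        format!("{}{}\0{}", self.prefix, key, type_tag).into_bytes()
    }

    pub fn insert<T: Any + Send + Sync>(&self, key: &str, type_tag: &str, value: Arc<T>) {
        self.backend.insert(&self.full_key(key, type_tag), value);
    }

    /// Returns None on a miss or if the stored value has a different type.
    pub fn get<T: Any + Send + Sync>(&self, key: &str, type_tag: &str) -> Option<Arc<T>> {
        self.backend.get(&self.full_key(key, type_tag))?.downcast::<T>().ok()
    }
}
```

A plugin author would only implement the backend trait; size accounting, hit/miss stats, and load deduplication are left out of this sketch.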

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the enhancement New feature or request label Mar 18, 2026
@github-actions
Contributor

PR Review: feat: add serialize/deserialize for IVF PQ partition cache entries

Clean implementation overall — the zero-copy IPC approach is well-suited for cache serde. A few issues to flag:

P1: Integer overflow in deserialization offset arithmetic

In deserialize(), the section boundary calculations use unchecked addition:

let sub_index_end = sub_index_start + header.sub_index_len as usize;
let codebook_end = codebook_start + header.codebook_len as usize;
let storage_end = storage_start + header.storage_len as usize;

If the header contains corrupted or adversarial values (e.g., lengths close to u64::MAX), these additions can wrap around, leaving storage_end as a small value that slips past the data.len() < storage_end bounds check. The subsequent Buffer::slice_with_length calls would then read the wrong regions or panic.

Use checked_add and return an error on overflow, same as the defensive pattern already used for trailer_start / footer_start in read_ipc_zero_copy.
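A sketch of the checked variant, under the assumption that the three sections are laid out back to back after a known start offset. The `Header` struct, the `section_ends` helper, and the `String` error are hypothetical; only the field names and the use of `checked_add` follow the review comment.

```rust
/// Hypothetical header carrying the three length fields named above.
pub struct Header {
    pub sub_index_len: u64,
    pub codebook_len: u64,
    pub storage_len: u64,
}

/// Add a u64 length to a usize offset, failing instead of wrapping.
fn checked_end(start: usize, len: u64) -> Result<usize, String> {
    if len > usize::MAX as u64 {
        return Err(format!("section length {} does not fit in usize", len));
    }
    start
        .checked_add(len as usize)
        .ok_or_else(|| format!("offset overflow: {} + {}", start, len))
}

/// Returns (sub_index_end, codebook_end, storage_end), assuming the
/// sections are contiguous and begin at `sub_index_start`.
pub fn section_ends(
    sub_index_start: usize,
    header: &Header,
    data_len: usize,
) -> Result<(usize, usize, usize), String> {
    let sub_index_end = checked_end(sub_index_start, header.sub_index_len)?;
    let codebook_end = checked_end(sub_index_end, header.codebook_len)?;
    let storage_end = checked_end(codebook_end, header.storage_len)?;
    // The bounds check is only meaningful once overflow has been ruled out.
    if data_len < storage_end {
        return Err(format!("buffer too short: {} < {}", data_len, storage_end));
    }
    Ok((sub_index_end, codebook_end, storage_end))
}
```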

P1: Missing Hamming in distance type roundtrip test

test_roundtrip_preserves_distance_type covers L2, Cosine, and Dot but omits Hamming, which is handled in the distance_type_to_u8/u8_to_distance_type mapping. Either add it to the test or document why it's excluded (e.g., PQ storage doesn't support Hamming).
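One way to keep such a test from drifting is to iterate over every variant. The sketch below uses a local `DistanceType` enum and made-up tag values; the real mapping in distance_type_to_u8/u8_to_distance_type may use different numbers and a different return type.

```rust
/// Local stand-in for the distance enum (tag values are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistanceType {
    L2,
    Cosine,
    Dot,
    Hamming,
}

/// Every variant, so a round-trip test cannot silently skip one.
pub const ALL_DISTANCE_TYPES: [DistanceType; 4] = [
    DistanceType::L2,
    DistanceType::Cosine,
    DistanceType::Dot,
    DistanceType::Hamming,
];

pub fn distance_type_to_u8(dt: DistanceType) -> u8 {
    match dt {
        DistanceType::L2 => 0,
        DistanceType::Cosine => 1,
        DistanceType::Dot => 2,
        DistanceType::Hamming => 3,
    }
}

/// Unknown tags yield None rather than a panic.
pub fn u8_to_distance_type(tag: u8) -> Option<DistanceType> {
    match tag {
        0 => Some(DistanceType::L2),
        1 => Some(DistanceType::Cosine),
        2 => Some(DistanceType::Dot),
        3 => Some(DistanceType::Hamming),
        _ => None,
    }
}
```

If Hamming is deliberately unsupported for PQ storage, the same loop can assert that the serializer rejects it instead.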

Minor

  • The module is pub mod partition_serde — if this is only intended for internal caching, consider pub(crate) to avoid leaking it as public API.

🤖 Generated with Claude Code

@codecov

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 84.53333% with 116 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/ivf/partition_serde.rs | 84.53% | 58 Missing and 58 partials ⚠️ |

📢 Thoughts on this report? Let us know!

Comment on lines +16 to +18
//! Each IPC section is a complete Arrow IPC file. On deserialization, the IPC
//! sections are read zero-copy using [`FileDecoder`] so that Arrow arrays
//! reference the original buffer directly.
Contributor Author

My big worry with this is that the DeepSizeOf will no longer be accurate after a roundtrip.

Contributor Author

This concern will be addressed by #6229. As long as we aren't sharing buffers across cache entries, we should be fine.

wjones127 and others added 6 commits March 18, 2026 20:04
Add type_name()/type_id() to CacheKey and UnsizedCacheKey traits so
backends can identify the type of cached entries. Add parse_cache_key()
utility for backends to extract (user_key, type_id) from opaque key
bytes.

CacheKey-based methods now pipe the key's type_id through to the
backend. Non-CacheKey methods use type_id_of::<T>() as a sentinel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. Remove #[cfg(test)] convenience methods; tests now use CacheKey
   via a TestKey helper, eliminating the parallel method hierarchy.

2. Fix dedup race condition: re-check the cache while holding the
   in-flight lock so no two tasks can both become leader for the
   same key.

3. Use Arc::try_unwrap on the leader error path to preserve the
   original error type when possible.

4. Make invalidate_prefix async instead of fire-and-forget spawn.

5. Replace type_name().as_ptr() with a hash of std::any::TypeId for
   stable type discrimination. Defined once in type_id_of() and used
   by CacheKey::type_id() default.

6. Add dedup to WeakLanceCache::get_or_insert, sharing the in-flight
   map from the parent LanceCache.
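Item 3 relies on `Arc::try_unwrap`, which returns the inner value only when the caller holds the last reference. A small sketch of that error path follows; `LoadError` and `unwrap_leader_error` are hypothetical names, and the fallback of wrapping the shared error's debug text is an assumption about what "when possible" implies.

```rust
use std::sync::Arc;

/// Hypothetical error type for the illustration.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    NotFound(String),
    /// Fallback used when other waiters still hold the original error.
    Shared(String),
}

/// Return the original error if this is the last reference; otherwise
/// fall back to a wrapper carrying its debug representation.
pub fn unwrap_leader_error(err: Arc<LoadError>) -> LoadError {
    match Arc::try_unwrap(err) {
        Ok(original) => original,
        Err(still_shared) => LoadError::Shared(format!("{:?}", still_shared)),
    }
}
```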

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Address feedback:

1. Move get_or_insert() onto CacheBackend. The method takes a pinned
   future (not a closure), so LanceCache can type-erase the user's
   non-'static loader before passing it to the backend. Default impl
   does simple get-then-insert; MokaCacheBackend uses moka's built-in
   optionally_get_with for dedup. This eliminates duplicated dedup
   logic and the manual watch-channel machinery.

2. Restore type_name().as_ptr() for type_id derivation on CacheKey.
   Remove standalone type_id_of() function. The derivation lives in
   one place: CacheKey::type_id()/UnsizedCacheKey::type_id().

3. Remove approx_size_bytes from CacheBackend trait and Session debug
   output. Only approx_num_entries remains.
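Item 1 can be sketched with boxed futures so the trait stays object-safe. This is an illustration under assumptions: the `Backend` trait, `BoxFut` alias, `MemBackend`, and `poll_once` helper are hypothetical, values are plain byte vectors, and errors are `String`. It shows only the default get-then-insert path, not moka-based deduplication.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

pub type BoxFut<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;
type Bytes = Arc<Vec<u8>>;

pub trait Backend: Send + Sync {
    fn get<'a>(&'a self, key: &'a [u8]) -> BoxFut<'a, Option<Bytes>>;
    fn insert<'a>(&'a self, key: &'a [u8], value: Bytes) -> BoxFut<'a, ()>;

    /// Default: get, then run the already-pinned loader, then insert.
    /// The loader is a future, not a closure, so it may borrow from
    /// the caller's stack (it only needs to live for 'a).
    fn get_or_insert<'a>(
        &'a self,
        key: &'a [u8],
        loader: BoxFut<'a, Result<Bytes, String>>,
    ) -> BoxFut<'a, Result<Bytes, String>> {
        Box::pin(async move {
            if let Some(hit) = self.get(key).await {
                return Ok(hit);
            }
            let value = match loader.await {
                Ok(v) => v,
                Err(e) => return Err(e),
            };
            self.insert(key, value.clone()).await;
            Ok(value)
        })
    }
}

/// In-memory backend that relies on the default get_or_insert.
#[derive(Default)]
pub struct MemBackend {
    map: Mutex<HashMap<Vec<u8>, Bytes>>,
}

impl Backend for MemBackend {
    fn get<'a>(&'a self, key: &'a [u8]) -> BoxFut<'a, Option<Bytes>> {
        let hit = self.map.lock().unwrap().get(key).cloned();
        Box::pin(std::future::ready(hit))
    }
    fn insert<'a>(&'a self, key: &'a [u8], value: Bytes) -> BoxFut<'a, ()> {
        self.map.lock().unwrap().insert(key.to_vec(), value);
        Box::pin(std::future::ready(()))
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll a future once; enough here because nothing ever returns Pending.
pub fn poll_once<T>(mut fut: BoxFut<'_, T>) -> Option<T> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(v) => Some(v),
        Poll::Pending => None,
    }
}
```

A backend with built-in deduplication would override `get_or_insert`; the default here lets two concurrent callers both run their loaders.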

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove all methods that bypass CacheKey from WeakLanceCache (get,
insert, get_or_insert, get_unsized, insert_unsized). Remove
insert_unsized/get_unsized from LanceCache. Remove type_tag helper.
All cache access now goes through CacheKey/UnsizedCacheKey.

Make parse_cache_key return (empty, 0) instead of panicking on short
keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restore approx_size_bytes on CacheBackend so DeepSizeOf on LanceCache
reports actual cache memory usage (used by Session::size_bytes). Fixes
test_metadata_cache_size Python test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 19:34
wjones127 and others added 12 commits March 19, 2026 17:07
The type_name().as_ptr() approach for type discrimination was unstable
across crate boundaries due to monomorphization. Replace with an
explicit fn type_id() -> &'static str that each CacheKey impl provides
as a short human-readable literal (e.g. 'Vec<IndexMetadata>', 'Manifest').

Key format changes from user_key\0<8 LE bytes> to user_key\0<type_id str>.
parse_cache_key() now returns (&[u8], &str).
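The new key layout can be illustrated as follows. Only the `user_key\0<type_id str>` format and the `(&[u8], &str)` return type come from the commit message; `build_cache_key` is a hypothetical helper, splitting at the last NUL byte is an assumption, and returning an empty pair on malformed input is carried over from the earlier non-panicking behavior rather than confirmed for this version.

```rust
/// Hypothetical helper: user key, a NUL separator, then the type name.
pub fn build_cache_key(user_key: &[u8], type_name: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(user_key.len() + 1 + type_name.len());
    key.extend_from_slice(user_key);
    key.push(0);
    key.extend_from_slice(type_name.as_bytes());
    key
}

/// Split a key back into (user_key, type_name). Splits at the LAST NUL
/// so a user key containing NUL bytes still parses. Returns an empty
/// pair instead of panicking when the key is malformed.
pub fn parse_cache_key(key: &[u8]) -> (&[u8], &str) {
    let sep = match key.iter().rposition(|&b| b == 0) {
        Some(i) => i,
        None => return (&key[..0], ""),
    };
    match std::str::from_utf8(&key[sep + 1..]) {
        Ok(type_name) => (&key[..sep], type_name),
        Err(_) => (&key[..0], ""),
    }
}
```

A custom backend could use the parsed type name to pick a codec or to route entries to different storage tiers.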
Add IvfIndexState struct and serialization to lance-index, enabling
IVFIndex to export its reconstructable state (IVF model, quantizer
metadata) without non-serializable handles. Add reconstruct_vector_index()
which rebuilds an IVFIndex from cached state by re-opening FileReaders
(cheap with warm metadata cache) instead of re-fetching global buffers
from object storage.

Also adds IvfQuantizationStorage::from_cached() to skip global buffer
reads during reconstruction, and Session::file_metadata_cache() to
expose the metadata cache for the reconstruction context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reconstructed VectorIndex instances need the original cache key prefix
to share partition entries with the two-tier cache backend. Also adds
LanceCache::with_backend_and_prefix() and WeakLanceCache::prefix().

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Previously, the disk cache codec reconstructed `Arc<dyn VectorIndex>`
from `IvfIndexState` during deserialization, requiring a
`ReconstructionContext` with deferred OnceLock initialization and
sync-to-async runtime juggling. The ObjectStore in that context also
lacked proper credential wrappers.

Now the cache stores `Arc<dyn VectorIndexData>` (serializable state)
instead of `Arc<dyn VectorIndex>` (live index). Lance's
`open_vector_index()` detects cached state and reconstructs using its
own ObjectStore (with credentials) and metadata cache. This eliminates
the ReconstructionContext, OnceLock pattern, and runtime juggling.

Changes:
- Add VectorIndexData trait (lance-index) with write_to/as_any/tag
- Add DeepSizeOf impl for IvfIndexState
- Change VectorIndexCacheKey::ValueType to dyn VectorIndexData
- Add reconstruction-from-cache path in open_vector_index()
- Fix panicking downcast in LanceCache::get_with_id (return None)
- Add Debug/Clone/Copy derives to SubIndexType
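A rough sketch of the "store serializable state, downcast on read" idea. The method names `write_to`, `as_any`, and `tag` come from the list above, but their signatures are guesses; `CachedIndexData`, `ToyIvfState`, and `downcast_state` are hypothetical names used only here.

```rust
use std::any::Any;
use std::io::{self, Write};
use std::sync::Arc;

/// Assumed shape of a cacheable index-state trait.
pub trait CachedIndexData: Send + Sync {
    /// Short identifier a codec can use to pick a deserializer.
    fn tag(&self) -> &'static str;
    /// Serialize the state, for example when spilling to disk.
    fn write_to(&self, writer: &mut dyn Write) -> io::Result<()>;
    /// Enables downcasting back to the concrete state type.
    fn as_any(&self) -> &dyn Any;
}

/// Toy state object standing in for something like IvfIndexState.
pub struct ToyIvfState {
    pub num_partitions: u32,
}

impl CachedIndexData for ToyIvfState {
    fn tag(&self) -> &'static str {
        "toy_ivf"
    }
    fn write_to(&self, writer: &mut dyn Write) -> io::Result<()> {
        writer.write_all(&self.num_partitions.to_le_bytes())
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Returns None on a type mismatch instead of panicking, in the spirit
/// of the get_with_id fix listed above.
pub fn downcast_state<T: 'static>(data: &Arc<dyn CachedIndexData>) -> Option<&T> {
    data.as_any().downcast_ref::<T>()
}
```

Because the cached value holds no file handles or object-store clients, the code that opens the index can rebuild the live index with its own credentials after a successful downcast.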

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Split cache.rs into submodules (backend, keys, moka, mod)
- Rename CacheKey::type_id() to type_name() across all implementors
- Improve CacheBackend and get_or_insert docs
- Add Spillable trait with writer-based serialize for partition_serde
- Cache file metadata and file sizes to enable zero-IO reconstruction
- Add test_reconstruct_from_cache_zero_io test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Move VectorIndexData, IvfIndexState, partition_serde, cacheable_state,
and zero-IO reconstruction out of this PR to keep it focused on the
pluggable cache backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add partition_serde with Spillable trait for serializing IVF partition
entries across all quantizer types (PQ, SQ, Flat, RabitQ). Add
IvfIndexState/VectorIndexData for caching vector index state to disk,
enabling reconstruction without re-reading index files. File metadata
and sizes are cached to avoid both data reads and HEAD requests on
subsequent opens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/partition-entry-serde branch from 3cffe37 to 3ead7ca Compare March 20, 2026 23:49
@wjones127 wjones127 changed the title feat: add serialize/deserialize for IVF PQ partition cache entries feat: VectorIndex serialization and zero-IO reconstruction Mar 20, 2026
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

wjones127 and others added 4 commits March 20, 2026 17:09
The method was renamed in lance-format#6209 but the test call site in v2.rs was not
updated during the merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The file_sizes parameter on IVFIndex::try_new and the file_size_map()
usage in open_vector_index were from merged PR lance-format#5497, not the
serialization PR. Restoring them avoids unnecessary HEAD requests.
Also restores vector index cache check in open_generic_index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>