refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229
refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229wjones127 wants to merge 9 commits intolance-format:mainfrom
Conversation
Implements a DeepSizeOf trait in lance-core that tracks both Arc pointers and raw buffer pointers in a unified HashSet, enabling accurate memory accounting for Arrow types with shared buffers across batches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the `deepsize` crate dependency with our own `DeepSizeOf` trait in `lance-core::deepsize` and a new `lance-derive` proc-macro crate for `#[derive(DeepSizeOf)]`. This gives us control over pointer tracking for accurate Arrow buffer deduplication in caches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uplication All DeepSizeOf implementations must use the provided context parameter to properly track Arc pointers and avoid double-counting shared allocations. Previously, some implementations were ignoring the context parameter, which broke the incremental size calculation used in debug_index_cache. Changes: - Fixed parameter names from `_context` to `context` where the parameter was being used - Fixed type casting for FixedSizeListArray references (removed unnecessary .as_ref() calls) - Ensured all deep_size_of_children implementations thread context through to child calls - Removed unused Array import from pq/storage.rs Files modified: - rust/lance-core/src/datatypes.rs - rust/lance-encoding/src/encodings/logical/primitive/blob.rs - rust/lance-index/src/scalar/btree/flat.rs - rust/lance-index/src/scalar/rtree.rs - rust/lance-index/src/vector/bq/storage.rs - rust/lance-index/src/vector/flat/storage.rs - rust/lance-index/src/vector/ivf/storage.rs - rust/lance-index/src/vector/pq.rs - rust/lance-index/src/vector/pq/storage.rs - rust/lance-index/src/vector/sq/storage.rs Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Add DeepSizeOf implementation for arrow_buffer::ScalarBuffer<T> that properly tracks buffer pointers in context to avoid double-counting shared allocations - Fix PlainPostingList and Positions to use deep_size_of_children with context instead of manual size calculations or get_buffer_memory_size - Cast ListArray to &dyn Array to access deep_size_of_children method This fixes the remaining instances where DeepSizeOf was ignoring the context parameter and using less efficient size calculation methods. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ReviewGood motivation — the external Issues to addressP0:
P1:
P1: In Minor
|
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ScalarBuffer: use inner buffer capacity() instead of len() * size_of::<T>() since sliced buffers retain their full original allocation - CompressedPostingList: use context-aware deep_size_of_children instead of get_buffer_memory_size(), which bypassed Arc/pointer deduplication - Add comment on dyn Array impl explaining to_data() cost is O(buffers), not O(data) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PQIndex::code and row_ids (Arc<PrimitiveArray>) and ResidualCheckMockIndex::ret_val (RecordBatch) were using get_array_memory_size(), bypassing context-based dedup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Conflicts resolved: - Cargo.toml: keep lance-derive dep, bump versions to 4.1.0-beta.0 - inverted/index.rs: use lance_core::deepsize::Context with upstream's CompressedPositionStorage variant matching logic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The external
deepsizecrate does not correctly account for Arrow buffersthat are shared across
Arcreferences, causing double-counting in cachesize calculations.
This PR introduces a custom
DeepSizeOftrait inlance-core::deepsizewith a
Contextthat tracks bothArcand raw buffer pointers in aunified
HashSet. It also adds alance-deriveproc-macro crate for#[derive(DeepSizeOf)]and removes the dependency on the externaldeepsizecrate.Changes:
lance-core::deepsizemodule with customDeepSizeOftrait and pointer-trackingContextlance-deriveproc-macro crate with#[derive(DeepSizeOf)]deepsizecrate from all crates; update all impls to uselance_core::deepsizedeep_size_of_childrenimplementations to threadcontextthrough (previously some ignored the parameter)DeepSizeOfforarrow_buffer::ScalarBuffer<T>with proper pointer trackingPlainPostingListandPositionsto usedeep_size_of_childrenwith context