
refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229

Open
wjones127 wants to merge 9 commits into lance-format:main from wjones127:feat/custom-deepsize-trait

Conversation

@wjones127
Contributor

The external deepsize crate does not correctly account for Arrow buffers
that are shared across Arc references, causing double-counting in cache
size calculations.

This PR introduces a custom DeepSizeOf trait in lance-core::deepsize
with a Context that tracks both Arc and raw buffer pointers in a
unified HashSet. It also adds a lance-derive proc-macro crate for
#[derive(DeepSizeOf)] and removes the dependency on the external
deepsize crate.

Changes:

  • Add lance-core::deepsize module with custom DeepSizeOf trait and pointer-tracking Context
  • Add lance-derive proc-macro crate with #[derive(DeepSizeOf)]
  • Remove external deepsize crate from all crates; update all impls to use lance_core::deepsize
  • Fix all deep_size_of_children implementations to thread context through (previously some ignored the parameter)
  • Add DeepSizeOf for arrow_buffer::ScalarBuffer<T> with proper pointer tracking
  • Fix PlainPostingList and Positions to use deep_size_of_children with context
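The trait and context described above could look roughly like this. This is a minimal, self-contained sketch using only std; the actual definitions in `lance-core::deepsize` may differ in names and details:

```rust
use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

/// Tracks pointers that have already been counted, so allocations shared
/// across Arc clones (or across Arrow buffers) are sized exactly once.
#[derive(Default)]
pub struct Context {
    seen: HashSet<usize>,
}

pub trait DeepSizeOf {
    /// Heap memory owned by children, deduplicated through the context.
    fn deep_size_of_children(&self, context: &mut Context) -> usize;

    /// Shallow size plus deduplicated child sizes.
    fn deep_size_of(&self) -> usize
    where
        Self: Sized,
    {
        size_of::<Self>() + self.deep_size_of_children(&mut Context::default())
    }
}

impl DeepSizeOf for Vec<u8> {
    fn deep_size_of_children(&self, _context: &mut Context) -> usize {
        self.capacity()
    }
}

impl<T: DeepSizeOf> DeepSizeOf for Arc<T> {
    fn deep_size_of_children(&self, context: &mut Context) -> usize {
        if context.seen.insert(Arc::as_ptr(self) as usize) {
            // First sighting: count the pointee and recurse into it.
            size_of::<T>() + self.as_ref().deep_size_of_children(context)
        } else {
            // Already counted via another clone of this Arc.
            0
        }
    }
}

/// Demo: two clones of the same 1 KiB buffer, measured through one shared
/// context, are counted once rather than twice.
pub fn shared_buffer_size() -> usize {
    let buf = Arc::new(vec![0u8; 1024]);
    let clone = buf.clone();
    let mut ctx = Context::default();
    buf.deep_size_of_children(&mut ctx) + clone.deep_size_of_children(&mut ctx)
}
```

With the external crate's behavior described above, each clone would have contributed the full buffer size; here the second clone contributes zero.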

wjones127 and others added 5 commits March 19, 2026 09:18
Implements a DeepSizeOf trait in lance-core that tracks both Arc pointers
and raw buffer pointers in a unified HashSet, enabling accurate memory
accounting for Arrow types with shared buffers across batches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the `deepsize` crate dependency with our own `DeepSizeOf` trait
in `lance-core::deepsize` and a new `lance-derive` proc-macro crate for
`#[derive(DeepSizeOf)]`. This gives us control over pointer tracking for
accurate Arrow buffer deduplication in caches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uplication

All DeepSizeOf implementations must use the provided context parameter to properly
track Arc pointers and avoid double-counting shared allocations. Previously, some
implementations were ignoring the context parameter, which broke the incremental
size calculation used in debug_index_cache.

Changes:
- Renamed `_context` parameters to `context` in implementations that actually use them
- Fixed type casting for FixedSizeListArray references (removed unnecessary .as_ref() calls)
- Ensured all deep_size_of_children implementations thread context through to child calls
- Removed unused Array import from pq/storage.rs

Files modified:
- rust/lance-core/src/datatypes.rs
- rust/lance-encoding/src/encodings/logical/primitive/blob.rs
- rust/lance-index/src/scalar/btree/flat.rs
- rust/lance-index/src/scalar/rtree.rs
- rust/lance-index/src/vector/bq/storage.rs
- rust/lance-index/src/vector/flat/storage.rs
- rust/lance-index/src/vector/ivf/storage.rs
- rust/lance-index/src/vector/pq.rs
- rust/lance-index/src/vector/pq/storage.rs
- rust/lance-index/src/vector/sq/storage.rs
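The class of bug this commit fixes can be illustrated with a self-contained sketch; the `Index` struct, the minimal `Context`, and the sizes are hypothetical stand-ins, not actual Lance code:

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Minimal stand-ins for the Context and trait described in this PR.
#[derive(Default)]
struct Context {
    seen: HashSet<usize>,
}

trait DeepSizeOf {
    fn deep_size_of_children(&self, context: &mut Context) -> usize;
}

impl DeepSizeOf for Vec<u8> {
    fn deep_size_of_children(&self, _context: &mut Context) -> usize {
        self.capacity()
    }
}

struct Index {
    data: Arc<Vec<u8>>,
}

impl DeepSizeOf for Index {
    // Buggy shape (what this commit fixed): ignore the caller's context, so
    // a buffer shared with another struct gets counted again on every visit:
    //
    //     fn deep_size_of_children(&self, _context: &mut Context) -> usize {
    //         self.data.capacity()
    //     }

    // Fixed shape: record the Arc pointer in the shared context first.
    fn deep_size_of_children(&self, context: &mut Context) -> usize {
        if context.seen.insert(Arc::as_ptr(&self.data) as usize) {
            self.data.deep_size_of_children(context)
        } else {
            0
        }
    }
}

/// Demo: two indexes sharing one 512-byte buffer report 512, not 1024.
fn two_indexes_sharing_one_buffer() -> usize {
    let buf = Arc::new(vec![0u8; 512]);
    let (a, b) = (Index { data: buf.clone() }, Index { data: buf });
    let mut ctx = Context::default();
    a.deep_size_of_children(&mut ctx) + b.deep_size_of_children(&mut ctx)
}
```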

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Add DeepSizeOf implementation for arrow_buffer::ScalarBuffer<T> that properly
  tracks buffer pointers in context to avoid double-counting shared allocations
- Fix PlainPostingList and Positions to use deep_size_of_children with context
  instead of manual size calculations or get_buffer_memory_size
- Cast ListArray to &dyn Array to access deep_size_of_children method

This fixes the remaining instances where DeepSizeOf was ignoring the context
parameter and using less efficient size calculation methods.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Review

Good motivation — the external deepsize crate doesn't track Arc/buffer pointer identity, so shared Arrow buffers get double-counted in cache size calculations. The custom trait with a pointer-tracking Context is the right approach.

Issues to address

P0: dyn Array::deep_size_of_children calls self.to_data() which clones all buffers

Array::to_data() can be expensive — it materializes an ArrayData by cloning Arc<Buffer>s and for some array types does non-trivial work. This is called every time you measure a column's size in a RecordBatch, which happens on every cache insertion/eviction. Consider using Array::get_array_memory_size() for the total and only using the Context for dedup at the Arc<dyn Array> / Arc<RecordBatch> level. Alternatively, use array.to_data() only once and cache the result, or see if ArrayData can be obtained without cloning (e.g. array.into_data() if you own it, or accessing internal data references).
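One way to realize this suggestion, sketched with a stand-in trait (`SizedArray` and `FakeArray` are hypothetical; arrow-rs itself provides `Array::get_array_memory_size()`, which is computed from existing metadata and does not clone buffers):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Stand-in for arrow's Array trait: a cheap total-size accessor analogous
// to Array::get_array_memory_size(), with no ArrayData materialization.
trait SizedArray {
    fn memory_size(&self) -> usize;
}

struct FakeArray {
    bytes: usize,
}

impl SizedArray for FakeArray {
    fn memory_size(&self) -> usize {
        self.bytes
    }
}

/// Dedup at the Arc<dyn Array> level: one HashSet probe plus one cheap size
/// call per distinct array, instead of cloning Arc<Buffer>s on every visit.
fn arc_array_size(arr: &Arc<dyn SizedArray>, seen: &mut HashSet<usize>) -> usize {
    // Cast the fat trait-object pointer down to a thin address for identity.
    let ptr = Arc::as_ptr(arr) as *const () as usize;
    if seen.insert(ptr) {
        arr.memory_size()
    } else {
        0
    }
}

/// Demo: the same array referenced from two batches is counted once.
fn demo() -> usize {
    let arr: Arc<dyn SizedArray> = Arc::new(FakeArray { bytes: 4096 });
    let same = arr.clone();
    let mut seen = HashSet::new();
    arc_array_size(&arr, &mut seen) + arc_array_size(&same, &mut seen)
}
```

The trade-off implied by the review: this dedups whole arrays shared by pointer, but cannot dedup a buffer shared between two distinct array values; the finer-grained buffer tracking is what made `to_data()` tempting in the first place.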

P1: CompressedPostingList still uses get_buffer_memory_size() without context tracking

CompressedPostingList::deep_size_of_children (rust/lance-index/src/scalar/inverted/index.rs) was only half-migrated — it still calls self.blocks.get_buffer_memory_size() and self.positions.as_ref().map(Array::get_buffer_memory_size) directly, bypassing the Context dedup. This means shared buffers in compressed posting lists can still be double-counted.

P1: ScalarBuffer reports len * size_of::<T>() instead of capacity

In deepsize.rs, ScalarBuffer::deep_size_of_children returns self.len() * size_of::<T>(), but the underlying Buffer may have been allocated with more capacity (e.g. after slicing). Other impls in the same file (like Vec) correctly use capacity. For consistency and accuracy, consider using the underlying buffer's capacity.
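Why capacity matters here, in a std-only analogy (`SliceBuf` is a hypothetical stand-in for a sliced `ScalarBuffer`, which keeps the whole backing allocation alive):

```rust
use std::mem::size_of;

/// A view into a larger allocation, like a sliced ScalarBuffer<u64>:
/// the slice exposes `len` elements but pins the entire backing Vec.
struct SliceBuf {
    data: Vec<u64>,
    len: usize,
}

impl SliceBuf {
    /// What the original impl reported: only the visible elements.
    fn visible_bytes(&self) -> usize {
        self.len * size_of::<u64>()
    }

    /// What the review recommends: the backing allocation's capacity.
    fn retained_bytes(&self) -> usize {
        self.data.capacity() * size_of::<u64>()
    }
}

/// Demo: a 5-element slice of a 1000-element buffer retains all 8000 bytes.
fn demo() -> (usize, usize) {
    let buf = SliceBuf {
        data: vec![0u64; 1000],
        len: 5,
    };
    (buf.visible_bytes(), buf.retained_bytes())
}
```

A cache that evicts based on `visible_bytes` would believe it holds 40 bytes while actually pinning 8000.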

Minor

  • The HashMap size estimate uses capacity * (size_of::<K>() + size_of::<V>() + 1), which is a reasonable approximation for hashbrown/Swiss tables but the actual overhead per bucket is closer to 1 byte of control metadata per slot (not per occupied entry). This is fine as an estimate since you're using capacity() which already accounts for the total slot count — just noting it's approximate.

  • No tests for the derive macro on enums or generics. Consider adding at least one test struct/enum in deepsize.rs tests that uses #[derive(DeepSizeOf)] to verify the proc macro generates correct code.
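The capacity-based estimate discussed in the first bullet can be written as follows (a sketch; the actual helper in deepsize.rs may differ):

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Estimate of a HashMap's heap footprint: every allocated slot pays for a
/// key, a value, and ~1 byte of Swiss-table control metadata, occupied or not.
fn hashmap_heap_estimate<K, V>(map: &HashMap<K, V>) -> usize {
    map.capacity() * (size_of::<K>() + size_of::<V>() + 1)
}

/// Demo: an empty map has no heap allocation; a populated one reports at
/// least one (K, V, control byte) triple per resident entry.
fn demo() -> (usize, usize) {
    let empty: HashMap<u64, u64> = HashMap::new();
    let mut map = HashMap::new();
    for i in 0u64..10 {
        map.insert(i, i);
    }
    (hashmap_heap_estimate(&empty), hashmap_heap_estimate(&map))
}
```

Note the approximation the review points out: std's `capacity()` reports usable entries (roughly 7/8 of raw slots in hashbrown), so the control-byte term slightly undercounts, which is acceptable for an estimate.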

wjones127 and others added 3 commits March 19, 2026 10:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ScalarBuffer: use inner buffer capacity() instead of len() * size_of::<T>()
  since sliced buffers retain their full original allocation
- CompressedPostingList: use context-aware deep_size_of_children instead of
  get_buffer_memory_size(), which bypassed Arc/pointer deduplication
- Add comment on dyn Array impl explaining to_data() cost is O(buffers), not O(data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PQIndex::code and row_ids (Arc<PrimitiveArray>) and ResidualCheckMockIndex::ret_val
(RecordBatch) were using get_array_memory_size(), bypassing context-based dedup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 17:44
Conflicts resolved:
- Cargo.toml: keep lance-derive dep, bump versions to 4.1.0-beta.0
- inverted/index.rs: use lance_core::deepsize::Context with upstream's
  CompressedPositionStorage variant matching logic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>