
refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229

Open
wjones127 wants to merge 9 commits into lance-format:main from wjones127:feat/custom-deepsize-trait

Conversation

@wjones127
Contributor

The external deepsize crate does not correctly account for Arrow buffers
that are shared across Arc references, causing double-counting in cache
size calculations.

This PR introduces a custom DeepSizeOf trait in lance-core::deepsize
with a Context that tracks both Arc and raw buffer pointers in a
unified HashSet. It also adds a lance-derive proc-macro crate for
#[derive(DeepSizeOf)] and removes the dependency on the external
deepsize crate.

Changes:

  • Add lance-core::deepsize module with custom DeepSizeOf trait and pointer-tracking Context
  • Add lance-derive proc-macro crate with #[derive(DeepSizeOf)]
  • Remove external deepsize crate from all crates; update all impls to use lance_core::deepsize
  • Fix all deep_size_of_children implementations to thread context through (previously some ignored the parameter)
  • Add DeepSizeOf for arrow_buffer::ScalarBuffer<T> with proper pointer tracking
  • Fix PlainPostingList and Positions to use deep_size_of_children with context
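The trait and context described above could look roughly like this. This is a minimal, self-contained sketch using only std; the actual definitions in `lance-core::deepsize` may differ in names and details:

```rust
use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

/// Tracks pointers that have already been counted, so allocations shared
/// across Arc clones (or across Arrow buffers) are sized exactly once.
#[derive(Default)]
pub struct Context {
    seen: HashSet<usize>,
}

pub trait DeepSizeOf {
    /// Heap memory owned by children, deduplicated through the context.
    fn deep_size_of_children(&self, context: &mut Context) -> usize;

    /// Shallow size plus deduplicated child sizes.
    fn deep_size_of(&self) -> usize
    where
        Self: Sized,
    {
        size_of::<Self>() + self.deep_size_of_children(&mut Context::default())
    }
}

impl DeepSizeOf for Vec<u8> {
    fn deep_size_of_children(&self, _context: &mut Context) -> usize {
        self.capacity()
    }
}

impl<T: DeepSizeOf> DeepSizeOf for Arc<T> {
    fn deep_size_of_children(&self, context: &mut Context) -> usize {
        if context.seen.insert(Arc::as_ptr(self) as usize) {
            // First sighting: count the pointee and recurse into it.
            size_of::<T>() + self.as_ref().deep_size_of_children(context)
        } else {
            // Already counted via another clone of this Arc.
            0
        }
    }
}

/// Demo: two clones of the same 1 KiB buffer, measured through one shared
/// context, are counted once rather than twice.
pub fn shared_buffer_size() -> usize {
    let buf = Arc::new(vec![0u8; 1024]);
    let clone = buf.clone();
    let mut ctx = Context::default();
    buf.deep_size_of_children(&mut ctx) + clone.deep_size_of_children(&mut ctx)
}
```

With the external crate's behavior described above, each clone would have contributed the full buffer size; here the second clone contributes zero.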

wjones127 and others added 5 commits March 19, 2026 09:18
Implements a DeepSizeOf trait in lance-core that tracks both Arc pointers
and raw buffer pointers in a unified HashSet, enabling accurate memory
accounting for Arrow types with shared buffers across batches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the `deepsize` crate dependency with our own `DeepSizeOf` trait
in `lance-core::deepsize` and a new `lance-derive` proc-macro crate for
`#[derive(DeepSizeOf)]`. This gives us control over pointer tracking for
accurate Arrow buffer deduplication in caches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uplication

All DeepSizeOf implementations must use the provided context parameter to properly
track Arc pointers and avoid double-counting shared allocations. Previously, some
implementations were ignoring the context parameter, which broke the incremental
size calculation used in debug_index_cache.

Changes:
- Renamed `_context` parameters to `context` in implementations that actually use them
- Fixed type casting for FixedSizeListArray references (removed unnecessary .as_ref() calls)
- Ensured all deep_size_of_children implementations thread context through to child calls
- Removed unused Array import from pq/storage.rs

Files modified:
- rust/lance-core/src/datatypes.rs
- rust/lance-encoding/src/encodings/logical/primitive/blob.rs
- rust/lance-index/src/scalar/btree/flat.rs
- rust/lance-index/src/scalar/rtree.rs
- rust/lance-index/src/vector/bq/storage.rs
- rust/lance-index/src/vector/flat/storage.rs
- rust/lance-index/src/vector/ivf/storage.rs
- rust/lance-index/src/vector/pq.rs
- rust/lance-index/src/vector/pq/storage.rs
- rust/lance-index/src/vector/sq/storage.rs
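The class of bug this commit fixes can be illustrated with a self-contained sketch; the `Index` struct, the minimal `Context`, and the sizes are hypothetical stand-ins, not actual Lance code:

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Minimal stand-ins for the Context and trait described in this PR.
#[derive(Default)]
struct Context {
    seen: HashSet<usize>,
}

trait DeepSizeOf {
    fn deep_size_of_children(&self, context: &mut Context) -> usize;
}

impl DeepSizeOf for Vec<u8> {
    fn deep_size_of_children(&self, _context: &mut Context) -> usize {
        self.capacity()
    }
}

struct Index {
    data: Arc<Vec<u8>>,
}

impl DeepSizeOf for Index {
    // Buggy shape (what this commit fixed): ignore the caller's context, so
    // a buffer shared with another struct gets counted again on every visit:
    //
    //     fn deep_size_of_children(&self, _context: &mut Context) -> usize {
    //         self.data.capacity()
    //     }

    // Fixed shape: record the Arc pointer in the shared context first.
    fn deep_size_of_children(&self, context: &mut Context) -> usize {
        if context.seen.insert(Arc::as_ptr(&self.data) as usize) {
            self.data.deep_size_of_children(context)
        } else {
            0
        }
    }
}

/// Demo: two indexes sharing one 512-byte buffer report 512, not 1024.
fn two_indexes_sharing_one_buffer() -> usize {
    let buf = Arc::new(vec![0u8; 512]);
    let (a, b) = (Index { data: buf.clone() }, Index { data: buf });
    let mut ctx = Context::default();
    a.deep_size_of_children(&mut ctx) + b.deep_size_of_children(&mut ctx)
}
```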

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Add DeepSizeOf implementation for arrow_buffer::ScalarBuffer<T> that properly
  tracks buffer pointers in context to avoid double-counting shared allocations
- Fix PlainPostingList and Positions to use deep_size_of_children with context
  instead of manual size calculations or get_buffer_memory_size
- Cast ListArray to &dyn Array to access deep_size_of_children method

This fixes the remaining instances where DeepSizeOf was ignoring the context
parameter and using less efficient size calculation methods.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Review

Good motivation — the external deepsize crate doesn't track Arc/buffer pointer identity, so shared Arrow buffers get double-counted in cache size calculations. The custom trait with a pointer-tracking Context is the right approach.

Issues to address

P0: dyn Array::deep_size_of_children calls self.to_data() which clones all buffers

Array::to_data() can be expensive — it materializes an ArrayData by cloning Arc<Buffer>s and for some array types does non-trivial work. This is called every time you measure a column's size in a RecordBatch, which happens on every cache insertion/eviction. Consider using Array::get_array_memory_size() for the total and only using the Context for dedup at the Arc<dyn Array> / Arc<RecordBatch> level. Alternatively, use array.to_data() only once and cache the result, or see if ArrayData can be obtained without cloning (e.g. array.into_data() if you own it, or accessing internal data references).
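One way to realize this suggestion, sketched with a stand-in trait (`SizedArray` and `FakeArray` are hypothetical; arrow-rs itself provides `Array::get_array_memory_size()`, which is computed from existing metadata and does not clone buffers):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Stand-in for arrow's Array trait: a cheap total-size accessor analogous
// to Array::get_array_memory_size(), with no ArrayData materialization.
trait SizedArray {
    fn memory_size(&self) -> usize;
}

struct FakeArray {
    bytes: usize,
}

impl SizedArray for FakeArray {
    fn memory_size(&self) -> usize {
        self.bytes
    }
}

/// Dedup at the Arc<dyn Array> level: one HashSet probe plus one cheap size
/// call per distinct array, instead of cloning Arc<Buffer>s on every visit.
fn arc_array_size(arr: &Arc<dyn SizedArray>, seen: &mut HashSet<usize>) -> usize {
    // Cast the fat trait-object pointer down to a thin address for identity.
    let ptr = Arc::as_ptr(arr) as *const () as usize;
    if seen.insert(ptr) {
        arr.memory_size()
    } else {
        0
    }
}

/// Demo: the same array referenced from two batches is counted once.
fn demo() -> usize {
    let arr: Arc<dyn SizedArray> = Arc::new(FakeArray { bytes: 4096 });
    let same = arr.clone();
    let mut seen = HashSet::new();
    arc_array_size(&arr, &mut seen) + arc_array_size(&same, &mut seen)
}
```

The trade-off implied by the review: this dedups whole arrays shared by pointer, but cannot dedup a buffer shared between two distinct array values; the finer-grained buffer tracking is what made `to_data()` tempting in the first place.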

P1: CompressedPostingList still uses get_buffer_memory_size() without context tracking

CompressedPostingList::deep_size_of_children (rust/lance-index/src/scalar/inverted/index.rs) was only half-migrated — it still calls self.blocks.get_buffer_memory_size() and self.positions.as_ref().map(Array::get_buffer_memory_size) directly, bypassing the Context dedup. This means shared buffers in compressed posting lists can still be double-counted.

P1: ScalarBuffer reports len * size_of::<T>() instead of capacity

In deepsize.rs, ScalarBuffer::deep_size_of_children returns self.len() * size_of::<T>(), but the underlying Buffer may have been allocated with more capacity (e.g. after slicing). Other impls in the same file (like Vec) correctly use capacity. For consistency and accuracy, consider using the underlying buffer's capacity.
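Why capacity matters here, in a std-only analogy (`SliceBuf` is a hypothetical stand-in for a sliced `ScalarBuffer`, which keeps the whole backing allocation alive):

```rust
use std::mem::size_of;

/// A view into a larger allocation, like a sliced ScalarBuffer<u64>:
/// the slice exposes `len` elements but pins the entire backing Vec.
struct SliceBuf {
    data: Vec<u64>,
    len: usize,
}

impl SliceBuf {
    /// What the original impl reported: only the visible elements.
    fn visible_bytes(&self) -> usize {
        self.len * size_of::<u64>()
    }

    /// What the review recommends: the backing allocation's capacity.
    fn retained_bytes(&self) -> usize {
        self.data.capacity() * size_of::<u64>()
    }
}

/// Demo: a 5-element slice of a 1000-element buffer retains all 8000 bytes.
fn demo() -> (usize, usize) {
    let buf = SliceBuf {
        data: vec![0u64; 1000],
        len: 5,
    };
    (buf.visible_bytes(), buf.retained_bytes())
}
```

A cache that evicts based on `visible_bytes` would believe it holds 40 bytes while actually pinning 8000.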

Minor

  • The HashMap size estimate uses capacity * (size_of::<K>() + size_of::<V>() + 1), which is a reasonable approximation for hashbrown/Swiss tables but the actual overhead per bucket is closer to 1 byte of control metadata per slot (not per occupied entry). This is fine as an estimate since you're using capacity() which already accounts for the total slot count — just noting it's approximate.

  • No tests for the derive macro on enums or generics. Consider adding at least one test struct/enum in deepsize.rs tests that uses #[derive(DeepSizeOf)] to verify the proc macro generates correct code.
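The capacity-based estimate discussed in the first bullet can be written as follows (a sketch; the actual helper in deepsize.rs may differ):

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Estimate of a HashMap's heap footprint: every allocated slot pays for a
/// key, a value, and ~1 byte of Swiss-table control metadata, occupied or not.
fn hashmap_heap_estimate<K, V>(map: &HashMap<K, V>) -> usize {
    map.capacity() * (size_of::<K>() + size_of::<V>() + 1)
}

/// Demo: an empty map has no heap allocation; a populated one reports at
/// least one (K, V, control byte) triple per resident entry.
fn demo() -> (usize, usize) {
    let empty: HashMap<u64, u64> = HashMap::new();
    let mut map = HashMap::new();
    for i in 0u64..10 {
        map.insert(i, i);
    }
    (hashmap_heap_estimate(&empty), hashmap_heap_estimate(&map))
}
```

Note the approximation the review points out: std's `capacity()` reports usable entries (roughly 7/8 of raw slots in hashbrown), so the control-byte term slightly undercounts, which is acceptable for an estimate.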

wjones127 and others added 3 commits March 19, 2026 10:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ScalarBuffer: use inner buffer capacity() instead of len() * size_of::<T>()
  since sliced buffers retain their full original allocation
- CompressedPostingList: use context-aware deep_size_of_children instead of
  get_buffer_memory_size(), which bypassed Arc/pointer deduplication
- Add comment on dyn Array impl explaining to_data() cost is O(buffers), not O(data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PQIndex::code and row_ids (Arc<PrimitiveArray>) and ResidualCheckMockIndex::ret_val
(RecordBatch) were using get_array_memory_size(), bypassing context-based dedup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 17:44
Conflicts resolved:
- Cargo.toml: keep lance-derive dep, bump versions to 4.1.0-beta.0
- inverted/index.rs: use lance_core::deepsize::Context with upstream's
  CompressedPositionStorage variant matching logic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>