Will this feature work for functional and multikey indexes? Please add this information to the RFC.
-
I think it is enough to have
The other variants (per space, etc) are overkill. PS: the second option is optional: a user could remove
-
Closing as approved for development.
-
## The problem

The issue is described in #10847: since we save tuples in the snapshot in PK order, we have to sort them to build each secondary key (and the sorting process has `n*log(n)` complexity). Let's save the order of secondary keys in the snapshot to reduce the build complexity to `O(n)` (this, in theory, should speed up the recovery of secondary keys).

## The algorithm
As the issue suggests, let's write the order of secondary indexes to the snapshot. The algorithm is:

1. On checkpoint, save the tuple addresses of the PK (in PK order) and of every secondary index (in that index's order).
2. On recovery, while loading the primary index we learn the new address of every tuple. Say the saved PK addresses were `[0x01, 0x02, 0x03]` and the freshly loaded tuples got the addresses `[0xa3, 0xb2, 0xc1]`. Then we know that the first tuple in the primary index had address `0x01` and its new address is `0xa3`, the second one had `0x02` and the new address is `0xb2`, and so on. So we can easily build the old-to-new address mapping.
3. To build a secondary index, take its saved order, e.g. `[0xffff00001111, 0xffff00003333, 0xffff00002222]`, translate it through the mapping, and now it becomes `[0xffff0000aaaa, 0xffff0000cccc, 0xffff0000bbbb]` - an array of actual tuples in the required order that can be fed to the index build without any sorting (see the sketch below).
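To make the remapping concrete, here is a minimal sketch in C. It is illustrative only: the container, the function names and the way the addresses reach the loader are assumptions, not the actual memtx code.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * One entry of the old-to-new tuple address map: the address a tuple
 * had when the snapshot was written and the address it got after
 * being loaded into memory again.
 */
struct addr_pair {
	uintptr_t old_addr;
	uintptr_t new_addr;
};

static int
addr_pair_cmp(const void *a, const void *b)
{
	const struct addr_pair *pa = a, *pb = b;
	return pa->old_addr < pb->old_addr ? -1 : pa->old_addr > pb->old_addr;
}

/*
 * Build the map from two parallel arrays: old_pk[i] is the address the
 * i-th PK tuple had at checkpoint time, new_pk[i] is its address now.
 * Error handling is omitted for brevity.
 */
static struct addr_pair *
addr_map_build(const uintptr_t *old_pk, const uintptr_t *new_pk, size_t count)
{
	struct addr_pair *map = malloc(count * sizeof(*map));
	for (size_t i = 0; i < count; i++)
		map[i] = (struct addr_pair){old_pk[i], new_pk[i]};
	/* Sort by old address so single lookups are O(log n). */
	qsort(map, count, sizeof(*map), addr_pair_cmp);
	return map;
}

/*
 * Translate an SK sort data array (old addresses in SK order) into an
 * array of current tuple addresses, still in SK order. The result can
 * be passed to the index build routine directly, no sorting needed.
 */
static void
sk_order_remap(uintptr_t *sk_order, size_t count,
	       const struct addr_pair *map, size_t map_count)
{
	for (size_t i = 0; i < count; i++) {
		struct addr_pair key = {sk_order[i], 0};
		struct addr_pair *found = bsearch(&key, map, map_count,
						  sizeof(*map), addr_pair_cmp);
		sk_order[i] = found->new_addr; /* Sort data assumed consistent. */
	}
}
```

A real implementation would rather use a hash table (or exploit the fact that the old addresses arrive in PK load order) so that lookups stay O(1) and the whole SK build remains O(n); the sorted array above just keeps the sketch short.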
## Extra: hints

If the index we're loading the new way is a TREE with hints enabled, then it's not reasonable to use this approach unless we save the tuple hints along with the tuple pointers in the index sort data files. The reason is the CPU cache effects described below.

Details on the cache effects. Currently we perform the following steps on recovery: we fill the SK build array while walking the tuples in PK order (so the tuple data needed to calculate the hints is accessed more or less sequentially) and then use `tt_sort` to reorder the array the way it should be in the SK. But if we use the new approach, we insert the index data into the build array right in SK order. That means, since the tuples are located in memory in PK order, calculating the hint of each tuple inserted into the SK requires a lot of random memory accesses. This destroys the gain we received from the O(n) build array generation: in our PoC we had 10 seconds wasted in a single data load instruction.

So the approach is useless for indexes with hints unless we save the hints in the same index sort data file, so that we don't have to calculate them at runtime.
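A minimal sketch of what a hint-aware sort data record could look like; the struct below is an illustrative assumption, not the final on-disk format (the 64-bit hint matches the existing memtx hint type).

```c
#include <stdint.h>

/* Stand-in for the existing 64-bit memtx comparison hint. */
typedef uint64_t hint_t;

/*
 * One sort data record of a hinted (or multikey) TREE index: the tuple
 * pointer as it was at checkpoint time plus the precomputed hint, so
 * that recovery never has to touch the tuple data to rebuild the SK.
 * Plain (unhinted) TREE indexes would store only the 8-byte pointer.
 */
struct sk_sort_data_entry {
	uint64_t old_tuple_addr; /* 8 bytes, remapped on recovery. */
	hint_t hint;             /* 8 bytes, written into the index as-is. */
};
```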
## Extra: multikey indexes

Multikey indexes have the same problem, since we have to access the tuple data to decide whether a particular copy of the tuple should be included in the index (due to the `exclude_null` option). But also, the hints are required to specify the tuple order in the index: the tuple pointer by itself does not tell which multikey member of the tuple is meant to be located at a given position. So the multikey hints are to be saved in the sort data files too.

## Extra: functional keys

Here we have the same problem as for multikey and regular indexes with hints, but this one does not seem to have a solution, so let's disable the feature for functional indexes entirely.
## Summary: the data to be stored

- For the PK: the tuple pointers as of checkpoint time (8 bytes per tuple), needed to build the old-to-new address map on recovery.
- For every TREE SK: the tuple pointers in index order (8 bytes per tuple in the SK), plus the precomputed hints (8 more bytes per tuple) for hinted and multikey indexes.
- Nothing for functional indexes: the feature is disabled for them.

## Implementation details

Here's the proposed sequence of events: during checkpointing the sort data is written alongside the snapshot; on recovery, if the sort data is present and applicable, the old-to-new address map is filled while the PK is loaded and the secondary indexes are then built directly from their saved order; otherwise the secondary keys are built the old way.

The sort data must only be used on non-system spaces and only if they have no `before_replace` triggers and no `force_recovery` is specified (system spaces are less likely to benefit from it). Also, it can only be safely used during recovery if the `_index` space has no `before_replace` triggers registered, because otherwise such a trigger could change the index we saved the sort data for, so that the data is not applicable to it anymore.
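A sketch of how the applicability check above could look; the predicate names are hypothetical stand-ins for whatever the real space/trigger code exposes:

```c
#include <stdbool.h>

struct space;

/* Stand-in helpers: the real predicates live in the space/trigger code. */
bool space_is_system_space(const struct space *space);
bool space_has_before_replace_triggers(const struct space *space);
bool index_space_has_before_replace_triggers(void); /* triggers on _index */

/*
 * Decide whether the saved sort data may be used to build the secondary
 * keys of this space, per the constraints above.
 */
static bool
memtx_sort_data_applicable(const struct space *space, bool force_recovery)
{
	if (force_recovery)
		return false;
	if (space_is_system_space(space))
		return false;
	if (space_has_before_replace_triggers(space))
		return false;
	/* A _index trigger could alter index definitions during recovery. */
	if (index_space_has_before_replace_triggers())
		return false;
	return true;
}
```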
## The storage

Let's store the sort data (a binary sequence of 8-byte pointers[ and hints]) in a `<vclock_signature>.sortdata` file. The file is created along with the regular snapshot file in `memtx_engine_begin_checkpoint`, but only for TREE indexes. The structure: a plain-text header with one line per index, followed by the binary sort data itself. More thoroughly, a header line like `512/1: 0x0000000000000041, 0x0000000000000526, 00000000000000001536, gz, 0x0000000000003000\n` describes the sort data of one index (space 512, index 1) and can specify, e.g., the compression algorithm and the original size.
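For illustration, a minimal sketch of the checkpoint-side writer under the stated assumptions: it walks a secondary index read view in index order and appends the 8-byte tuple pointers (and the hints, when the index needs them) to the `.sortdata` file. The entry type and the way entries are obtained are stand-ins, not the real read view API.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for one read view position of a secondary index. */
struct sk_read_view_entry {
	uint64_t tuple_addr; /* Address of the tuple at checkpoint time. */
	uint64_t hint;       /* Precomputed hint (hinted/multikey indexes). */
};

/*
 * Append the sort data of one secondary index to the .sortdata file:
 * a flat binary sequence of 8-byte pointers, optionally interleaved
 * with 8-byte hints, in index order.
 */
static int
sortdata_write_index(FILE *f, const struct sk_read_view_entry *entries,
		     size_t count, bool with_hints)
{
	for (size_t i = 0; i < count; i++) {
		if (fwrite(&entries[i].tuple_addr, sizeof(uint64_t), 1, f) != 1)
			return -1;
		if (with_hints &&
		    fwrite(&entries[i].hint, sizeof(uint64_t), 1, f) != 1)
			return -1;
	}
	return 0;
}
```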
Alternatives considered:

- Save the information in the snapshot file metadata: this way we're limited by the max metadata header size, which is pretty small (~2KB as of ef3775a).
- Save the information as one of the entries in the snapshot file: we can't create a new fixheader type since it's strictly checked. A new request type can't be used in it either, because memtx only accepts inserts (along with RAFT stuff) there.

## A system blackhole space
The idea is to insert the information into a system blackhole space at the end of the snapshot. If the snapshot does not contain the space, then the recovery is performed the old way (compatibility with old snapshots). Don't write the space on `box.snapshot()` after the downgrade (for backward compatibility). The blackhole space tuple format:

- `MP_BOOL` - `true` if the tuple is the last one for the given index;
- `MP_UINT` - the space ID;
- `MP_UINT` - the index ID;
- `MP_STRING` - a chunk of the sort data itself.
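A sketch of how one such carrier tuple could be encoded with msgpuck, assuming the field order listed above (the helper name and the buffer management are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include "msgpuck.h"

/*
 * Encode one sort data carrier tuple for the blackhole space:
 * [is_last, space_id, index_id, data_chunk]. The caller must have
 * reserved enough buffer space (see mp_sizeof_*). Returns the position
 * right after the encoded tuple.
 */
static char *
sort_data_tuple_encode(char *pos, bool is_last, uint32_t space_id,
		       uint32_t index_id, const char *chunk, uint32_t chunk_len)
{
	pos = mp_encode_array(pos, 4);
	pos = mp_encode_bool(pos, is_last);
	pos = mp_encode_uint(pos, space_id);
	pos = mp_encode_uint(pos, index_id);
	pos = mp_encode_str(pos, chunk, chunk_len);
	return pos;
}
```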
The SK sort data of a particular index consists of a number of such tuples; this is required to reduce the tuple arena usage. The last of the tuples has the `is_last` flag set to `true` to mark the point where the index has all the information required and can be built using the new approach.

Since the space with the sort data is filled at the end of the snapshot, the old tuple addresses must be saved in some other way, so that we can create the old-to-new address map during PK recovery. Let's save the PK tuple pointers in the snapshot right inside the headers of `INSERT` entries (right next to the timestamp).
Alternative considered: we could create another space for such data and write it before the user spaces, but that would require extra RAM to keep the PK tuple addresses around until we get to build the space the information is intended for. So we'd better store the information along with the tuples we load and forget it right after use.
Comparison to option 0:
➖ The sort data is built into the snapshot, so the latter can't be moved (backed up) separately.
➖ The sort data is first instantiated into a tuple and then provided to indexes (extra indirection).
➖ No nice way to send the data to replicas unless we generate it from scratch or hack the xlog reader.
➖ Additional ~1 byte per PK tuple of persistent storage.
## Getting tuple pointers from the read view

It would be nice (although probably not necessary; this needs to be measured) to be able to receive both the tuple data and the pointer to the raw tuple from the index in one step. So let's return the tuple pointer along with the data in `struct read_view_tuple` (add a new `struct tuple *ptr` field).

Alternative considered: introduce a new `struct tuple **ptr` output parameter to the `index_read_view_iterator_base::next_raw` callback.
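A sketch of the proposed extension; the existing fields shown are a simplification of the real `struct read_view_tuple`, the only addition is the `ptr` field:

```c
#include <stdint.h>

struct tuple;

/*
 * Result of one read view iteration step. The data/size pair is what
 * the checkpoint writer already consumes; the new ptr field would let
 * the sort data writer record the tuple address without extra lookups.
 */
struct read_view_tuple {
	/* Position and size of the raw tuple data (existing behavior). */
	const char *data;
	uint32_t size;
	/* NEW: pointer to the raw tuple the data belongs to. */
	struct tuple *ptr;
};
```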
## The configuration

The approach only makes sense if the database has many secondary keys of TREE type and a performant persistent storage, because it increases the initial recovery time and relies on fast direct reads of the sort data. So a new boolean `memtx_sort_data_enabled` (or `memtx.sort_data_enabled`) configuration option is proposed to specify whether to use the new approach for building secondary keys and writing the snapshot.

The configuration variable is to be changeable at runtime. If one does not want to use the sort data on recovery but wants to write it during snapshotting, it can be set to `false` initially and then reconfigured to `true` prior to `box.snapshot()`. It works the opposite way too.
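A rough sketch of how the engine side could consult such a runtime-changeable flag independently at checkpoint and at recovery time; the structure and function names are illustrative, not the final API:

```c
#include <stdbool.h>

/* Illustrative subset of the engine state. */
struct memtx_engine {
	/* Mirrors the memtx_sort_data_enabled configuration option. */
	bool sort_data_enabled;
};

/*
 * The option is dynamic, so checkpoint and recovery consult it
 * independently: one can write the sort data without using it on the
 * next recovery, or the other way around.
 */
static bool
memtx_should_write_sort_data(const struct memtx_engine *memtx)
{
	return memtx->sort_data_enabled;
}

static bool
memtx_should_use_sort_data(const struct memtx_engine *memtx,
			   bool sort_data_file_present)
{
	return memtx->sort_data_enabled && sort_data_file_present;
}
```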
## Expected results

### Checkpointing

➖ Additional data to be written to the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK.
➖ Additional RAM required if the secondary keys are changed during checkpointing (now we only take a read view of the PK, but we will have to take read views of the SKs too).

### Recovery

➖ Additional data to be read from the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK.
➖ The old-to-new tuple address map has to be created and filled (more RAM and CPU time required).
➕ The build of secondary indexes is significantly sped up.
## Practical results (storage option 0, as simple binary files)

### Configuration

- CPU: Zen 5, 8 dedicated cores
- Storage: NVMe (7000 MB/s read, 5000 MB/s write)
- Schema: tuples with 2 random unsigned fields; the PK is on part 1 and the SK on part 2.
### Checkpointing

0.119GB / 0.119GB
0.771GB / 0.771GB
7.305GB / 7.305GB
72.623GB / 72.623GB
### Recovery

One row per dataset size, smallest to largest; the last column is the proposed sort data approach.

| Dataset | tt_sort, 1 thread | tt_sort, 2 threads | sort data |
| --- | --- | --- | --- |
| 1 | Initial recovery: 0.302s<br>SK build: 0.098s<br>Mem: 0.131GB | Initial recovery: 0.314s<br>SK build: 0.061s<br>Mem: 0.131GB | Initial recovery: 0.357s<br>SK build: 0.043s<br>Mem: 0.139GB |
| 2 | Initial recovery: 2.266s<br>SK build: 1.104s<br>Mem: 0.852GB | Initial recovery: 2.281s<br>SK build: 0.699s<br>Mem: 0.852GB | Initial recovery: 2.666s<br>SK build: 0.224s<br>Mem: 1.209GB |
| 3 | Initial recovery: 21.776s<br>SK build: 13.644s<br>Mem: 8.121GB | Initial recovery: 21.694s<br>SK build: 7.806s<br>Mem: 8.120GB | Initial recovery: 25.619s<br>SK build: 3.301s<br>Mem: 11.312GB |
| 4 | Initial recovery: 3m19s<br>SK build: 3m49s<br>Mem: 80.897GB | Initial recovery: 3m36s<br>SK build: 1m57s<br>Mem: 80.897GB | Initial recovery: 4m10s<br>SK build: 46s<br>Mem: 101.628GB |
### Perf details

- Memory overhead (chart)
- SK build time (chart)
- Initial recovery overhead (chart)