Will this feature work for functional and multikey indexes? Please add this information to the RFC.
-
I think it is enough to have
The other variants (per space, etc) are overkill. PS: the second option is optional: a user could remove
-
Closing as approved for development.
-
## The problem

The issue is described in #10847: since we save tuples in the snapshot in PK order, we have to sort them to build each secondary key (and the sorting process has `n*log(n)` complexity). Let's save the order of secondary keys in the snapshot to reduce the build complexity to `O(n)` (this, in theory, should speed up the recovery of secondary keys).

## The algorithm
As the issue suggests, let's write the order of secondary indexes to the snapshot. The algorithm is:

1. On checkpoint, save the tuple addresses of the PK (in PK order) and of every secondary index (in that index's order).
2. On recovery, while loading the primary index we learn the new address of every tuple. Say the saved PK addresses were `[0x01, 0x02, 0x03]` and the freshly loaded tuples got the addresses `[0xa3, 0xb2, 0xc1]`. Then we know that the first tuple in the primary index had address `0x01` and its new address is `0xa3`, the second one had `0x02` and the new address is `0xb2`, and so on. So we can easily build the old-to-new address mapping.
3. To build a secondary index, take its saved order, e.g. `[0xffff00001111, 0xffff00003333, 0xffff00002222]`, translate it through the mapping, and now it becomes `[0xffff0000aaaa, 0xffff0000cccc, 0xffff0000bbbb]` - an array of actual tuples in the required order that can be fed to the index build without any sorting (see the sketch below).
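To make the remapping concrete, here is a minimal sketch in C. It is illustrative only: the container, the function names and the way the addresses reach the loader are assumptions, not the actual memtx code.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * One entry of the old-to-new tuple address map: the address a tuple
 * had when the snapshot was written and the address it got after
 * being loaded into memory again.
 */
struct addr_pair {
	uintptr_t old_addr;
	uintptr_t new_addr;
};

static int
addr_pair_cmp(const void *a, const void *b)
{
	const struct addr_pair *pa = a, *pb = b;
	return pa->old_addr < pb->old_addr ? -1 : pa->old_addr > pb->old_addr;
}

/*
 * Build the map from two parallel arrays: old_pk[i] is the address the
 * i-th PK tuple had at checkpoint time, new_pk[i] is its address now.
 * Error handling is omitted for brevity.
 */
static struct addr_pair *
addr_map_build(const uintptr_t *old_pk, const uintptr_t *new_pk, size_t count)
{
	struct addr_pair *map = malloc(count * sizeof(*map));
	for (size_t i = 0; i < count; i++)
		map[i] = (struct addr_pair){old_pk[i], new_pk[i]};
	/* Sort by old address so single lookups are O(log n). */
	qsort(map, count, sizeof(*map), addr_pair_cmp);
	return map;
}

/*
 * Translate an SK sort data array (old addresses in SK order) into an
 * array of current tuple addresses, still in SK order. The result can
 * be passed to the index build routine directly, no sorting needed.
 */
static void
sk_order_remap(uintptr_t *sk_order, size_t count,
	       const struct addr_pair *map, size_t map_count)
{
	for (size_t i = 0; i < count; i++) {
		struct addr_pair key = {sk_order[i], 0};
		struct addr_pair *found = bsearch(&key, map, map_count,
						  sizeof(*map), addr_pair_cmp);
		sk_order[i] = found->new_addr; /* Sort data assumed consistent. */
	}
}
```

A real implementation would rather use a hash table (or exploit the fact that the old addresses arrive in PK load order) so that lookups stay O(1) and the whole SK build remains O(n); the sorted array above just keeps the sketch short.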
## Extra: hints

If the index we're loading the new way is a TREE with hints enabled, then it's not reasonable to use this approach unless we save the tuple hints along with the tuple pointers in the index sort data files. The reason is the CPU cache effects described below.

Details on the cache effects. Currently we perform the following steps on recovery: we fill the SK build array while walking the tuples in PK order (so the tuple data needed to calculate the hints is accessed more or less sequentially) and then use `tt_sort` to reorder the array the way it should be in the SK. But if we use the new approach, we insert the index data into the build array right in SK order. That means, since the tuples are located in memory in PK order, calculating the hint of each tuple inserted into the SK requires a lot of random memory accesses. This destroys the gain we received from the O(n) build array generation: in our PoC we had 10 seconds wasted in a single data load instruction.

So the approach is useless for indexes with hints unless we save the hints in the same index sort data file, so that we don't have to calculate them at runtime.
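A minimal sketch of what a hint-aware sort data record could look like; the struct below is an illustrative assumption, not the final on-disk format (the 64-bit hint matches the existing memtx hint type).

```c
#include <stdint.h>

/* Stand-in for the existing 64-bit memtx comparison hint. */
typedef uint64_t hint_t;

/*
 * One sort data record of a hinted (or multikey) TREE index: the tuple
 * pointer as it was at checkpoint time plus the precomputed hint, so
 * that recovery never has to touch the tuple data to rebuild the SK.
 * Plain (unhinted) TREE indexes would store only the 8-byte pointer.
 */
struct sk_sort_data_entry {
	uint64_t old_tuple_addr; /* 8 bytes, remapped on recovery. */
	hint_t hint;             /* 8 bytes, written into the index as-is. */
};
```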
## Extra: multikey indexes

Multikey indexes have the same problem, since we have to access the tuple data to decide whether a particular copy of the tuple should be included in the index (due to the `exclude_null` option). But also, the hints are required to specify the tuple order in the index: the tuple pointer by itself does not tell which multikey member of the tuple is meant to be located at a given position. So the multikey hints are to be saved in the sort data files too.

## Extra: functional keys

Here we have the same problem as for multikey and regular indexes with hints, but this one does not seem to have a solution, so let's disable the feature for functional indexes entirely.
## Summary: the data to be stored

- For the PK: the tuple pointers as of checkpoint time (8 bytes per tuple), needed to build the old-to-new address map on recovery.
- For every TREE SK: the tuple pointers in index order (8 bytes per tuple in the SK), plus the precomputed hints (8 more bytes per tuple) for hinted and multikey indexes.
- Nothing for functional indexes: the feature is disabled for them.

## Implementation details

Here's the proposed sequence of events: during checkpointing the sort data is written alongside the snapshot; on recovery, if the sort data is present and applicable, the old-to-new address map is filled while the PK is loaded and the secondary indexes are then built directly from their saved order; otherwise the secondary keys are built the old way.

The sort data must only be used on non-system spaces and only if they have no `before_replace` triggers and no `force_recovery` is specified (system spaces are less likely to benefit from it). Also, it can only be safely used during recovery if the `_index` space has no `before_replace` triggers registered, because otherwise such a trigger could change the index we saved the sort data for, so that the data is not applicable to it anymore.
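A sketch of how the applicability check above could look; the predicate names are hypothetical stand-ins for whatever the real space/trigger code exposes:

```c
#include <stdbool.h>

struct space;

/* Stand-in helpers: the real predicates live in the space/trigger code. */
bool space_is_system_space(const struct space *space);
bool space_has_before_replace_triggers(const struct space *space);
bool index_space_has_before_replace_triggers(void); /* triggers on _index */

/*
 * Decide whether the saved sort data may be used to build the secondary
 * keys of this space, per the constraints above.
 */
static bool
memtx_sort_data_applicable(const struct space *space, bool force_recovery)
{
	if (force_recovery)
		return false;
	if (space_is_system_space(space))
		return false;
	if (space_has_before_replace_triggers(space))
		return false;
	/* A _index trigger could alter index definitions during recovery. */
	if (index_space_has_before_replace_triggers())
		return false;
	return true;
}
```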
## The storage

Let's store the sort data (a binary sequence of 8-byte pointers[ and hints]) in a `<vclock_signature>.sortdata` file. The file is created along with the regular snapshot file in `memtx_engine_begin_checkpoint`, but only for TREE indexes. The structure: a plain-text header with one line per index, followed by the binary sort data itself. More thoroughly, a header line like `512/1: 0x0000000000000041, 0x0000000000000526, 00000000000000001536, gz, 0x0000000000003000\n` describes the sort data of one index (space 512, index 1) and can specify, e.g., the compression algorithm and the original size.
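For illustration, a minimal sketch of the checkpoint-side writer under the stated assumptions: it walks a secondary index read view in index order and appends the 8-byte tuple pointers (and the hints, when the index needs them) to the `.sortdata` file. The entry type and the way entries are obtained are stand-ins, not the real read view API.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for one read view position of a secondary index. */
struct sk_read_view_entry {
	uint64_t tuple_addr; /* Address of the tuple at checkpoint time. */
	uint64_t hint;       /* Precomputed hint (hinted/multikey indexes). */
};

/*
 * Append the sort data of one secondary index to the .sortdata file:
 * a flat binary sequence of 8-byte pointers, optionally interleaved
 * with 8-byte hints, in index order.
 */
static int
sortdata_write_index(FILE *f, const struct sk_read_view_entry *entries,
		     size_t count, bool with_hints)
{
	for (size_t i = 0; i < count; i++) {
		if (fwrite(&entries[i].tuple_addr, sizeof(uint64_t), 1, f) != 1)
			return -1;
		if (with_hints &&
		    fwrite(&entries[i].hint, sizeof(uint64_t), 1, f) != 1)
			return -1;
	}
	return 0;
}
```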
Alternatives considered:

- Save the information in the snapshot file metadata: this way we're limited by the max metadata header size, which is pretty small (~2KB as of ef3775a).
- Save the information as one of the entries in the snapshot file: we can't create a new fixheader type since it's strictly checked. A new request type can't be used in it either, because memtx only accepts inserts (along with RAFT stuff) there.

## A system blackhole space
The idea is to insert the information into a system blackhole space at the end of the snapshot. If the snapshot does not contain the space, then the recovery is performed the old way (compatibility with old snapshots). Don't write the space on `box.snapshot()` after the downgrade (for backward compatibility). The blackhole space tuple format:

- `MP_BOOL` - `true` if the tuple is the last one for the given index;
- `MP_UINT` - the space ID;
- `MP_UINT` - the index ID;
- `MP_STRING` - a chunk of the sort data itself.
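A sketch of how one such carrier tuple could be encoded with msgpuck, assuming the field order listed above (the helper name and the buffer management are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include "msgpuck.h"

/*
 * Encode one sort data carrier tuple for the blackhole space:
 * [is_last, space_id, index_id, data_chunk]. The caller must have
 * reserved enough buffer space (see mp_sizeof_*). Returns the position
 * right after the encoded tuple.
 */
static char *
sort_data_tuple_encode(char *pos, bool is_last, uint32_t space_id,
		       uint32_t index_id, const char *chunk, uint32_t chunk_len)
{
	pos = mp_encode_array(pos, 4);
	pos = mp_encode_bool(pos, is_last);
	pos = mp_encode_uint(pos, space_id);
	pos = mp_encode_uint(pos, index_id);
	pos = mp_encode_str(pos, chunk, chunk_len);
	return pos;
}
```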
The SK sort data of a particular index consists of a number of such tuples; this is required to reduce the tuple arena usage. The last of the tuples has the `is_last` flag set to `true` to mark the point where the index has all the information required and can be built using the new approach.

Since the space with the sort data is filled at the end of the snapshot, the old tuple addresses must be saved in some other way, so that we can create the old-to-new address map during PK recovery. Let's save the PK tuple pointers in the snapshot right inside the headers of `INSERT` entries (right next to the timestamp).
Alternative considered: we could create another space for such data and write it before the user spaces, but that would require extra RAM to keep the PK tuple addresses around until we get to build the space the information is intended for. So we'd better store the information along with the tuples we load and forget it right after use.
Comparison to option 0:
➖ The sort data is built into the snapshot, so the latter can't be moved (backed up) separately.
➖ The sort data is first instantiated into a tuple and then provided to indexes (extra indirection).
➖ No nice way to send the data to replicas unless we generate it from scratch or hack the xlog reader.
➖ Additional ~1 byte per PK tuple of persistent storage.
## Getting tuple pointers from the read view

It would be nice (although probably not necessary; this needs to be measured) to be able to receive both the tuple data and the pointer to the raw tuple from the index in one step. So let's return the tuple pointer along with the data in `struct read_view_tuple` (add a new `struct tuple *ptr` field).

Alternative considered: introduce a new `struct tuple **ptr` output parameter to the `index_read_view_iterator_base::next_raw` callback.
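A sketch of the proposed extension; the existing fields shown are a simplification of the real `struct read_view_tuple`, the only addition is the `ptr` field:

```c
#include <stdint.h>

struct tuple;

/*
 * Result of one read view iteration step. The data/size pair is what
 * the checkpoint writer already consumes; the new ptr field would let
 * the sort data writer record the tuple address without extra lookups.
 */
struct read_view_tuple {
	/* Position and size of the raw tuple data (existing behavior). */
	const char *data;
	uint32_t size;
	/* NEW: pointer to the raw tuple the data belongs to. */
	struct tuple *ptr;
};
```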
## The configuration

The approach only makes sense if the database has many secondary keys of TREE type and a performant persistent storage, because it increases the initial recovery time and relies on fast direct reads of the sort data. So a new boolean `memtx_sort_data_enabled` (or `memtx.sort_data_enabled`) configuration option is proposed to specify whether to use the new approach for building secondary keys and writing the snapshot.

The configuration variable is to be changeable at runtime. If one does not want to use the sort data on recovery but wants to write it during snapshotting, it can be set to `false` initially and then reconfigured to `true` prior to `box.snapshot()`. It works the opposite way too.
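A rough sketch of how the engine side could consult such a runtime-changeable flag independently at checkpoint and at recovery time; the structure and function names are illustrative, not the final API:

```c
#include <stdbool.h>

/* Illustrative subset of the engine state. */
struct memtx_engine {
	/* Mirrors the memtx_sort_data_enabled configuration option. */
	bool sort_data_enabled;
};

/*
 * The option is dynamic, so checkpoint and recovery consult it
 * independently: one can write the sort data without using it on the
 * next recovery, or the other way around.
 */
static bool
memtx_should_write_sort_data(const struct memtx_engine *memtx)
{
	return memtx->sort_data_enabled;
}

static bool
memtx_should_use_sort_data(const struct memtx_engine *memtx,
			   bool sort_data_file_present)
{
	return memtx->sort_data_enabled && sort_data_file_present;
}
```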
## Expected results

### Checkpointing

➖ Additional data to be written to the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK.
➖ Additional RAM required if the secondary keys are changed during checkpointing (now we only take a read view of the PK, but we will have to take read views of the SKs too).

### Recovery

➖ Additional data to be read from the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK.
➖ The old-to-new tuple address map has to be created and filled (more RAM and CPU time required).
➕ The build of secondary indexes is significantly sped up.
## Practical results (storage option 0, as simple binary files)

### Configuration

- CPU: Zen 5, 8 dedicated cores
- Storage: NVMe (7000 MB/s read, 5000 MB/s write)
- Schema: tuples with 2 random unsigned fields; the PK is on part 1 and the SK on part 2.
### Checkpointing

0.119GB / 0.119GB
0.771GB / 0.771GB
7.305GB / 7.305GB
72.623GB / 72.623GB
### Recovery

One row per dataset size, smallest to largest; the last column is the proposed sort data approach.

| Dataset | tt_sort, 1 thread | tt_sort, 2 threads | sort data |
| --- | --- | --- | --- |
| 1 | Initial recovery: 0.302s<br>SK build: 0.098s<br>Mem: 0.131GB | Initial recovery: 0.314s<br>SK build: 0.061s<br>Mem: 0.131GB | Initial recovery: 0.357s<br>SK build: 0.043s<br>Mem: 0.139GB |
| 2 | Initial recovery: 2.266s<br>SK build: 1.104s<br>Mem: 0.852GB | Initial recovery: 2.281s<br>SK build: 0.699s<br>Mem: 0.852GB | Initial recovery: 2.666s<br>SK build: 0.224s<br>Mem: 1.209GB |
| 3 | Initial recovery: 21.776s<br>SK build: 13.644s<br>Mem: 8.121GB | Initial recovery: 21.694s<br>SK build: 7.806s<br>Mem: 8.120GB | Initial recovery: 25.619s<br>SK build: 3.301s<br>Mem: 11.312GB |
| 4 | Initial recovery: 3m19s<br>SK build: 3m49s<br>Mem: 80.897GB | Initial recovery: 3m36s<br>SK build: 1m57s<br>Mem: 80.897GB | Initial recovery: 4m10s<br>SK build: 46s<br>Mem: 101.628GB |
### Perf details

- Memory overhead (chart)
- SK build time (chart)
- Initial recovery overhead (chart)