Commit 0148739

docs: more architecture guide cleanups
1 parent 686d397 commit 0148739

8 files changed

+165
-136
lines changed

docs/architecture/sql-data.md

Lines changed: 40 additions & 34 deletions
Original file line number | Diff line number | Diff line change
@@ -1,38 +1,45 @@
11
# SQL Data Model
22

3-
The SQL data model is toyDB's representation of user data. It is made up of data types and schemas.
3+
The SQL data model represents user data in tables and rows. It is made up of data types and schemas,
4+
in the [`sql::types`](https://github.com/erikgrinaker/toydb/tree/686d3971a253bfc9facc2ba1b0e716cff5c109fb/src/sql/types)
5+
module.
46

57
## Data Types
68

7-
toyDB supports four basic scalar data types as `sql::types::DataType`: booleans, floats, integers,
9+
toyDB supports four basic scalar data types as `sql::types::DataType`: booleans, integers, floats,
810
and strings.
911

1012
https://github.com/erikgrinaker/toydb/blob/b2fe7b76ee634ca6ad31616becabfddb1c03d34b/src/sql/types/value.rs#L15-L27
1113

12-
Concrete values are represented as `sql::types::Value`, using corresponding Rust types. toyDB also
13-
supports SQL `NULL` values, i.e. unknown values, following the rules of
14+
Specific values are represented as `sql::types::Value`, using the corresponding Rust types. toyDB
15+
also supports SQL `NULL` values, i.e. unknown values, following the rules of
1416
[three-valued logic](https://en.wikipedia.org/wiki/Three-valued_logic).
1517

1618
https://github.com/erikgrinaker/toydb/blob/b2fe7b76ee634ca6ad31616becabfddb1c03d34b/src/sql/types/value.rs#L40-L64
1719

18-
The `Value` type provides basic formatting, conversion, and mathematical operations. It also
19-
specifies comparison and ordering semantics, but these are subtly different from the SQL semantics.
20-
For example, in Rust code `Value::Null == Value::Null` yields `true`, while in SQL `NULL = NULL`
21-
yields `NULL`. This mismatch is necessary for the Rust code to properly detect and process `Null`
22-
values, and the desired SQL semantics are implemented higher up in the SQL execution engine (we'll
23-
get back to this later).
20+
The `Value` type provides basic formatting, conversion, and mathematical operations.
21+
22+
https://github.com/erikgrinaker/toydb/blob/686d3971a253bfc9facc2ba1b0e716cff5c109fb/src/sql/types/value.rs#L68-L79
23+
24+
https://github.com/erikgrinaker/toydb/blob/686d3971a253bfc9facc2ba1b0e716cff5c109fb/src/sql/types/value.rs#L164-L370
25+
26+
It also specifies comparison and ordering semantics, but these are subtly different from the SQL
27+
semantics. For example, in Rust code `Value::Null == Value::Null` yields `true`, while in SQL
28+
`NULL = NULL` yields `NULL`. This mismatch is necessary for the Rust code to properly detect and
29+
process `Null` values, and the desired SQL semantics are implemented during expression evaluation,
30+
which we'll cover below.
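
As a rough illustration of this mismatch, here is a minimal sketch using a simplified stand-in for `Value`. The reduced enum and the `sql_eq` helper are assumptions made for this example, not toyDB's actual definitions:

```rust
/// A simplified stand-in for `sql::types::Value` (assumption: the real type
/// has more variants and its own comparison implementation).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Null,
    Boolean(bool),
    Integer(i64),
}

/// Hypothetical helper implementing SQL-style equality on top of the Rust
/// equality: any comparison involving NULL yields NULL (unknown).
fn sql_eq(lhs: &Value, rhs: &Value) -> Value {
    match (lhs, rhs) {
        (Value::Null, _) | (_, Value::Null) => Value::Null,
        (l, r) => Value::Boolean(l == r),
    }
}
```

With this sketch, `Value::Null == Value::Null` is `true` in Rust code, while `sql_eq(&Value::Null, &Value::Null)` returns `Value::Null`, which is why the two notions of equality have to live at different layers.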
2431

2532
https://github.com/erikgrinaker/toydb/blob/b2fe7b76ee634ca6ad31616becabfddb1c03d34b/src/sql/types/value.rs#L91-L162
2633

27-
During execution, a row of values will be represented as `sql::types::Row`, with multiple rows
28-
emitted as `sql::types::Rows` row iterators:
34+
During execution, a row of values is represented as `sql::types::Row`, with multiple rows emitted
35+
via `sql::types::Rows` row iterators:
2936

3037
https://github.com/erikgrinaker/toydb/blob/b2fe7b76ee634ca6ad31616becabfddb1c03d34b/src/sql/types/value.rs#L378-L388
3138

3239
## Schemas
3340

34-
toyDB schemas support a single object: a table. There's only a single, unnamed database, and no
35-
named indexes, constraints, or other schema objects.
41+
toyDB schemas only support tables. There are no named indexes or constraints, and there's only a
42+
single unnamed database.
3643

3744
Tables are represented by `sql::types::Table`:
3845

@@ -47,42 +54,41 @@ The table name serves as a unique identifier, and can't be changed later. In fac
4754
are entirely static: they can only be created or dropped (there are no schema changes).
4855

4956
Table schemas are stored in the catalog, represented by the `sql::engine::Catalog` trait. We'll
50-
revisit the implementation of this trait in the storage section below.
57+
revisit the implementation of this trait in the SQL storage section.
5158

5259
https://github.com/erikgrinaker/toydb/blob/0839215770e31f1e693d5cccf20a68210deaaa3f/src/sql/engine/engine.rs#L60-L79
5360

54-
Table schemas are validated (e.g. during creation) via the `Table::validate()` method, which
55-
enforces invariants and internal consistency. It uses the catalog to look up information about other
56-
tables, e.g. that foreign key references point to a valid target column.
61+
Table schemas are validated when created via `Table::validate()`, which enforces invariants and
62+
internal consistency. It uses the catalog to look up information about other tables, e.g. that
63+
foreign key references point to a valid target column in a different table.
5764

5865
https://github.com/erikgrinaker/toydb/blob/c2b0f7f1d6cbf6e2cdc09fc0aec7b050e840ec21/src/sql/types/schema.rs#L98-L170
5966

60-
It also has a `Table::validate_row()` method which is used to validate that a given
61-
`sql::types::Row` conforms to the schema (e.g. that the value data types match the column data
62-
types). It uses a `sql::engine::Transaction` to look up other rows in the database, e.g. to check
63-
for primary key conflicts (we'll get back to this below).
67+
Table rows are validated via `Table::validate_row()`, which ensures that a `sql::types::Row`
68+
conforms to the schema (e.g. that value types match the column data types). It uses a
69+
`sql::engine::Transaction` to look up other rows in the database, e.g. to check for primary key
70+
conflicts (we'll get back to this later).
6471

6572
https://github.com/erikgrinaker/toydb/blob/c2b0f7f1d6cbf6e2cdc09fc0aec7b050e840ec21/src/sql/types/schema.rs#L172-L236
6673

6774
## Expressions
6875

6976
During SQL execution, we also have to model _expressions_, such as `1 + 2 * 3`. These are
70-
represented as values and operations on them. They can be nested arbitrarily as a tree to represent
71-
compound operations.
77+
represented as values and operations on them, and can be nested as a tree to represent compound
78+
operations.
7279

7380
https://github.com/erikgrinaker/toydb/blob/9419bcf6aededf0e20b4e7485e2a5fa3e975d79f/src/sql/types/expression.rs#L11-L64
7481

7582

76-
For example:
83+
For example, the expression `1 + 2 * 3` (taking [precedence](https://en.wikipedia.org/wiki/Order_of_operations)
84+
into account) is represented as:
7785

7886
```rust
79-
// 1 + 2 * 3 is represented as:
80-
//
81-
// +
82-
// / \
83-
// 1 *
84-
// / \
85-
/// 2 3
87+
//     +
88+
//    / \
89+
//   1   *
90+
//      / \
91+
//     2   3
8692
Expression::Add(
8793
    Expression::Constant(Value::Integer(1)),
8894
    Expression::Multiply(
@@ -97,8 +103,8 @@ An `Expression` can contain two kinds of values: constant values as
97103
references. The latter will fetch a `sql::types::Value` from a `sql::types::Row` at the specified
98104
index during evaluation.
99105

100-
We'll see later how the SQL parser and planner transforms text expressions like `1 + 2 * 3` into
101-
this `Expression` form, and how it resolves column names to row indexes -- e.g. `price * 0.25` to
106+
We'll see later how the SQL parser and planner transform text expressions like `1 + 2 * 3` into an
107+
`Expression`, and how they resolve column names to row indexes, e.g. `price * 0.25` to
102108
`row[3] * 0.25`.
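
To make the tree representation and column references concrete, here is a minimal sketch of recursive evaluation. The reduced `Value` and `Expression` enums, the `Column` variant name, and the `Option` return type are assumptions for illustration (the real types have more variants and return proper errors), and integers are used in place of floats to keep it short:

```rust
/// Simplified stand-in for `sql::types::Value` (assumption: integers only).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Integer(i64),
}

/// Simplified stand-in for `sql::types::Expression`.
#[derive(Debug)]
enum Expression {
    /// A constant value.
    Constant(Value),
    /// A column reference, as an index into the row.
    Column(usize),
    Add(Box<Expression>, Box<Expression>),
    Multiply(Box<Expression>, Box<Expression>),
}

impl Expression {
    /// Recursively evaluates the expression against a row. Returns `None`
    /// if a column index is out of bounds.
    fn evaluate(&self, row: &[Value]) -> Option<Value> {
        Some(match self {
            Expression::Constant(value) => value.clone(),
            Expression::Column(index) => row.get(*index)?.clone(),
            Expression::Add(lhs, rhs) => {
                let (Value::Integer(l), Value::Integer(r)) =
                    (lhs.evaluate(row)?, rhs.evaluate(row)?);
                Value::Integer(l + r)
            }
            Expression::Multiply(lhs, rhs) => {
                let (Value::Integer(l), Value::Integer(r)) =
                    (lhs.evaluate(row)?, rhs.evaluate(row)?);
                Value::Integer(l * r)
            }
        })
    }
}
```

Evaluating the tree for `1 + 2 * 3` against an empty row yields `Value::Integer(7)`, while an expression such as `Column(3) * 2` fetches `row[3]` first and then multiplies it.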
103109

104110
Expressions are evaluated recursively via `Expression::evaluate()`, given a `sql::types::Row` with

docs/architecture/sql-execution.md

Lines changed: 14 additions & 13 deletions
@@ -1,44 +1,45 @@
11
# SQL Execution
22

3-
Ok, now that the planner and optimizer has done all the hard work of figuring out how to execute a
3+
Now that the planner and optimizer have done all the hard work of figuring out how to execute a
44
query, it's time to actually execute it.
55

66
## Plan Executor
77

88
Plan execution is done by `sql::execution::Executor` in the
99
[`sql::execution`](https://github.com/erikgrinaker/toydb/tree/9419bcf6aededf0e20b4e7485e2a5fa3e975d79f/src/sql/execution)
10-
module, using a `sql::engine::Transaction` to perform read/write operations on the SQL engine.
10+
module, using a `sql::engine::Transaction` to access the SQL storage engine.
1111

1212
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/execution/executor.rs#L14-L49
1313

1414
The executor takes a `sql::planner::Plan` as input, and will return an `ExecutionResult` depending
1515
on the statement type.
1616

17-
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/execution/executor.rs#L330-L338
17+
https://github.com/erikgrinaker/toydb/blob/686d3971a253bfc9facc2ba1b0e716cff5c109fb/src/sql/execution/executor.rs#L331-L339
1818

1919
When executing the plan, the executor will branch off depending on the statement type:
2020

21-
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/execution/executor.rs#L56-L100
21+
https://github.com/erikgrinaker/toydb/blob/686d3971a253bfc9facc2ba1b0e716cff5c109fb/src/sql/execution/executor.rs#L57-L101
2222

23-
We'll focus on `SELECT` queries here, which is the most interesting.
23+
We'll focus on `SELECT` queries here, which are the most interesting.
2424

25-
toyDB uses the iterator model (also known as the volcano model) for query execution. In the case
26-
of a `SELECT` query, the result is a result row iterator, and pulling from this iterator by calling
27-
`next()` will drive the entire execution pipeline. This maps very naturally onto Rust's iterators,
28-
and we leverage these to construct the execution pipeline as nested iterators.
25+
toyDB uses the iterator model (also known as the volcano model) for query execution. In the case of
26+
a `SELECT` query, the result is a row iterator, and pulling from this iterator by calling `next()`
27+
will drive the entire execution pipeline by recursively calling `next()` on the child node results.
28+
This maps very naturally onto Rust's iterators, and we leverage these to construct the execution
29+
pipeline as nested iterators.
2930

3031
Execution itself is fairly straightforward, since we're just doing exactly what the planner tells us
3132
to do in the plan. We call `Executor::execute_node` recursively on each `sql::planner::Node`,
3233
starting with the root node. Each node returns a result row iterator that the parent node can pull
3334
its input rows from, process them, and output the resulting rows via its own row iterator (with the
3435
root node's iterator being returned to the caller):
3536

36-
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/execution/executor.rs#L102-L103
37+
https://github.com/erikgrinaker/toydb/blob/686d3971a253bfc9facc2ba1b0e716cff5c109fb/src/sql/execution/executor.rs#L103-L104
3738

38-
`Executor::execute_node` will simply look at the type of `Node`, recursively call
39-
`Executor::execute_node` on any child nodes, and then process the rows accordingly.
39+
`Executor::execute_node()` will simply look at the type of `Node`, recursively call
40+
`Executor::execute_node()` on any child nodes, and then process the rows accordingly.
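
The nested-iterator idea can be sketched roughly as follows. This is not the actual executor: the three `Node` variants, the integer-only `Row`, and the free function `execute_node` are simplifying assumptions chosen to show how pulling from the root iterator drives the child iterators:

```rust
/// Simplified stand-ins (assumptions): rows are vectors of integers, and
/// `Rows` is a boxed row iterator.
type Row = Vec<i64>;
type Rows = Box<dyn Iterator<Item = Row>>;

/// A tiny, hypothetical plan node type with three kinds of nodes.
enum Node {
    /// Emits a fixed set of rows (stands in for a table scan).
    Values(Vec<Row>),
    /// Keeps rows where `row[column] >= min`.
    Filter { source: Box<Node>, column: usize, min: i64 },
    /// Emits at most `limit` rows.
    Limit { source: Box<Node>, limit: usize },
}

/// Recursively builds a pipeline of nested iterators. No rows are processed
/// until the caller pulls from the returned iterator with `next()`.
fn execute_node(node: Node) -> Rows {
    match node {
        Node::Values(rows) => Box::new(rows.into_iter()),
        Node::Filter { source, column, min } => {
            let source = execute_node(*source);
            Box::new(source.filter(move |row| row[column] >= min))
        }
        Node::Limit { source, limit } => Box::new(execute_node(*source).take(limit)),
    }
}
```

Because the pipeline is lazy, a `Limit` over a `Filter` stops pulling from its source as soon as it has emitted enough rows.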
4041

41-
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/execution/executor.rs#L102-L211
42+
https://github.com/erikgrinaker/toydb/blob/686d3971a253bfc9facc2ba1b0e716cff5c109fb/src/sql/execution/executor.rs#L103-L212
4243

4344
We won't discuss every plan node in detail, but let's consider the movie plan we've looked at
4445
previously:

docs/architecture/sql-optimizer.md

Lines changed: 22 additions & 25 deletions
@@ -93,15 +93,14 @@ Additionally, `ConstantFolding` also short-circuits logical expressions. For exa
9393

9494
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/planner/optimizer.rs#L58-L84
9595

96-
As the code comment mentions though, this doesn't fold as far as possible. It doesn't attempt to
97-
rearrange expressions, which would require knowledge of precedence rules. For example,
98-
`(1 + foo) - 2` could be folded into `foo - 1` by first rearranging it as `foo + (1 - 2)`, but we
99-
don't do this currently.
96+
As the code comment mentions though, this doesn't fold optimally: it doesn't attempt to rearrange
97+
expressions, which would require knowledge of precedence rules. For example, `(1 + foo) - 2` could
98+
be folded into `foo - 1` by first rearranging it as `foo + (1 - 2)`, but we don't do this currently.
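
A minimal sketch of this kind of bottom-up folding, and of its limitation, might look like the following. The `Expr` enum and `fold` function are hypothetical and much smaller than the real optimizer:

```rust
/// Hypothetical, reduced expression type for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Constant(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Subtract(Box<Expr>, Box<Expr>),
}

/// Folds children first, then replaces an operation with a constant only if
/// both of its operands are constants. It never rearranges the tree.
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(lhs, rhs) => match (fold(*lhs), fold(*rhs)) {
            (Expr::Constant(l), Expr::Constant(r)) => Expr::Constant(l + r),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Subtract(lhs, rhs) => match (fold(*lhs), fold(*rhs)) {
            (Expr::Constant(l), Expr::Constant(r)) => Expr::Constant(l - r),
            (l, r) => Expr::Subtract(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

In this sketch `1 + (5 - 2)` folds to `4`, but `(1 + foo) - 2` is returned unchanged, since no single operation has two constant operands.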
10099

101100
## Filter Pushdown
102101

103102
The `FilterPushdown` optimizer attempts to push filter predicates as far down into the plan as
104-
possible, to reduce the amount of work we do.
103+
possible, to reduce the number of rows each node has to process.
105104

106105
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/planner/optimizer.rs#L90-L95
107106

@@ -117,10 +116,10 @@ Select
117116
└─ Scan: genres
118117
```
119118

120-
Even though we're filtering on `release >= 2000`, the `Scan` node still has to read all of them
121-
from disk and send them via Raft, and the `NestedLoopJoin` node still has to join all of them.
122-
It would be nice if we could push this filtering into into the `NestedLoopJoin` and `Scan` nodes
123-
and avoid this work, which is exactly what `FilterPushdown` does.
119+
Even though we're filtering on `release >= 2000`, the `Scan` node still has to read all of them from
120+
disk and send them via Raft, and the `NestedLoopJoin` node still has to join all of them. It would
121+
be nice if we could push this filtering into the `NestedLoopJoin` and `Scan` nodes and avoid this
122+
extra work, and this is exactly what `FilterPushdown` does.
124123

125124
The only plan nodes that have predicates that can be pushed down are `Filter` nodes and
126125
`NestedLoopJoin` nodes, so we recurse through the plan tree and look for these nodes, attempting
@@ -159,10 +158,9 @@ discussed previously. This allows us to examine and push down each AND part in i
159158
has the same effect regardless of whether it is evaluated in the `NestedLoopJoin` node or one of
160159
the source nodes. Our expression is already in conjunctive normal form, though.
161160

162-
We then look at each AND part, and check which side of the join they have column references for.
163-
If they only reference one of the sides, then the expression can be pushed down into it. We also
164-
make some effort here to move primary/foreign key constants across to both sides, but we'll gloss
165-
over that.
161+
We then look at each AND part, and check which side of the join it has column references for. If it
162+
only references one of the sides, then the expression can be pushed down into it. We also make some
163+
effort here to move primary/foreign key constants across to both sides, but we'll gloss over that.
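
The per-part check can be sketched as below. All names here (`Expr`, `into_and_parts`, `pushdown_target`, `Target`) are hypothetical, and the sketch assumes that a joined row is laid out as the left columns followed by the right columns, so that a column index below `left_width` belongs to the left source:

```rust
/// Hypothetical, reduced predicate type for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(usize),
    Constant(i64),
    Equal(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Where an AND part can be evaluated.
#[derive(Debug, PartialEq)]
enum Target {
    Left,
    Right,
    Join,
}

/// Splits a predicate into its top-level AND parts.
fn into_and_parts(expr: Expr) -> Vec<Expr> {
    match expr {
        Expr::And(lhs, rhs) => {
            let mut parts = into_and_parts(*lhs);
            parts.extend(into_and_parts(*rhs));
            parts
        }
        other => vec![other],
    }
}

/// Collects all column indexes referenced by an expression.
fn collect_columns(expr: &Expr, columns: &mut Vec<usize>) {
    match expr {
        Expr::Column(index) => columns.push(*index),
        Expr::Constant(_) => {}
        Expr::Equal(lhs, rhs) | Expr::And(lhs, rhs) => {
            collect_columns(lhs, columns);
            collect_columns(rhs, columns);
        }
    }
}

/// Decides where a part can go: a part that only references one side can be
/// pushed into it, otherwise it stays in the join. A part without any column
/// references is arbitrarily assigned to the left side in this sketch.
fn pushdown_target(part: &Expr, left_width: usize) -> Target {
    let mut columns = Vec::new();
    collect_columns(part, &mut columns);
    if columns.iter().all(|&c| c < left_width) {
        Target::Left
    } else if columns.iter().all(|&c| c >= left_width) {
        Target::Right
    } else {
        Target::Join
    }
}
```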
166164

167165
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/planner/optimizer.rs#L155-L247
168166

@@ -213,7 +211,7 @@ The code is as outlined above:
213211

214212
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/planner/optimizer.rs#L254-L303
215213

216-
Helped by `Expression::is_column_lookup` and `Expression::into_column_values`:
214+
Helped by `Expression::is_column_lookup()` and `Expression::into_column_values()`:
217215

218216
https://github.com/erikgrinaker/toydb/blob/9419bcf6aededf0e20b4e7485e2a5fa3e975d79f/src/sql/types/expression.rs#L363-L421
219217

@@ -227,14 +225,13 @@ A [nested loop join](https://en.wikipedia.org/wiki/Nested_loop_join) is a very i
227225
algorithm, which iterates over all rows in the right source for each row in the left source to see
228226
if they match. However, it is completely general, and can join on arbitrarily complex predicates.
229227

230-
In the common case where the join predicate is an equality check (i.e. an
231-
[equijoin](https://en.wikipedia.org/wiki/Relational_algebra#θ-join_and_equijoin)), such as
232-
`movies.genre_id = genres.id`, then we can instead use a
233-
[hash join](https://en.wikipedia.org/wiki/Hash_join). This scans the right table once, builds an
234-
in-memory hash table from it, and for each left row it looks up any right rows in the hash table.
235-
This is a much more efficient O(n) algorithm.
228+
In the common case where the join predicate is an equality comparison such as
229+
`movies.genre_id = genres.id` (i.e. an [equijoin](https://en.wikipedia.org/wiki/Relational_algebra#θ-join_and_equijoin)),
230+
then we can instead use a [hash join](https://en.wikipedia.org/wiki/Hash_join). This scans the right
231+
table once, builds an in-memory hash table from it, and for each left row it looks up any right rows
232+
in the hash table. This is a much more efficient O(n) algorithm.
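
The build and probe phases described above can be sketched as follows. This is a simplified inner join over integer rows; the `hash_join` function and its signature are assumptions for illustration, and `NULL` handling is ignored entirely:

```rust
use std::collections::HashMap;

/// Simplified stand-in for a row (assumption: integers only).
type Row = Vec<i64>;

/// Joins `left` and `right` where `left[left_col] == right[right_col]`.
fn hash_join(left: &[Row], left_col: usize, right: &[Row], right_col: usize) -> Vec<Row> {
    // Build phase: scan the right source once, indexing rows by join value.
    let mut table: HashMap<i64, Vec<&Row>> = HashMap::new();
    for row in right {
        table.entry(row[right_col]).or_default().push(row);
    }

    // Probe phase: look up matching right rows for each left row.
    let mut output = Vec::new();
    for l in left {
        if let Some(matches) = table.get(&l[left_col]) {
            for r in matches {
                let mut joined = l.clone();
                joined.extend(r.iter().copied());
                output.push(joined);
            }
        }
    }
    output
}
```

Each source is scanned only once, and each probe is a constant-time lookup on average, whereas a nested loop join compares every left row with every right row.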
236233

237-
In our previous movie example, we are in fact doing an equijoin, and so our `NestedLoopJoin`:
234+
In our previous movie example, we are in fact doing an equijoin:
238235

239236
```
240237
Select
@@ -245,7 +242,7 @@ Select
245242
└─ Scan: genres
246243
```
247244

248-
Will be replaced by a `HashJoin`:
245+
And so our `NestedLoopJoin` can be replaced by a `HashJoin`:
249246

250247
```
251248
Select
@@ -263,15 +260,15 @@ hash table), but we keep it simple.
263260
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/planner/optimizer.rs#L309-L348
264261

265262
Of course there are many other join algorithms out there, and one of the harder problems in SQL
266-
optimization is how to efficiently perform deep multijoins. We don't attempt to tackle these
263+
optimization is how to efficiently perform large N-way multijoins. We don't attempt to tackle these
267264
problems here -- the `HashJoin` optimizer is just a very simple example of such join optimization.
268265

269266
## Short Circuiting
270267

271268
The `ShortCircuit` optimizer tries to find nodes that can't possibly do any useful work, and either
272269
removes them from the plan, or replaces them with trivial nodes that don't do anything. It is kind
273-
of similar to the `ConstantFolding` optimizer in spirit, but works at the plan node level rather
274-
than the expression node level.
270+
of similar to the `ConstantFolding` optimizer in spirit, but works on plan nodes rather than
271+
expression nodes.
275272

276273
https://github.com/erikgrinaker/toydb/blob/213e5c02b09f1a3cac6a8bbd0a81773462f367f5/src/sql/planner/optimizer.rs#L350-L354
277274

docs/architecture/sql-parser.md

Lines changed: 11 additions & 9 deletions
@@ -1,7 +1,7 @@
11
# SQL Parsing
22

3-
And so we finally arrive at SQL. The SQL parser is the first stage in processing SQL
4-
queries and statements, located in the [`src/sql/parser`](https://github.com/erikgrinaker/toydb/tree/39c6b60afc4c235f19113dc98087176748fa091d/src/sql/parser)
3+
We finally arrive at SQL. The SQL parser is the first stage in processing SQL queries and
4+
statements, located in the [`sql::parser`](https://github.com/erikgrinaker/toydb/tree/39c6b60afc4c235f19113dc98087176748fa091d/src/sql/parser)
55
module.
66

77
The SQL parser's job is to take a raw SQL string and turn it into a structured form that's more
@@ -99,7 +99,7 @@ string are well-formed. For example, the following input string:
9999
Will result in these tokens:
100100

101101
```
102-
String("foo"), CloseParen, Number("3.14"), Keyword(Select), Plus, Ident("x")
102+
String("foo") CloseParen Number("3.14") Keyword(Select) Plus Ident("x")
103103
```
104104

105105
Tokens and keywords are represented by the `sql::parser::Token` and `sql::parser::Keyword` enums
@@ -137,12 +137,14 @@ kinds of SQL statements that we support, along with their contents:
137137

138138
https://github.com/erikgrinaker/toydb/blob/39c6b60afc4c235f19113dc98087176748fa091d/src/sql/parser/ast.rs#L6-L145
139139

140-
The nested tree structure is particularly apparent with _expressions_ -- these represent values and
141-
operations which will eventually _evaluate_ to a single value. For example, the expression
142-
`2 * 3 - 4 / 2`, which evaluates to the value `4`.
140+
The nested tree structure is particularly apparent with expressions, which represent values and
141+
operations on them. For example, the expression `2 * 3 - 4 / 2` evaluates to the value `4`.
143142

144-
These expressions are represented as `sql::parser::ast::Expression`, and can be nested indefinitely
145-
into a tree structure.
143+
We've seen in the data model section how such expressions are represented as
144+
`sql::types::Expression`, but before we get there we have to parse them. The parser has its own
145+
representation `sql::parser::ast::Expression` -- this is necessary e.g. because in the AST, we
146+
represent columns as names rather than numeric indexes (we don't know yet which columns exist or
147+
what their names are; we'll get to that during planning).
146148

147149
https://github.com/erikgrinaker/toydb/blob/39c6b60afc4c235f19113dc98087176748fa091d/src/sql/parser/ast.rs#L147-L170
148150

@@ -215,7 +217,7 @@ than that of the operators preceding them (hence "precedence climbing"). For exa
215217
2 * 3 - 4 / 2
216218
```
217219

218-
The algorithm is documented in more detail on `Parser::parse_expression`:
220+
The algorithm is documented in more detail on `Parser::parse_expression()`:
219221

220222
https://github.com/erikgrinaker/toydb/blob/39c6b60afc4c235f19113dc98087176748fa091d/src/sql/parser/parser.rs#L501-L696
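
For intuition, here is a minimal precedence climbing sketch over pre-lexed tokens. It is not toyDB's parser: the `Token` enum, the `Parser` struct, and the `evaluate` helper are assumptions for this example, it only handles four left-associative binary operators on integers, and it computes the value directly instead of building an AST:

```rust
/// Hypothetical token type: integers and four binary operators.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Token {
    Number(i64),
    Plus,
    Minus,
    Star,
    Slash,
}

/// Returns the precedence of a binary operator, or `None` for non-operators.
fn precedence(token: Token) -> Option<u8> {
    match token {
        Token::Plus | Token::Minus => Some(1),
        Token::Star | Token::Slash => Some(2),
        Token::Number(_) => None,
    }
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn peek(&self) -> Option<Token> {
        self.tokens.get(self.pos).copied()
    }

    /// Parses an operand, then keeps consuming operators whose precedence is
    /// at least `min_precedence`. The right-hand side is parsed with a higher
    /// minimum precedence, which makes the operators left-associative.
    fn parse_expression(&mut self, min_precedence: u8) -> Option<i64> {
        let mut lhs = match self.peek()? {
            Token::Number(n) => n,
            _ => return None,
        };
        self.pos += 1;
        while let Some(op) = self.peek() {
            let prec = match precedence(op) {
                Some(p) if p >= min_precedence => p,
                _ => break,
            };
            self.pos += 1;
            let rhs = self.parse_expression(prec + 1)?;
            lhs = match op {
                Token::Plus => lhs + rhs,
                Token::Minus => lhs - rhs,
                Token::Star => lhs * rhs,
                Token::Slash => lhs.checked_div(rhs)?,
                Token::Number(_) => return None,
            };
        }
        Some(lhs)
    }
}

/// Parses and evaluates all tokens, returning `None` on malformed input.
fn evaluate(tokens: Vec<Token>) -> Option<i64> {
    let mut parser = Parser { tokens, pos: 0 };
    let value = parser.parse_expression(1)?;
    (parser.pos == parser.tokens.len()).then_some(value)
}
```

Feeding it the tokens for `2 * 3 - 4 / 2` yields `4`: the multiplication and division bind tighter than the subtraction, so the input is grouped as `(2 * 3) - (4 / 2)`.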
221223
