Mypy uses the Python stdlib ast module for parsing. Currently, parsing can take 20% to 30% of the runtime when type checking without the incremental cache. The fraction tends to be higher when processing third-party dependencies, since their function bodies are processed during parsing but won't be type checked.
A Python-level AST is currently constructed more than once: first the ast module builds an AST, and then it is converted into a mypy AST. With a different parser, we could build a Python-level AST only once. (The ast module also constructs a C-level AST first, but this is probably relatively fast.)
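For context, a quick way to measure the stdlib parsing cost for a single file (this times only ast.parse itself; the subsequent conversion to a mypy AST is extra work on top of this):

```python
import ast
import sys
import time

# Time stdlib parsing of one file; mypy's ast -> mypy AST conversion
# would add further overhead on top of this.
path = sys.argv[1]
with open(path, "rb") as f:
    source = f.read()

n = 100
start = time.perf_counter()
for _ in range(n):
    ast.parse(source, filename=path)
print(f"{path}: {(time.perf_counter() - start) / n * 1000:.2f} ms per parse")
```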
It seems to me that the most promising approach is to adapt an existing parser library written in a lower-level language such as C or Rust (I'll call this the "low-level parser"). The low-level parser would generate a serialized AST using the efficient binary format we now support for incremental cache files, and mypy would deserialize the output directly into a mypy AST. I'm hoping this would improve parsing efficiency (including the time needed to build the mypy AST) by 50-100%. Overall mypy performance would improve by roughly 5% to 15%, and possibly more once we further optimize other parts of mypy.
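To make the idea concrete, here is a minimal sketch of the proposed flow. Everything except MypyFile is hypothetical: the fast_parser extension and the deserializer don't exist yet and only illustrate the shape of the design.

```python
from mypy.nodes import MypyFile


def parse_file(path: str, source: bytes) -> MypyFile:
    # Hypothetical compiled extension (C or Rust): parses the source and
    # returns an AST already serialized in mypy's binary cache format.
    import fast_parser  # hypothetical module

    blob: bytes = fast_parser.parse_to_cache_bytes(source, path)

    # Mypy would deserialize the blob straight into its own AST nodes,
    # never building a Python-level ast.Module in between.
    return deserialize_mypy_file(blob)  # hypothetical deserializer
```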
The deserialization-based approach also has a few other benefits:
- We can run the low-level parser in parallel using the threading module on all supported Python versions, since it doesn't need to hold the GIL during parsing (see the sketch after this list).
- If the low-level parser can also produce a list of reachable import statements, we can build a module dependency graph without building a mypy AST. This would simplify parallel type checking using multiple processes.
- It will be easy to only deserialize a subset of the AST for a module. We could perhaps use this to lazily process imported modules, or to implement incremental parsing in mypy daemon.
- We can hopefully drop most function and method bodies in third-party dependencies without allocating a Python AST for them.
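Here is a sketch of the thread-based parallelism mentioned in the first bullet, assuming a hypothetical low_level_parse function (the compiled extension) that releases the GIL while it runs and returns the serialized AST bytes:

```python
from concurrent.futures import ThreadPoolExecutor


def parse_many(paths: list[str]) -> dict[str, bytes]:
    # Plain threads give real parallelism here only because the
    # (hypothetical) low-level parser releases the GIL while parsing;
    # no subprocesses or result pickling are needed.
    def parse_one(path: str) -> tuple[str, bytes]:
        with open(path, "rb") as f:
            source = f.read()
        return path, low_level_parse(source, path)  # hypothetical extension call

    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(pool.map(parse_one, paths))
```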
The Ruff parser and tree-sitter are the low-level parser options I've looked at so far. Based on some quick benchmarking, the Ruff parser appears to be much faster. The tree-sitter parser seems slow enough that it might not bring us significant performance gains (at least without multi-threading).
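For reference, a rough harness for comparing raw parse speed of the stdlib parser and tree-sitter from Python. It assumes the tree_sitter and tree_sitter_python packages with the newer Parser(language) constructor (older py-tree-sitter releases use parser.set_language() instead); benchmarking the Ruff parser would need a similar harness on the Rust side.

```python
import ast
import sys
import time

import tree_sitter_python
from tree_sitter import Language, Parser

path = sys.argv[1]
with open(path, "rb") as f:
    source = f.read()


def bench(label: str, fn, n: int = 100) -> None:
    # Average wall-clock time per parse over n runs.
    start = time.perf_counter()
    for _ in range(n):
        fn()
    print(f"{label}: {(time.perf_counter() - start) / n * 1000:.2f} ms per parse")


ts_parser = Parser(Language(tree_sitter_python.language()))

bench("stdlib ast.parse", lambda: ast.parse(source, filename=path))
bench("tree-sitter", lambda: ts_parser.parse(source))
```

Note that tree-sitter produces a concrete syntax tree rather than anything mypy could consume directly, so this only compares raw parse speed, not the full cost of producing a mypy AST.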
Integrating a new parser will be a substantial amount of work, so we should first do some prototyping to make sure the expected performance gains are achievable.