Faster parser #19776

@JukkaL

Description

Mypy uses the Python stdlib ast module for parsing. Currently, parsing can take 20% to 30% of the runtime when type checking without the incremental cache. When processing third-party dependencies, the fraction tends to be even higher, since function bodies are processed during parsing but won't be type checked.

A Python-level AST is currently constructed more than once: first the ast module builds its own AST, and then this is converted into a mypy AST. If we used a different parser, we would only need to build a Python-level AST once. (The ast module also first constructs a C-level AST, but that step is probably relatively fast.)

It seems to me that the most promising approach is to adapt an existing parser library written in a lower-level language such as C or Rust (I will call this the "low-level parser"). We'd generate a serialized AST in the low-level parser using the efficient binary format we now support for incremental cache files, and mypy would deserialize the output directly into a mypy AST. I'm hoping that this would improve parsing efficiency (including time needed to build the mypy AST) by 50-100%. Overall mypy performance would be improved by roughly 5% to 15%, and possibly more once we further optimize other parts of mypy.
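To illustrate the idea, here is a toy sketch of deserializing a tagged binary stream directly into checker-side AST nodes. The node classes (NameExpr, CallExpr), tag values, and encoding are all invented for this example; mypy's actual cache format and node types differ. The point is that the Python side only ever allocates the final nodes, never an intermediate Python-level tree:

```python
import io
import struct

# Hypothetical tag bytes; the real binary format would be richer.
TAG_NAME, TAG_CALL = 0, 1

class NameExpr:
    def __init__(self, name: str) -> None:
        self.name = name

class CallExpr:
    def __init__(self, callee: NameExpr, args: list) -> None:
        self.callee = callee
        self.args = args

def write_str(buf: io.BytesIO, s: str) -> None:
    data = s.encode("utf-8")
    buf.write(struct.pack("<I", len(data)))
    buf.write(data)

def serialize(node, buf: io.BytesIO) -> None:
    # Stand-in for what the low-level parser would emit.
    if isinstance(node, NameExpr):
        buf.write(bytes([TAG_NAME]))
        write_str(buf, node.name)
    elif isinstance(node, CallExpr):
        buf.write(bytes([TAG_CALL]))
        serialize(node.callee, buf)
        buf.write(struct.pack("<I", len(node.args)))
        for arg in node.args:
            serialize(arg, buf)

def read_str(buf: io.BytesIO) -> str:
    (n,) = struct.unpack("<I", buf.read(4))
    return buf.read(n).decode("utf-8")

def deserialize(buf: io.BytesIO):
    # Build final AST nodes straight from the byte stream.
    tag = buf.read(1)[0]
    if tag == TAG_NAME:
        return NameExpr(read_str(buf))
    if tag == TAG_CALL:
        callee = deserialize(buf)
        (n,) = struct.unpack("<I", buf.read(4))
        return CallExpr(callee, [deserialize(buf) for _ in range(n)])
    raise ValueError(f"unknown tag {tag}")
```

A round trip of `CallExpr(NameExpr("f"), [NameExpr("x")])` reproduces the same tree. In the real design, `serialize` would live in the low-level parser, and only `deserialize` would run in Python (ideally itself implemented in C for speed).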

The deserialization-based approach also has a few other benefits:

  • We can run the low-level parser in parallel by using the threading module, on all supported Python versions, since it doesn't need to hold the GIL during parsing.
  • If the low-level parser can also produce a list of reachable import statements, we can build a module dependency graph without building a mypy AST. This would simplify parallel type checking using multiple processes.
  • It will be easy to only deserialize a subset of the AST for a module. We could perhaps use this to lazily process imported modules, or to implement incremental parsing in mypy daemon.
  • We can hopefully drop most function and method bodies in third-party dependencies without allocating a Python AST for them.
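The dependency-graph idea from the second bullet can be approximated today with the stdlib ast module; the low-level parser would report the same information without ever building a Python AST. This sketch ignores reachability (e.g. imports guarded by `if False:`), which the real implementation would need to handle:

```python
import ast

def module_imports(source: str) -> set[str]:
    """Collect the modules imported anywhere in a source file.

    A stdlib-ast approximation of what the low-level parser
    could report directly as a flat list of import targets.
    """
    deps: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return deps
```

For example, `module_imports("import os, sys\nfrom typing import List")` yields `{"os", "sys", "typing"}`, which is exactly the edge set needed for a module dependency graph.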

The Ruff parser and tree-sitter are the low-level parser options I've looked at so far. Based on some quick benchmarking, the Ruff parser appears to be much faster. The tree-sitter parser seems slow enough that it might not bring us significant performance gains (at least without multi-threading).
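For the prototyping mentioned below, a simple timing harness gives an apples-to-apples baseline: time stdlib ast parsing of representative files, then time each low-level parser candidate the same way. This sketch only benchmarks the stdlib side; the candidate parsers would be substituted at the marked point:

```python
import ast
import time

def bench_parse(source: str, iterations: int = 100) -> float:
    """Return the best-of-N wall-clock time to parse `source`.

    Uses ast.parse as the baseline; a candidate low-level parser
    would be dropped in at the same call site for comparison.
    """
    best = float("inf")
    for _ in range(iterations):
        t0 = time.perf_counter()
        ast.parse(source)  # candidate parser goes here
        best = min(best, time.perf_counter() - t0)
    return best
```

Best-of-N is used rather than an average to reduce noise from GC pauses and OS scheduling; any candidate should be measured on the same corpus of real-world modules, not toy snippets.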

Integrating a new parser will be a substantial amount of work, so we should first do some prototyping to make sure the performance gains are feasible.
