Use native implementation of difflib #4484

@JonoYang

Using pyinstrument to profile a license scan of the test file meta-quest-oss-notice.md, we found that a lot of time is spent in the sequence matching portion (about 5.4 of 11.9 seconds in find_longest_match alone):

11.902 get_licenses  scancode/api.py:150
├─ 11.858 detect_licenses  licensedcode/detection.py:2180
│  ├─ 9.619 LicenseIndex.match  licensedcode/index.py:892
│  │  ├─ 9.469 LicenseIndex.match_query  licensedcode/index.py:960
│  │  │  ├─ 8.915 LicenseIndex.get_approximate_matches  licensedcode/index.py:718
│  │  │  │  ├─ 6.753 LicenseIndex.get_query_run_approximate_matches  licensedcode/index.py:808
│  │  │  │  │  ├─ 6.360 match_sequence  licensedcode/match_seq.py:48
│  │  │  │  │  │  ├─ 5.469 match_blocks  licensedcode/seq.py:107
│  │  │  │  │  │  │  ├─ 5.446 find_longest_match  licensedcode/seq.py:19
│  │  │  │  │  │  │  │  ├─ 5.289 [self]  licensedcode/seq.py
│  │  │  │  │  │  │  │  ├─ 0.112 dict.get  <built-in>
│  │  │  │  │  │  │  │  └─ 0.045 extend_match  licensedcode/seq.py:84
│  │  │  │  │  │  │  │     ├─ 0.037 [self]  licensedcode/seq.py
│  │  │  │  │  │  │  │     └─ 0.008 <lambda>  <string>:1
│  │  │  │  │  │  │  │        ├─ 0.007 [self]  <string>
│  │  │  │  │  │  │  │        └─ 0.001 tuple.__new__  <built-in>

Currently, we use a pure Python implementation of difflib to perform license detection (https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/licensedcode/seq.py). One way to improve the performance of this part of license detection would be to use a native implementation of difflib such as https://pypi.org/project/cdifflib/ or https://github.com/rapidfuzz/CyDifflib
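
As a rough illustration of the idea, here is a minimal sketch of swapping in a native matcher behind the standard `difflib.SequenceMatcher` interface. It assumes `cdifflib` is installed and exposes `CSequenceMatcher` as a drop-in replacement, and falls back to the standard library otherwise. The helper names `longest_match` and `matching_blocks` and the sample token ids are hypothetical and are not the functions in licensedcode/seq.py, which is a customized variant with its own signatures, so the real change would likely need an adapter rather than a plain import swap.

```python
# Sketch only: prefer a native SequenceMatcher when available.
try:
    # Assumption: cdifflib is installed (pip install cdifflib).
    from cdifflib import CSequenceMatcher as SequenceMatcher
except ImportError:
    # Fallback to the pure Python implementation from the standard library.
    from difflib import SequenceMatcher


def longest_match(a, b):
    """Return (a_start, b_start, size) of the longest common block of a and b."""
    matcher = SequenceMatcher(None, a, b)
    return tuple(matcher.find_longest_match(0, len(a), 0, len(b)))


def matching_blocks(a, b):
    """Return all matching blocks as (a_start, b_start, size) tuples."""
    matcher = SequenceMatcher(None, a, b)
    return [tuple(block) for block in matcher.get_matching_blocks()]


# Hypothetical token id sequences standing in for query and rule tokens.
query_tokens = [1, 2, 3, 4, 5, 6]
rule_tokens = [9, 2, 3, 4, 8, 5, 6]

print(longest_match(query_tokens, rule_tokens))
print(matching_blocks(query_tokens, rule_tokens))
```

Because both classes share the same interface, the pure Python and native results can be compared on the same inputs before measuring any speedup with the profiler.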
