Profiling a license scan of the test file meta-quest-oss-notice.md with pyinstrument shows that most of the time is spent in the sequence matching portion:
11.902 get_licenses scancode/api.py:150
├─ 11.858 detect_licenses licensedcode/detection.py:2180
│ ├─ 9.619 LicenseIndex.match licensedcode/index.py:892
│ │ ├─ 9.469 LicenseIndex.match_query licensedcode/index.py:960
│ │ │ ├─ 8.915 LicenseIndex.get_approximate_matches licensedcode/index.py:718
│ │ │ │ ├─ 6.753 LicenseIndex.get_query_run_approximate_matches licensedcode/index.py:808
│ │ │ │ │ ├─ 6.360 match_sequence licensedcode/match_seq.py:48
│ │ │ │ │ │ ├─ 5.469 match_blocks licensedcode/seq.py:107
│ │ │ │ │ │ │ ├─ 5.446 find_longest_match licensedcode/seq.py:19
│ │ │ │ │ │ │ │ ├─ 5.289 [self] licensedcode/seq.py
│ │ │ │ │ │ │ │ ├─ 0.112 dict.get <built-in>
│ │ │ │ │ │ │ │ └─ 0.045 extend_match licensedcode/seq.py:84
│ │ │ │ │ │ │ │   ├─ 0.037 [self] licensedcode/seq.py
│ │ │ │ │ │ │ │   └─ 0.008 <lambda> <string>:1
│ │ │ │ │ │ │ │     ├─ 0.007 [self] <string>
│ │ │ │ │ │ │ │     └─ 0.001 tuple.__new__ <built-in>
Currently, we use a pure Python implementation of difflib to perform license detection (https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/licensedcode/seq.py). In the profile above, find_longest_match alone accounts for about 5.4 of the 11.9 seconds, almost all of it in its own body. One way to improve the performance of this part of license detection would be to use a native implementation of difflib, such as https://pypi.org/project/cdifflib/ or https://github.com/rapidfuzz/CyDifflib.
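To get a rough feel for the possible gain before touching licensedcode, something like the sketch below could compare the standard library matcher with a native one on integer token sequences. It is only an illustration under several assumptions: the helper names (`make_token_ids`, `time_matching_blocks`) and the sequence sizes are made up, the random token ids are not real license text, and it uses the stock `difflib.SequenceMatcher` API rather than the adapted code in licensedcode/seq.py, so whether a native matcher can be dropped in there still needs to be checked. If cdifflib is not installed, the script falls back to the standard library class and both timings measure the same code.

```python
import difflib
import random
import time

# Optional native matcher. Assumption: cdifflib is installed and exposes
# CSequenceMatcher; otherwise fall back to the stdlib class so the script runs.
try:
    from cdifflib import CSequenceMatcher as NativeMatcher
except ImportError:
    NativeMatcher = difflib.SequenceMatcher


def make_token_ids(length, vocabulary_size, seed):
    """Build a reproducible list of integer token ids (hypothetical test data)."""
    rng = random.Random(seed)
    return [rng.randrange(vocabulary_size) for _ in range(length)]


def time_matching_blocks(matcher_class, a, b):
    """Return the matching blocks as plain tuples and the elapsed seconds."""
    start = time.perf_counter()
    matcher = matcher_class(None, a, b, autojunk=False)
    blocks = [tuple(block) for block in matcher.get_matching_blocks()]
    return blocks, time.perf_counter() - start


# Stand-ins for a query run and a rule, both tokenized to integer ids.
query_tokens = make_token_ids(2000, 500, seed=1)
rule_tokens = make_token_ids(2000, 500, seed=2)

pure_blocks, pure_seconds = time_matching_blocks(
    difflib.SequenceMatcher, query_tokens, rule_tokens
)
native_blocks, native_seconds = time_matching_blocks(
    NativeMatcher, query_tokens, rule_tokens
)

print(f"stdlib difflib: {pure_seconds:.4f}s, {len(pure_blocks)} blocks")
print(f"{NativeMatcher.__name__}: {native_seconds:.4f}s, {len(native_blocks)} blocks")
print("same blocks:", pure_blocks == native_blocks)
```

Checking that both matchers return the same blocks matters as much as the timing, since any change in match results would change license detection output.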