Language classification should be configurable via plugin #2768

**Initial implementation thoughts:**

- `before:autoDetect({code, languageSubset, isSublanguage})`
- `after:autoDetect({code, languageSubset, isSublanguage, results})`

Before auto-detection, the plugin can look at the code and alter the `languageSubset` (add or remove languages). Perhaps it rules out certain languages, or wants certain languages to be considered only rarely, when the code matches a special pattern of its own. This allows for reducing (or increasing) the effort it takes to do an auto-detect pass. `before:autoDetect` is what you would use if you wanted to write, say, a shebang classifier, i.e. if you see `#!/usr/bin/node` then just immediately restrict to `javascript` **without** the cost of analyzing the code with 50 different grammars first.
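A minimal sketch of such a shebang classifier, assuming the hook receives a mutable context object shaped like the signature above. The `before:autoDetect` hook is only proposed here and does not exist in a released highlight.js, and the interpreter table and the `shebangClassifier` name are purely illustrative:

```javascript
// Hypothetical plugin for the PROPOSED `before:autoDetect` hook.
// Assumption: the hook may narrow the search by reassigning `languageSubset`.
const SHEBANG_LANGUAGES = new Map([
  ["node", "javascript"],
  ["python", "python"],
  ["python3", "python"],
  ["bash", "bash"],
  ["sh", "bash"],
  ["ruby", "ruby"],
]);

const shebangClassifier = {
  "before:autoDetect": (context) => {
    // Only the first line counts: "#!/usr/bin/node" or "#!/usr/bin/env node".
    const match = /^#![ \t]*(\S+)(?:[ \t]+(\S+))?/.exec(context.code);
    if (!match) return; // no shebang: leave the subset untouched

    let interpreter = match[1].split("/").pop();
    if (interpreter === "env" && match[2]) interpreter = match[2];

    const language = SHEBANG_LANGUAGES.get(interpreter);
    if (language) context.languageSubset = [language];
  },
};

// Calling the hook by hand, since no library dispatches it yet.
const shebangContext = {
  code: "#!/usr/bin/node\nconsole.log('hi');",
  languageSubset: ["javascript", "python", "ruby"],
  isSublanguage: false,
};
shebangClassifier["before:autoDetect"](shebangContext);
// shebangContext.languageSubset is now ["javascript"]
```

Unknown interpreters fall through without touching the subset, so the normal auto-detect pass still runs for them.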

After auto-detection, the plugin may access the list of results (just an array of results per language from the calls to `highlight`). It could individually alter those results, increasing or decreasing their relevance. This would be where you'd "boost" a certain language or provide additional context to the highlighter, i.e. we're 90% sure this is JS, so its relevance could be boosted to prevent false positives from other grammars.
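As a rough illustration of the "boost" idea, assuming each entry in `results` carries the `language` and `relevance` fields that highlight results already have. The `after:autoDetect` hook is the proposed one, and `createBoostPlugin` and its multiplier are hypothetical:

```javascript
// Hypothetical factory for a plugin using the PROPOSED `after:autoDetect` hook.
// It multiplies the relevance of one language the caller is already fairly
// sure about; every other result is left alone.
function createBoostPlugin(language, factor) {
  return {
    "after:autoDetect": (context) => {
      for (const result of context.results) {
        if (result.language === language) {
          result.relevance *= factor;
        }
      }
    },
  };
}

// Calling the hook by hand, since no library dispatches it yet.
const boostContext = {
  code: "let x = 1;",
  languageSubset: ["javascript", "typescript"],
  isSublanguage: false,
  results: [
    { language: "javascript", relevance: 10 },
    { language: "typescript", relevance: 12 },
  ],
};
createBoostPlugin("javascript", 1.5)["after:autoDetect"](boostContext);
// javascript now scores 15 and outranks typescript at 12
```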

Finally, for `highlightAuto`'s existing API, the results would be packaged into best and `second_best` (as they always have been) and returned to the client. Or I suppose we could allow the `after` hook to add its own `result` key, which would just be returned verbatim...

_So what kind of things does this allow:_

- A before hook could decide not to auto-highlight at all by reducing the language list to an empty array (leaving only plaintext to "win").
- A before hook could replace the whole classification process with a custom one, i.e. an alternative strategy could be used to decide "what is this language" and then the `languageSubset` would be reduced to a single language. That could still tie with plaintext, but I'm not super concerned about this edge case until it comes up in the real world.
- After hooks could use all sorts of strategies to tweak the order/relevance of any of the results, including very complex logic like considering how one language compares to another, i.e. "This code scores highly with all 4 different SQL grammars, so maybe I'll make some strategic decision based on that fact".
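One possible "strategic decision" for the SQL case in the last bullet, sketched under several assumptions: the dialect names in `SQL_FAMILY` are illustrative, the proposed `after:autoDetect` hook is allowed to replace `context.results`, and collapsing the family into the generic `sql` entry is just one arbitrary policy:

```javascript
// Hypothetical plugin for the PROPOSED `after:autoDetect` hook.
// When several SQL dialects score, treat them as a single vote for generic
// `sql`: it inherits the best family score and the other dialects are dropped.
const SQL_FAMILY = new Set(["sql", "pgsql", "tsql", "plsql"]); // illustrative names

const sqlFamilyPlugin = {
  "after:autoDetect": (context) => {
    const family = context.results.filter((r) => SQL_FAMILY.has(r.language));
    const generic = family.find((r) => r.language === "sql");
    if (!generic || family.length < 2) return; // nothing to collapse

    generic.relevance = Math.max(...family.map((r) => r.relevance));
    context.results = context.results.filter(
      (r) => r === generic || !SQL_FAMILY.has(r.language)
    );
  },
};
```

A real policy might instead prefer a specific dialect; the point is only that the hook can see all scores at once and compare them.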


---

**Is your request related to a specific problem you're having?**

Yes, auto-detect will *guess very wildly* when it only matches a small number of rules.  

**The solution you'd prefer / feature you'd like to see added...**

I.e., a relevance score of less than 5 perhaps should not be considered at all, so the code is rendered as plaintext rather than the highlighter essentially making a "wild guess" and rendering it incorrectly or poorly.
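The rule could look roughly like this, assuming the raw scoring step yields `{ language, relevance }` pairs. The function name is hypothetical and the default of 5 is simply the number floated above:

```javascript
// Hypothetical helper: pick the winning language, but refuse to guess when
// even the best score is below the threshold.
function pickLanguage(results, minRelevance = 5) {
  let best = null;
  for (const result of results) {
    if (best === null || result.relevance > best.relevance) best = result;
  }
  // An empty list or a weak best match both fall back to plaintext.
  if (best === null || best.relevance < minRelevance) return "plaintext";
  return best.language;
}

pickLanguage([{ language: "ini", relevance: 2 }, { language: "yaml", relevance: 3 }]);
// returns "plaintext": the best score (3) is below the threshold
```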

**Any alternative solutions you considered...**

This threshold could be configurable, though I'm not convinced it needs to be. This could also be implemented by making our "classifier" a plugin (and that is likely a far more flexible solution). I.e., `autoDetect` would only do the "raw" scoring and then pass the results to a "classify" plugin that would make final determinations regarding winners, losers... Currently the API only returns the highest match and `second_best`, so the classifier would (for compatibility) need to follow that API, although there would be nothing stopping the classifier from adding extra metadata that could be used by custom implementations.
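A sketch of what such a classifier might return, assuming it receives the raw per-language results. The `classify` name and the extra `candidates` field are hypothetical, and the tie-breaking here (input order) is simpler than whatever the real implementation does:

```javascript
// Hypothetical classifier: ranks raw results and returns the familiar
// "best result plus `second_best`" shape, with optional extra metadata.
function classify(results) {
  if (results.length === 0) {
    throw new Error("classify() needs at least one result");
  }
  // Sort a copy so the caller's array is not reordered; ties keep input order.
  const ranked = [...results].sort((a, b) => b.relevance - a.relevance);
  const [best, secondBest] = ranked;
  return {
    ...best,
    second_best: secondBest, // undefined when there is only one candidate
    // Extra metadata: existing callers can ignore it, custom ones can use it.
    candidates: ranked.map(({ language, relevance }) => ({ language, relevance })),
  };
}
```

Because the top-level shape is unchanged, code that only reads `language`, `relevance` and `second_best` would not need to know the classifier was swapped.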

The benefit of just supporting the existing API would be that `highlightBlock()` and `initHighlightingOnLoad()` would continue to "just work" with zero required changes, other than registering your plugin.

**Secondary benefits of a plugin-based approach**

Instead of just "don't highlight after wild guesses" (as a feature), it becomes very easy to "improve" the language detection GREATLY based on additional context you have available that the highlighter does not, e.g. classification based on filename, StackOverflow post tags, lunar cycle, etc.
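For instance, a site that knows the filename of a snippet could feed that in through a plugin. This is only a sketch: the extension table and `createFilenamePlugin` are hypothetical, and it relies on the proposed `before:autoDetect` hook being able to reassign `languageSubset`:

```javascript
// Hypothetical factory: the filename is context the highlighter never sees,
// so the plugin captures it in a closure and narrows the subset up front.
const EXTENSION_LANGUAGES = new Map([
  ["js", "javascript"],
  ["py", "python"],
  ["rb", "ruby"],
  ["rs", "rust"],
]);

function createFilenamePlugin(filename) {
  const dot = filename.lastIndexOf(".");
  const extension = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  const language = EXTENSION_LANGUAGES.get(extension);
  return {
    "before:autoDetect": (context) => {
      // Unknown or missing extensions leave normal auto-detection in place.
      if (language) context.languageSubset = [language];
    },
  };
}
```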

**Additional context...**

https://meta.stackexchange.com/questions/354793/improving-syntax-highlighting-language-auto-detection/355496#355496

