Language classification should be configurable via plugin #2768

Open
@joshgoebel

Description

Initial implementation thoughts:

  • before:autoDetect({code, languageSubset, isSublanguage})
  • after:autoDetect({code, languageSubset, isSublanguage, results})

Before auto-detection, the plugin can take a look at the code and alter the languageSubset (add or remove languages). Perhaps it rules out certain languages, or only wants certain languages considered when the code matches a special pattern of its own. This allows for reducing (or increasing) the effort it takes to do an auto-detect pass. before:autoDetect is what you would use if you wanted to write, say, a shebang classifier: if you see #!/usr/bin/node, just immediately restrict to javascript without the cost of analyzing the code with 50 different grammars first.
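A minimal sketch of such a shebang classifier. The hook name and the context fields come from this proposal, which is not an existing API; the interpreter table and the choice to mutate the context in place are assumptions for illustration.

```javascript
// Hypothetical plugin for the proposed `before:autoDetect` hook.
// The interpreter-to-language table below is illustrative, not exhaustive.
const SHEBANG_LANGUAGES = new Map([
  ["node", "javascript"],
  ["python", "python"],
  ["python3", "python"],
  ["ruby", "ruby"],
  ["bash", "bash"],
  ["sh", "bash"],
]);

const shebangClassifier = {
  "before:autoDetect": (context) => {
    // Only the first line can hold a shebang.
    const firstLine = context.code.split("\n", 1)[0];
    // Matches `#!/usr/bin/node` as well as `#!/usr/bin/env node`.
    const match = /^#!\s*\S*\/(?:env\s+)?([\w.-]+)/.exec(firstLine);
    if (!match) return;
    const language = SHEBANG_LANGUAGES.get(match[1]);
    // Assumption: hooks narrow the candidate list by mutating the context.
    if (language) context.languageSubset = [language];
  },
};
```

Code without a recognized shebang leaves languageSubset untouched, so the normal auto-detect pass still runs.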

After auto-detection, the plugin may access the list of results (just an array of results per language from the call to highlight). It could individually alter those results, increasing or decreasing their relevance. This would be where you'd "boost" certain languages or provide additional context to the highlighter, i.e., we're 90% sure this is JS, so its relevance could be boosted to prevent false positives from other grammars.
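A rough sketch of a boosting after hook. It assumes each entry in results is an object with language and relevance fields; the detection heuristic and the 1.5 multiplier are arbitrary placeholders, not tuned values.

```javascript
// Hypothetical plugin for the proposed `after:autoDetect` hook.
const boostLikelyJavaScript = {
  "after:autoDetect": ({ code, results }) => {
    // Illustrative heuristic only: a couple of strongly JS-flavored calls.
    const looksLikeJs = /\bconsole\.log\(|\brequire\(['"]/.test(code);
    if (!looksLikeJs) return;
    for (const result of results) {
      // Assumption: relevance may be adjusted in place.
      if (result.language === "javascript") result.relevance *= 1.5;
    }
  },
};
```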

Finally, for highlightAuto's existing API, the results would be packaged into best and second_best (as they always have been) and returned to the client. Or I suppose we could allow the after hook to add its own result key, which would just be returned verbatim...
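Putting the three steps together, the flow might look roughly like the sketch below. Everything here is hypothetical: scoreLanguage stands in for the per-language call to highlight and is passed in as a parameter, plugins is a plain array of objects keyed by hook name, and the return shape (the best result with a second_best key attached) is an assumption about how the packaging would look.

```javascript
function autoDetectWithHooks(code, languageSubset, plugins, scoreLanguage) {
  const context = { code, languageSubset: [...languageSubset], isSublanguage: false };

  // 1. Before hooks may add or remove candidate languages.
  for (const plugin of plugins) {
    if (plugin["before:autoDetect"]) plugin["before:autoDetect"](context);
  }

  // 2. Score each remaining candidate; plaintext always competes at relevance 0.
  context.results = [
    { language: "plaintext", relevance: 0 },
    ...context.languageSubset.map((language) => ({
      language,
      relevance: scoreLanguage(code, language),
    })),
  ];

  // 3. After hooks may adjust relevance in place.
  for (const plugin of plugins) {
    if (plugin["after:autoDetect"]) plugin["after:autoDetect"](context);
  }

  // 4. Package the top two results for the caller.
  const [best, second_best] = [...context.results].sort(
    (a, b) => b.relevance - a.relevance
  );
  return { ...best, second_best };
}
```

With an empty languageSubset only the plaintext entry is scored, so plaintext "wins" by default.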

So what kind of things does this allow:

  • A before hook could decide not to auto-highlight at all by reducing the language list to an empty array. (leaving only plaintext to "win")
  • A before hook could replace the whole classification process with a custom one, i.e., an alternative strategy could be used to decide "what is this language" and then the languageSubset would be reduced to a single language. That could still tie with plaintext, though I'm not super concerned about this edge case until it comes up in the real world.
  • After hooks could use all sorts of strategies to tweak the order/relevance of any of the results, including very complex logic like considering how one language compares to another, i.e., "This code scores highly with all 4 different SQL grammars, so maybe I'll make some strategic decision based on that fact".
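As one illustration of the last bullet, an after hook could treat agreement across a family of related grammars as evidence for the family as a whole. This is only a sketch of one possible "strategic decision": the factory name familyConsensus, its parameters, and the policy of lifting a preferred member above its siblings are all invented here.

```javascript
// Hypothetical factory for an `after:autoDetect` plugin.
// `family` is a list of related grammar names, `preferred` is the member that
// should win when the whole family scores at least `minRelevance`.
function familyConsensus(family, preferred, minRelevance) {
  return {
    "after:autoDetect": ({ results }) => {
      const members = results.filter((r) => family.includes(r.language));
      const allStrong =
        members.length === family.length &&
        members.every((r) => r.relevance >= minRelevance);
      if (!allStrong) return;
      // Every member scored highly: lift the preferred member just above
      // the family's current top score.
      const top = Math.max(...members.map((r) => r.relevance));
      const winner = members.find((r) => r.language === preferred);
      if (winner) winner.relevance = top + 1;
    },
  };
}
```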

Is your request related to a specific problem you're having?

Yes, auto-detect guesses wildly when the code matches only a small number of rules.

The solution you'd prefer / feature you'd like to see added...

I.e., a relevance score of less than 5 perhaps should not be considered at all, so the code renders as plaintext rather than being highlighted incorrectly/poorly on the basis of a "wild guess".
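The cutoff itself is tiny. A sketch, assuming results are objects with language and relevance fields; the value 5 is the number floated above, not a tuned threshold, and dropWildGuesses is a made-up name.

```javascript
const MIN_RELEVANCE = 5; // value suggested in this issue, not a tuned threshold

// Keeps only results at or above the cutoff; falls back to plaintext
// when nothing is confident enough.
function dropWildGuesses(results, minRelevance = MIN_RELEVANCE) {
  const confident = results.filter((r) => r.relevance >= minRelevance);
  return confident.length > 0
    ? confident
    : [{ language: "plaintext", relevance: 0 }];
}
```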

Any alternative solutions you considered...

This threshold could be configurable, though I'm not convinced it needs to be. This could also be implemented by making our "classifier" a plugin (and that is likely a far more flexible solution). I.e., autoDetect would only do the "raw" scoring and then pass the results to a "classify" plugin that would make final determinations regarding winners, losers... Currently the API only returns the highest match and second_best. So the classifier would (for compatibility) need to follow that API, although there would be nothing stopping the classifier from adding extra metadata that could be used by custom implementations.
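A sketch of what a compatible classifier could return. The base shape (the best result with second_best attached) is an assumption about the existing return value; ranking and margin are invented examples of the extra metadata mentioned above.

```javascript
// Hypothetical "classify" step: takes raw per-language results and decides
// the winner while staying compatible with callers that only read
// the best result and `second_best`.
function classify(results) {
  const ranked = [...results].sort((a, b) => b.relevance - a.relevance);
  const [best = { language: "plaintext", relevance: 0 }, runnerUp] = ranked;
  return {
    ...best,
    second_best: runnerUp,
    // Extra metadata for custom consumers (names are illustrative):
    ranking: ranked.map((r) => r.language),
    margin: runnerUp ? best.relevance - runnerUp.relevance : best.relevance,
  };
}
```

Callers that ignore the extra keys see the same two pieces of information as before.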

The benefit of just supporting the existing API would be that highlightBlock() and initHighlightingOnLoad() would continue to "just work" with 0 required changes, other than registering your plugin.

Secondary benefits of a plugin-based approach

Instead of just "don't highlight after wild guesses" (as a feature), it becomes very easy to "improve" the language detection GREATLY based on additional context you have available that the highlighter does not, i.e., classification based on filename, StackOverflow post tags, lunar cycle, etc...
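For instance, a site that knows the filename could feed it in through a before hook. A sketch, reusing the proposed (not yet existing) before:autoDetect hook; the extension table and the filenameClassifier factory are made up for illustration.

```javascript
// Illustrative extension table; a real one would be much longer.
const EXTENSION_LANGUAGES = new Map([
  ["js", "javascript"],
  ["py", "python"],
  ["rb", "ruby"],
  ["sql", "sql"],
]);

// Hypothetical factory: builds a plugin from context the highlighter
// itself never sees (here, the filename of the snippet).
function filenameClassifier(filename) {
  const extension = filename.includes(".")
    ? filename.split(".").pop().toLowerCase()
    : "";
  const language = EXTENSION_LANGUAGES.get(extension);
  return {
    "before:autoDetect": (context) => {
      // Unknown extensions leave the candidate list alone.
      if (language) context.languageSubset = [language];
    },
  };
}
```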

Additional context...

https://meta.stackexchange.com/questions/354793/improving-syntax-highlighting-language-auto-detection/355496#355496

Labels: discuss/propose (Proposal for a new feature/direction), enhancement (An enhancement or new feature), help welcome (Could use help from community), plugin (Specific plugin or plugin discussion)
