Description
Enry currently consists of a sequence of matching strategies that narrow down the possible language options based on the different information available:
- filename + extension
- first line of the content
- regexp heuristics on the raw content
- naive Bayesian classifier on the tokenized content
As a user, since each strategy can be used independently (each is sketched below), I would like to know how accurate the language detection is for each of the distinct use cases.
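For reference, a minimal Go sketch of querying each strategy on its own. It assumes the strategy functions exposed by the enry Go package (`GetLanguagesByFilename`, `GetLanguagesByExtension`, `GetLanguagesByShebang`, `GetLanguagesByModeline`, `GetLanguagesByContent`, `GetLanguagesByClassifier`) and the `gopkg.in/src-d/enry.v1` import path; exact names may differ between versions.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"gopkg.in/src-d/enry.v1"
)

func main() {
	filename := "example.go" // hypothetical sample file
	content, err := ioutil.ReadFile(filename)
	if err != nil {
		log.Fatal(err)
	}

	// The full sequence of strategies, as used by default.
	fmt.Println("all strategies:", enry.GetLanguage(filename, content))

	// Each strategy can also be queried on its own; the last argument is a
	// list of candidate languages coming from earlier strategies (nil here).
	fmt.Println("filename:  ", enry.GetLanguagesByFilename(filename, content, nil))
	fmt.Println("extension: ", enry.GetLanguagesByExtension(filename, content, nil))
	fmt.Println("shebang:   ", enry.GetLanguagesByShebang(filename, content, nil))
	fmt.Println("modeline:  ", enry.GetLanguagesByModeline(filename, content, nil))
	fmt.Println("heuristics:", enry.GetLanguagesByContent(filename, content, nil))
	// The classifier ranks an explicit candidate list.
	fmt.Println("classifier:", enry.GetLanguagesByClassifier(filename, content, []string{"Go", "C", "Python"}))
}
```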
Use cases
- all strategies together (default)
- filename-only language detection
- content-only language detection

(a rough mapping of these use cases to API calls is sketched after this list)
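To make the three use cases concrete, a rough sketch of how each one could be wired up from the strategies above. The helper names are hypothetical, and the content-only chain only approximates what `enry.GetLanguage` does internally.

```go
package detect

import (
	"path/filepath"

	"gopkg.in/src-d/enry.v1"
)

// DetectDefault runs all strategies together (the enry default).
func DetectDefault(path string, content []byte) string {
	return enry.GetLanguage(filepath.Base(path), content)
}

// DetectFilenameOnly uses only the filename and extension strategies.
func DetectFilenameOnly(path string) string {
	name := filepath.Base(path)
	langs := enry.GetLanguagesByFilename(name, nil, nil)
	if len(langs) == 0 {
		langs = enry.GetLanguagesByExtension(name, nil, nil)
	}
	if len(langs) == 0 {
		return ""
	}
	return langs[0] // simplification: take the first candidate
}

// DetectContentOnly chains only the content-based strategies, each one
// narrowing the candidates of the previous one (a simplified version of the
// default sequence).
func DetectContentOnly(content []byte) string {
	var c []string
	c = narrow(c, enry.GetLanguagesByShebang("", content, c))
	c = narrow(c, enry.GetLanguagesByModeline("", content, c))
	c = narrow(c, enry.GetLanguagesByContent("", content, c))
	c = narrow(c, enry.GetLanguagesByClassifier("", content, c))
	if len(c) == 0 {
		return ""
	}
	return c[0]
}

// narrow keeps the previous candidates when a strategy returns nothing.
func narrow(prev, next []string) []string {
	if len(next) == 0 {
		return prev
	}
	return next
}
```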
Evaluation
Right now, the only measure of the overall accuracy of the language detection process we have is binary (similar to linguist): whether all the samples in linguist/examples/ are classified correctly or not.
This issue is about picking a better way of quantifying the prediction quality for the three use cases above.
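As one possible starting point for that discussion, a small sketch (hypothetical helpers, not part of enry) of two candidate metrics computed from parallel lists of expected and predicted languages: plain accuracy and macro-averaged per-language recall, which surfaces languages that are systematically misclassified even when the overall number looks good.

```go
package eval

// Accuracy is the fraction of samples whose predicted language matches the
// expected one. Both slices are assumed to be parallel and of equal length.
func Accuracy(expected, predicted []string) float64 {
	if len(expected) == 0 {
		return 0
	}
	correct := 0
	for i, want := range expected {
		if predicted[i] == want {
			correct++
		}
	}
	return float64(correct) / float64(len(expected))
}

// MacroRecall averages per-language recall, so every language contributes
// equally regardless of how many samples it has.
func MacroRecall(expected, predicted []string) float64 {
	total := map[string]int{}
	correct := map[string]int{}
	for i, want := range expected {
		total[want]++
		if predicted[i] == want {
			correct[want]++
		}
	}
	if len(total) == 0 {
		return 0
	}
	sum := 0.0
	for lang, n := range total {
		sum += float64(correct[lang]) / float64(n)
	}
	return sum / float64(len(total))
}
```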
Steps
- identify a small dataset to evaluate on: smola/language-dataset (from "Add human annotations to dubious samples (round 1)", smola/language-dataset#3)
- a notebook with a PoC of the evaluation, to pick the best metric (using the Python API from "Python bindings for enry", #154)
- a script that runs enry for each use case on this dataset with the metric chosen above, e.g. as part of CI (a rough sketch follows)
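A rough sketch of what such a script could look like, written in Go for illustration (the actual PoC would likely live in the Python notebook from the previous step). It assumes, purely for illustration, a dataset laid out as one directory per expected language, and it approximates the filename-only and content-only use cases by passing nil content or an empty filename to `enry.GetLanguage`; the real dataset layout and metric should come from the previous two steps.

```go
// Command evalenry runs the three detection use cases over a dataset laid out
// as <root>/<language>/<file> and reports accuracy for each.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"

	"gopkg.in/src-d/enry.v1"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <dataset-root>", os.Args[0])
	}
	root := os.Args[1]

	var expected, byAll, byFilename, byContent []string

	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		// Expected language = name of the parent directory (assumed layout).
		lang := filepath.Base(filepath.Dir(path))
		content, err := ioutil.ReadFile(path)
		if err != nil {
			return err
		}

		expected = append(expected, lang)
		// Use case 1: all strategies together (the enry default).
		byAll = append(byAll, enry.GetLanguage(filepath.Base(path), content))
		// Use case 2: filename-only detection (no content).
		byFilename = append(byFilename, enry.GetLanguage(filepath.Base(path), nil))
		// Use case 3: content-only detection (no filename).
		byContent = append(byContent, enry.GetLanguage("", content))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("all strategies: %.3f\n", accuracy(expected, byAll))
	fmt.Printf("filename-only:  %.3f\n", accuracy(expected, byFilename))
	fmt.Printf("content-only:   %.3f\n", accuracy(expected, byContent))
}

// accuracy is a placeholder metric; plug in whatever the notebook PoC picks.
func accuracy(expected, predicted []string) float64 {
	if len(expected) == 0 {
		return 0
	}
	correct := 0
	for i := range expected {
		if predicted[i] == expected[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(expected))
}
```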
The focus of this task is not to get the best possible evaluation, but rather to quickly kick off the automation of at least some evaluation, which will then be improved in subsequent work.