Comparison of Entity Extraction techniques for annotation

This repo contains implementations of several Machine Learning algorithms for Named Entity Recognition. We build, train, and evaluate them on several datasets, considering three aspects: prediction quality, memory consumption, and inference latency.
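As a rough illustration of how the latency and memory aspects could be measured, here is a minimal sketch using only the Python standard library. The helpers `measure_latency` and `measure_peak_memory` and the `dummy_tagger` stand-in are hypothetical names, not code from the notebooks. Note that `tracemalloc` only tracks memory allocated by Python itself, so it would underestimate the footprint of a model whose tensors live in native memory.

```python
import time
import tracemalloc


def measure_latency(predict, batches, warmup=1):
    """Return the mean wall-clock seconds per batch for `predict`."""
    for batch in batches[:warmup]:
        predict(batch)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for batch in batches:
        predict(batch)
    return (time.perf_counter() - start) / len(batches)


def measure_peak_memory(predict, batches):
    """Return the peak number of bytes traced by tracemalloc during prediction."""
    tracemalloc.start()
    try:
        for batch in batches:
            predict(batch)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def dummy_tagger(sentences):
    """Hypothetical stand-in for a trained model: tags every token as 'O'."""
    return [["O"] * len(sentence) for sentence in sentences]


# Ten batches, each holding one tokenized sentence.
batches = [[["EU", "rejects", "German", "call"]] for _ in range(10)]
latency = measure_latency(dummy_tagger, batches)
peak_bytes = measure_peak_memory(dummy_tagger, batches)
print(f"{latency * 1e6:.1f} us per batch, peak {peak_bytes} bytes")
```

Timing and memory tracing run in separate passes because tracing adds overhead that would otherwise inflate the latency figure.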

Dependencies

See environment.yml. In general, I used tensorflow.keras and scikit-learn for my ML experiments 🔮.

Setup

conda env create -f environment.yml
conda activate ner-suite

You can now play with the notebooks!

Project Structure

data/: directory containing all the datasets used in the notebooks:
- CoNLL03;
- Annotated Corpus for NER;
- WikiNER (English and Italian);
embeddings/: directory containing the pre-trained word embeddings:
- glove.6B.100d.txt for English;
- w2v.itWac.128d.txt for Italian;
utils: a package I wrote to make the code more modular, reusable, and readable;
<algo>-<dataset>.ipynb: the notebooks with our experiments, one per algorithm and dataset;
environment.yml: conda environment file for replicating the environment on your machine and reproducing the experiments;
results.xlsx: results of the experiments.
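The embedding files listed above can be turned into an embedding matrix along these lines. This is a sketch, not the loader from the `utils` package: the function names are hypothetical, and it assumes the plain-text layout used by glove.6B.100d.txt, where each line holds a word followed by its space-separated vector components. Whether w2v.itWac.128d.txt follows the same layout is an assumption; lines with an unexpected number of fields (such as a word2vec-style header) are skipped.

```python
import io


def load_text_embeddings(lines, dim):
    """Parse 'word v1 ... v_dim' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        # Skip blank lines and any line that is not a word plus `dim` values,
        # e.g. a two-field "vocab_size dim" header.
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is left for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        # Assumption: the vocabulary is lowercased, as in the glove.6B files.
        vec = vectors.get(word.lower())
        if vec is not None:
            matrix[idx] = vec  # words without a vector keep the all-zero row
    return matrix


# Tiny in-memory stand-in for an embeddings file (3 dimensions instead of 100).
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nrome 0.4 0.5 0.6\n")
vectors = load_text_embeddings(sample, dim=3)
matrix = build_embedding_matrix({"Rome": 1, "xyzzy": 2}, vectors, dim=3)
```

With a real file, the first argument would be an open file handle, for example `open("embeddings/glove.6B.100d.txt", encoding="utf-8")` with `dim=100`. The resulting matrix can then be used as the initial weights of an embedding layer.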

Models and references:

Conditional Random Fields: a traditional Machine Learning algorithm for labeling sequences. Refer to the original paper and to the implementation of the sklearn wrapper;
LSTM: one of the most widely used recurrent neural networks for modeling sequences. We also use it in combination with pre-trained embeddings such as GloVe and itWac;
End-to-end model: this paper proposes a model that combines a CNN to extract morphological features from the characters of each word, GloVe embeddings to represent word-level features, a bidirectional LSTM to model the context, and finally a CRF layer to decode the best sequence of labels. We implemented it by building on the work already done in this repo.
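To make the "decode the best sequence of labels" step concrete, here is a minimal pure-Python sketch of Viterbi decoding, the standard algorithm for finding the highest-scoring label sequence in a linear-chain CRF. It is not taken from the notebooks: the function name, the label set, and all scores are made up for illustration, and start/end transition scores are omitted for brevity.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring sequence of label indices.

    emissions[t][j]   -- score of label j at position t
    transitions[i][j] -- score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label to transition from when moving into label j.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B-PER", "I-PER"]
# Illustrative transition scores: forbid "O -> I-PER", leave the rest neutral.
transitions = [
    [0.0, 0.0, -100.0],  # from O
    [0.0, 0.0, 0.0],     # from B-PER
    [0.0, 0.0, 0.0],     # from I-PER
]
# Illustrative per-token scores, such as a BiLSTM might produce for 3 tokens.
emissions = [
    [2.0, 1.5, 0.0],  # taken alone, this token would be tagged O
    [0.0, 0.0, 2.0],  # strongly I-PER
    [1.0, 0.0, 0.0],  # O
]
decoded = [labels[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)  # ['B-PER', 'I-PER', 'O']
```

Picking the best label for each token independently would yield the invalid sequence O, I-PER, O. The transition scores let the decoder trade a slightly worse first label for a consistent sequence, which is the motivation for placing a CRF layer on top of the BiLSTM.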

TODO

Improve documentation;


SimoneRichetti/NER-comparison
