This repo contains the implementation of several Machine Learning algorithms for Named Entity Recognition. We build, train and evaluate them on several datasets, considering three aspects: prediction quality, memory consumption, and inference latency.
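As a rough illustration of how latency and memory could be measured for any model, here is a minimal sketch using only the Python standard library. The names `profile_inference` and `dummy_predict` are hypothetical and are not part of this repo. `tracemalloc` only tracks memory allocated by the Python interpreter, so it would miss most of the native memory used by TensorFlow, and tracing itself slows the code down, so treat the numbers as indicative only.

```python
import time
import tracemalloc
from statistics import mean


def profile_inference(predict_fn, inputs, n_runs=5):
    """Return mean latency (seconds) and peak Python-heap memory (bytes)."""
    latencies = []
    tracemalloc.start()  # tracks Python allocations only, not native buffers
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(inputs)
        latencies.append(time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"mean_latency_s": mean(latencies), "peak_memory_bytes": peak}


# Hypothetical stand-in for a trained tagger: labels every token as "O".
def dummy_predict(sentences):
    return [["O"] * len(sentence) for sentence in sentences]


stats = profile_inference(dummy_predict, [["John", "lives", "in", "Rome"]])
print(stats)
```

To profile a real model, pass its prediction function in place of `dummy_predict`.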
See environment.yml. In general, I used tensorflow.keras and scikit-learn for my ML experiments 🔮.
conda env create -f environment.yml
conda activate ner-suite

You can now play with the notebooks!
- `data/`: directory containing all the datasets used in the notebooks:
  - CoNLL03;
  - Annotated Corpus for NER;
  - WikiNER (English and Italian);
- `embeddings/`: directory containing the pre-trained word embeddings:
  - `glove.6B.100d.txt` for English;
  - `w2v.itWac.128d.txt` for Italian;
- `utils`: a package that I made to improve code modularity, reusability and readability;
- `<algo>-<dataset>.ipynb`: the notebooks with the experiments that we made;
- `environment.yml`: conda environment file to replicate the environment on your machine and reproduce the experiments;
- `results.xlsx`: results of the experiments;
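For reference, GloVe text files store one word per line followed by its space-separated vector components. The sketch below shows one way such a file could be parsed into a dictionary. `load_glove` is a hypothetical helper, not necessarily what the `utils` package does, and the in-memory sample stands in for the real file.

```python
import io


def load_glove(lines):
    """Parse GloVe-style text lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Tiny in-memory sample standing in for embeddings/glove.6B.100d.txt
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove(sample)
print(len(vectors), len(vectors["the"]))
```

With the real file, the same function can be fed an open file handle, for example `open("embeddings/glove.6B.100d.txt", encoding="utf-8")`. The itWac file may use a different layout (for example a header line), so check it before reusing this parser.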
- Conditional Random Fields: a traditional Machine Learning algorithm which can deal with sequences. Refer to the original paper and the implementation of the sklearn wrapper;
- LSTM: the most used recurrent neural network for modeling sequences. We also use it in combination with pre-trained embeddings like GloVe and itWac;
- End-to-end model: this paper proposes a model that combines a CNN to extract morphological features from the characters of each word, GloVe embeddings to represent word-level features, a bidirectional LSTM to model the context, and finally a CRF layer to decode the best sequence of labels. We implemented it thanks to the work already done in this repo.
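To make the "decode the best sequence of labels" step more concrete, here is a minimal pure-Python sketch of Viterbi decoding, the dynamic-programming procedure typically used by a CRF layer at inference time. It is an illustration, not the code used in the notebooks. The function name, the label set and all the scores below are made up for the example.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence and its score.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for ending in label j at position t
            best_prev = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emissions[t][j])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final label
    best_last = max(range(n_labels), key=lambda j: scores[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]


labels = ["O", "B-PER", "I-PER"]
# Made-up scores for the tokens "John", "Smith", "sleeps"
emissions = [
    [1.0, 2.0, 0.0],
    [0.5, 0.0, 1.5],
    [2.0, 0.0, 0.5],
]
# Made-up transition scores; O -> I-PER is heavily penalized
transitions = [
    [0.0, 0.0, -10.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.5],
]
path, score = viterbi_decode(emissions, transitions)
print([labels[i] for i in path])  # ['B-PER', 'I-PER', 'O']
```

In the full model the emission scores would come from the bidirectional LSTM and the transition scores would be learned by the CRF layer. Here both are hard-coded.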
- Improve documentation;