Monolingual NER Results for Various Languages

The neural NER system I implemented as part of the TALLIP paper and the ACL 2018 paper achieves the following F1 scores on various languages.

Results

| Language  | Dataset    | Word Embeddings     | Reference      | F1 Score |
|-----------|------------|---------------------|----------------|----------|
| English   | CoNLL 2003 | Spectral Embeddings | arXiv Paper    | 90.94    |
| Spanish   | CoNLL 2002 | Spectral Embeddings | arXiv Paper    | 85.75    |
| Dutch     | CoNLL 2002 | Spectral Embeddings | arXiv Paper    | 85.20    |
| German    | Link       | Spectral Embeddings | ACL 2018 Paper | 87.64    |
| Italian   | Evalita 2009 | Spectral Embeddings | ACL 2018 Paper | 75.98  |
| Hindi     | FIRE 2014  | fastText Embeddings | ACL 2018 Paper | 64.93    |
| Marathi   | FIRE 2014  | fastText Embeddings | ACL 2018 Paper | 61.46    |
| Bengali   | FIRE 2014  | fastText Embeddings | ACL 2018 Paper | 55.61    |
| Malayalam | FIRE 2014  | fastText Embeddings | ACL 2018 Paper | 64.59    |
| Tamil     | FIRE 2014  | fastText Embeddings | ACL 2018 Paper | 65.39    |

PPS: The monolingual NER performance for Bengali, Tamil and Malayalam differs from the published results because of certain pre-processing steps that were not performed in the ACL 2018 paper. We observed that some of the sentences are longer than 200 words. Manually splitting these longer sentences into smaller ones using ‘|’ as the delimiter led to a substantial improvement. Also, these models are trained using Common Crawl embeddings as opposed to Wikipedia embeddings.
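The splitting above was done manually, so the following is only a rough sketch of how it could be automated, not the procedure actually used. It assumes each sentence is a list of (word, tag) pairs and that ‘|’ appears as its own token; the function name `split_long_sentence` and its parameters are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, NER tag); assumed data layout


def split_long_sentence(
    sentence: List[Token],
    max_len: int = 200,      # threshold taken from the note above
    delimiter: str = "|",    # assumed to occur as a standalone token
) -> List[List[Token]]:
    """Split a sentence longer than max_len after every delimiter token.

    Sentences at or below max_len are returned unchanged. A piece can
    still exceed max_len if it contains no delimiter token.
    """
    if len(sentence) <= max_len:
        return [sentence]

    pieces: List[List[Token]] = []
    current: List[Token] = []
    for word, tag in sentence:
        current.append((word, tag))
        if word == delimiter:
            # Close the current piece right after the delimiter.
            pieces.append(current)
            current = []
    if current:
        # Keep any trailing tokens that follow the last delimiter.
        pieces.append(current)
    return pieces
```

The delimiter token stays at the end of each piece, so no tokens or tags are dropped and the pieces can be concatenated back into the original sentence.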

Rudra Murthy V
Browses memes and watches series to escape from reality

My research interests include multilingual learning for various Natural Language Processing Tasks.
