Monolingual NER Results for Various Languages
The neural NER system that I implemented as part of the TALLIP paper and the ACL 2018 paper achieves the following F1 scores on various languages.
Results
Language | Dataset | Word Embeddings | Reference | F1 Score |
---|---|---|---|---|
English | CoNLL 2003 | Spectral Embeddings | arXiv Paper | 90.94 |
Spanish | CoNLL 2002 | Spectral Embeddings | arXiv Paper | 85.75 |
Dutch | CoNLL 2002 | Spectral Embeddings | arXiv Paper | 85.20 |
German | Link | Spectral Embeddings | ACL 2018 Paper | 87.64 |
Italian | Evalita 2009 | Spectral Embeddings | ACL 2018 Paper | 75.98 |
Hindi | FIRE 2014 | fastText Embeddings | ACL 2018 Paper | 64.93 |
Marathi | FIRE 2014 | fastText Embeddings | ACL 2018 Paper | 61.46 |
Bengali | FIRE 2014 | fastText Embeddings | ACL 2018 Paper | 55.61 |
Malayalam | FIRE 2014 | fastText Embeddings | ACL 2018 Paper | 64.59 |
Tamil | FIRE 2014 | fastText Embeddings | ACL 2018 Paper | 65.39 |
PPS: The monolingual NER performance for Bengali, Tamil and Malayalam differs from the published results because of certain pre-processing steps that were not performed in the ACL 2018 paper. We observed that some of the sentences are longer than 200 words. Manually splitting these longer sentences into smaller ones using ‘|’ as the delimiter led to a substantial improvement. Also, these models are trained using Common Crawl embeddings as opposed to Wikipedia embeddings.
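
The splitting described above was done manually. As a rough illustration only, the same idea could be automated along the following lines. This is a minimal sketch, not the script used for these results: it assumes each sentence is available as a list of `(word, tag)` pairs, that the delimiter appears as its own token, and the function name `split_long_sentence` and its parameters are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: one sentence = list of (word, NER tag) pairs.
Token = Tuple[str, str]


def split_long_sentence(
    sentence: List[Token],
    max_len: int = 200,      # threshold taken from the note above
    delimiter: str = "|",    # delimiter token, assumed to stand alone
) -> List[List[Token]]:
    """Split a sentence at delimiter tokens, but only if it exceeds max_len."""
    if len(sentence) <= max_len:
        # Short sentences are returned unchanged, delimiters included.
        return [sentence]

    pieces: List[List[Token]] = []
    current: List[Token] = []
    for word, tag in sentence:
        current.append((word, tag))
        if word == delimiter:
            # Close the current piece right after the delimiter token.
            pieces.append(current)
            current = []
    if current:
        # Keep any trailing tokens that follow the last delimiter.
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    long_sentence = [("w", "O")] * 120 + [("|", "O")] + [("w", "O")] * 129
    print([len(piece) for piece in split_long_sentence(long_sentence)])
```

A piece can still be longer than `max_len` when the sentence contains no delimiter token, so the output of such a script would still need a manual check.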