Arabic Fine-Grained Entity Recognition

Abstract

Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with $31$ subtypes. To do this, we first revised Wojood’s annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC’s ACE guidelines, which yielded $5,614$ changes. Second, all mentions of GPE, LOC, ORG and FAC (44K) in Wojood are manually annotated with the LDC's ACE subtypes. We refer to this extended version of Wojood as Wojood-Fine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen’s Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute the baselines of Wojood-Fine, we fine-tune three pre-trained Arabic BERT encoders in three settings flat NER, nested NER and nested NER with subtypes and achieved F1 score of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.

Publication
Proceedings of The First Arabic Natural Language Processing Conference (ArabicNLP 2023, co-hosted with EMNLP 2023)
Click the Cite button above to demo the feature to enable visitors to import publication metadata into their reference management software.
Create your slides in Markdown - click the Slides button to check out the example.

Supplementary notes can be added here, including code, math, and images.