Wojood - Arabic NER
Wojood consists of about 550K tokens of Modern Standard Arabic (MSA) and dialectal Arabic, manually annotated with 21 entity types (e.g., person, organization, location, event, and date). It covers multiple domains and was annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. A nested named entity recognition (NER) model based on BERT was trained on the corpus and achieved an F1-score of 88.4%.

Mohammed Khalilia (محمد عبد الستار قاسم) is a researcher, computer scientist, and data scientist with a PhD in Computer Science from the University of Missouri. Following his doctorate, he joined Georgia Tech’s Computational Science and Engineering school and Emory University as a Postdoctoral Fellow, where his research spanned predictive modeling, relational cluster analysis, and health and nursing informatics.
He then spent nearly five years at Amazon, working across Amazon Web Services (AWS) and Amazon Studios in natural language processing (NLP), speech synthesis, and computer vision. In 2018, he was part of the team that launched Comprehend Medical, Amazon’s NLP service for clinical and biomedical text. At Qualtrics, he developed the company’s first fine-tuned large language model, trained a synthetic sampling model, and worked on conversational machine learning and active learning.
He is also an adjunct professor at Birzeit University, where he teaches NLP courses for doctoral students.