Data Anonymization using NER
Customers own their data, and while a data use agreement permits the use of anonymized data, raw data cannot be used
for model training. Existing anonymization tools are rule-based, shallow, English-only, and unable to anonymize text accurately,
resulting in high false negative rates. Effective anonymization requires both technical solutions and human intervention, an
approach that was feasible at small scale but grows increasingly challenging as demand on annotators rises. This bottleneck
directly limits ML applications, particularly data-hungry tasks such as pre-training LLMs and supervised fine-tuning, which
require billions to trillions of tokens. The problem is further compounded by the diversity of data sources — surveys, interviews,
product feedback, and conversational data — along with the risk of data and knowledge leakage between brands.
To overcome these challenges, a named entity recognition (NER) model was trained to classify each word into one of
37 possible classes. The model was trained on six languages.
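
To make the idea concrete, here is a minimal sketch of how word-level NER predictions could drive anonymization. It is an illustration only, not the system described in the talk: the BIO tagging scheme, the `PERSON` and `LOCATION` class names, and the `anonymize` helper are all assumptions, and the predictions are hard-coded rather than produced by a model.

```python
from typing import List, Optional


def anonymize(tokens: List[str], labels: List[str]) -> str:
    """Replace each labeled entity span with a single [TYPE] placeholder.

    Assumes one label per token in BIO format (e.g. B-PERSON, I-PERSON, O).
    The label names are hypothetical, not the talk's actual 37 classes.
    """
    out: List[str] = []
    prev_type: Optional[str] = None  # type of the span currently being masked
    for token, label in zip(tokens, labels):
        if label == "O":
            # Non-entity tokens pass through unchanged.
            out.append(token)
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        # Emit a placeholder on a B- tag, or on an I- tag that does not
        # continue the span currently being masked.
        if prefix == "B" or ent_type != prev_type:
            out.append(f"[{ent_type}]")
        prev_type = ent_type
    return " ".join(out)


# Hard-coded example predictions standing in for model output.
tokens = ["Maria", "Lopez", "emailed", "support", "from", "Berlin", "."]
labels = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION", "O"]
print(anonymize(tokens, labels))  # [PERSON] emailed support from [LOCATION] .
```

Replacing each span with a typed placeholder rather than deleting it keeps the sentence structure intact, which matters when the anonymized text is later used for model training.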

Mohammed Khalilia (محمد عبد الستار قاسم) is a researcher, computer scientist, and data scientist with a PhD in Computer Science from the University of Missouri. Following his doctorate, he joined Georgia Tech’s Computational Science and Engineering school and Emory University as a Postdoctoral Fellow, where his research spanned predictive modeling, relational cluster analysis, and health and nursing informatics.
He then spent nearly five years at Amazon, working across Amazon Web Services (AWS) and Amazon Studios in natural language processing (NLP), speech synthesis, and computer vision. In 2018, he was part of the team that launched Comprehend Medical, Amazon’s NLP service for clinical and biomedical text. At Qualtrics, he developed the company’s first fine-tuned large language model, trained a synthetic sampling model, and worked on conversational machine learning and active learning.
He is also an adjunct professor at Birzeit University, where he teaches NLP courses for doctoral students.