<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Transformers | Mohammed Khalilia (محمد عبد الستار قاسم)</title><link>http://mohammedkhalilia.com/tags/transformers/</link><atom:link href="http://mohammedkhalilia.com/tags/transformers/index.xml" rel="self" type="application/rss+xml"/><description>Transformers</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 11 Jan 2024 00:00:00 +0000</lastBuildDate><image><url>http://mohammedkhalilia.com/media/icon_hu_e7e672982174d01f.png</url><title>Transformers</title><link>http://mohammedkhalilia.com/tags/transformers/</link></image><item><title>Data Anonymization using NER</title><link>http://mohammedkhalilia.com/projects/anonymization/</link><pubDate>Thu, 11 Jan 2024 00:00:00 +0000</pubDate><guid>http://mohammedkhalilia.com/projects/anonymization/</guid><description>&lt;p&gt;Customers own their data, and while a data use agreement permits the use of anonymized data, raw data cannot be used
for model training. Existing anonymization tools are rule-based, shallow, and English-only; they cannot anonymize text accurately,
which results in high false negative rates. Effective anonymization requires both technical solutions and human intervention, an
approach that was feasible at small scale but grows increasingly challenging as demand on annotators rises. This bottleneck
directly limits ML applications, particularly data-hungry tasks such as pre-training LLMs and supervised fine-tuning, which
require billions to trillions of tokens. The problem is further compounded by the diversity of data sources (surveys, interviews,
product feedback, and conversational data) and by the risk of data and knowledge leakage between brands.
To overcome these challenges, a named entity recognition (NER) model was trained to classify each word into one of
37 possible classes, across six languages.
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="alt"
srcset="http://mohammedkhalilia.com/projects/anonymization/ner_flow_diagram_hu_182338e7340e9b8d.webp 320w, http://mohammedkhalilia.com/projects/anonymization/ner_flow_diagram_hu_f8bcaf50370daf13.webp 480w, http://mohammedkhalilia.com/projects/anonymization/ner_flow_diagram_hu_19b6b9fde9c9e06c.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="http://mohammedkhalilia.com/projects/anonymization/ner_flow_diagram_hu_182338e7340e9b8d.webp"
width="760"
height="332"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;</description></item></channel></rss>