MULTILINGUAL CODE-MIXED TEXT NORMALIZATION USING DEEP LE

Mr. Ganesh Bhagwat

doi:10.7492/5g2mpn62

Keywords:

Code-Mixed Text, Text Normalization, Deep Learning, Multilingual NLP, Transformer, Sequence-to-Sequence

Abstract

The increasing adoption of digital communication platforms has significantly altered the structure and style of written language. On social media, messaging applications, online forums, and service portals, users write short, informal, conversational expressions rather than grammatically structured sentences. In multilingual societies such as India, this behavior is more complex because users frequently mix multiple languages within a single expression.

Such communication, commonly referred to as code-mixed text, often combines vocabulary from different languages with phonetic spellings, abbreviations, and inconsistent grammatical structures. Words from regional languages are frequently written in Roman script rather than in their native scripts, producing multiple orthographic variations for the same semantic concept. While humans interpret such text effortlessly using contextual knowledge, Natural Language Processing (NLP) systems struggle to process it reliably because most models are trained on clean, standardized datasets.
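
To make the orthographic-variation problem concrete, the following minimal Python sketch maps several Roman-script spellings of the same word to one canonical form. It is a dictionary-and-rule baseline used purely for illustration, not the deep learning approach this paper proposes; the variant table `CANONICAL`, the helper names, and the example words are all hypothetical assumptions.

```python
import re

# Hypothetical variant table (an assumption for illustration):
# several Roman-script spellings map to a single canonical form.
CANONICAL = {
    "nahi": "nahin",
    "nhi": "nahin",
    "nahin": "nahin",
    "kia": "kya",
    "kya": "kya",
    "plz": "please",
    "pls": "please",
}


def squeeze_repeats(token: str) -> str:
    """Collapse runs of three or more identical characters to one."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def normalize(text: str) -> str:
    """Lowercase, squeeze elongated spellings, then apply the variant table."""
    normalized = []
    for token in text.lower().split():
        token = squeeze_repeats(token)
        # Unknown tokens pass through unchanged.
        normalized.append(CANONICAL.get(token, token))
    return " ".join(normalized)


if __name__ == "__main__":
    print(normalize("mujhe nhi pata"))
    print(normalize("mujhe nahiii pata"))
    print(normalize("Plz batao kia hua"))
```

A fixed table like this cannot cover unseen spellings or use context to disambiguate, which is the gap that learned sequence-to-sequence models are meant to address.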