paper
arXiv cs.CL
November 18th, 2025 at 5:00 AM

Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language

arXiv:2311.11142v2

Abstract: Bangla encompasses a rich mix of regional dialects that contribute to the cultural diversity of the Bangla-speaking community. Despite extensive prior work on translating Bangla to English, English to Bangla, and Banglish to Bangla, translating Bangla regional dialects into standard Bangla has remained a noticeable gap. In this study, we set out to fill this gap by creating a collection of 32,500 sentences, encompassing Bangla, Banglish, and English, representing five regional Bangla dialects. Our aim is to translate these regional dialects into standard Bangla and to detect each dialect's region accurately. To tackle the translation and region detection tasks, we propose two novel models: DialectBanglaT5 for translating regional dialects into standard Bangla and DialectBanglaBERT for identifying the dialect's region of origin. DialectBanglaT5 performs well across all dialects, with its best scores on the Mymensingh dialect: the highest BLEU of 71.93 and METEOR of 0.8503, and the lowest WER of 0.1470 and CER of 0.0791. It also achieves strong ROUGE scores across all dialects, indicating both accuracy and fluency in capturing dialectal nuances. In parallel, DialectBanglaBERT achieves an overall region classification accuracy of 89.02%, with notable F1-scores of 0.9241 for Chittagong and 0.8736 for Mymensingh, confirming its effectiveness in handling regional linguistic variation. This is the first large-scale investigation focused on Bangla regional dialect translation and region detection. Our proposed models highlight the potential of dialect-specific modeling and set a new benchmark for future research in low-resource and dialect-rich language settings.
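For readers unfamiliar with the error-rate metrics quoted in the abstract, the following is a minimal sketch of how WER (word error rate) and CER (character error rate) are conventionally defined: the edit distance between a system output and a reference, divided by the reference length. The abstract does not describe the authors' evaluation code, so the function names `edit_distance`, `wer`, and `cer` are illustrative assumptions, as are whitespace tokenization and counting characters as Unicode code points. The example strings are toy placeholders, not data from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Works on any sequences of comparable items (words or characters).
    """
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count.

    Assumption: words are separated by whitespace.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.

    Assumption: each Unicode code point counts as one character.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Toy placeholder strings, not sentences from the dataset.
    reference = "w1 w2 w3 w4"
    hypothesis = "w1 wX w3 w4"
    print("WER:", wer(reference, hypothesis))
    print("CER:", cer(reference, hypothesis))
```

Lower values are better for both metrics, which is why the abstract reports the lowest WER and CER alongside the highest BLEU and METEOR. For Bangla text, counting code points treats combining vowel signs as separate characters; an evaluation that counts grapheme clusters instead would yield different CER values.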

#ai
#research

Canonical link: https://arxiv.org/abs/2311.11142