LINGUIST List 36.2742
Mon Sep 15 2025
Software: KhasiBERT: Foundational Language Model for Khasi
Editor for this issue: Daniel Swanson <daniel@linguistlist.org>
Date: 12-Sep-2025
From: B Nyalang <jasonnyal@gmail.com>
Subject: KhasiBERT: Foundational Language Model for Khasi
KhasiBERT is the first open-source AI language model trained exclusively on Khasi-language corpora. Developed by MWire Labs, it supports civic NLP tasks such as translation, summarization, and search, and is designed for linguistic preservation and inclusive digital access.
Khasi belongs to the Khasic branch of the Austroasiatic language family and is spoken by over 1.4 million people in Northeast India. Despite its active use, Khasi remains underrepresented in digital infrastructure and linguistic research.
KhasiBERT is publicly available at:
https://mwirelabs.com/models/khasibert
Artifacts include:
- Model weights and training logs
- Corpus preparation methodology
This resource is intended for researchers, educators, and civic technologists working on low-resource NLP, linguistic preservation, and inclusive AI.
Linguistic Field(s): Computational Linguistics; Language Documentation; Text/Corpus Linguistics
Subject Language(s): Khasi (kha)
Page Updated: 15-Sep-2025