LINGUIST List 36.2795
Wed Sep 17 2025
Software: GaroVec: Word Embeddings for A’chik/Garo Language Technology
Editor for this issue: Daniel Swanson <daniellinguistlist.org>
Date: 16-Sep-2025
From: B Nyalang <nyalangmwirelabs.com>
Subject: GaroVec: Word Embeddings for A’chik/Garo Language Technology
GaroVec is a set of static word embeddings trained on curated monolingual corpora in Garo (A’chik), a language spoken across Meghalaya and other parts of Northeast India. Developed by MWire Labs, this resource is part of a growing effort to support inclusive, regionally grounded NLP for underrepresented languages.
Linguistic Context
Garo belongs to the Tibeto-Burman family and is widely spoken in districts like West Garo Hills, East Garo Hills, and South Garo Hills. Despite its vitality, Garo remains digitally underserved—especially in foundational NLP infrastructure. GaroVec aims to support a range of downstream tasks while respecting the linguistic diversity and cultural depth of A’chik communities.
Model Overview
- Type: Static word embeddings (FastText-style)
- Dimensions: 300
- Training Data: Cleaned and deduplicated Garo monolingual corpora
- License: CC BY 4.0 (permissive for research, civic tech, and educational use, with attribution)
- Hosted on: Hugging Face, with full documentation: https://huggingface.co/MWirelabs/GaroVec
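As a rough illustration of what "FastText-style" usually implies, the sketch below builds a word vector by averaging the vectors of the word's character n-grams. This is a minimal sketch under stated assumptions, not GaroVec's actual training or lookup code: the announcement does not say whether GaroVec stores subword vectors, and the names `char_ngrams`, `bucket_vector`, and `word_vector`, the bucket count, the n-gram range, and the CRC32 hash are all hypothetical stand-ins. Random vectors take the place of trained ones, and the dimension is reduced from 300 to 8 for readability.

```python
import random
import zlib

DIM = 8          # illustration only; GaroVec vectors have 300 dimensions
BUCKETS = 1000   # hypothetical size of the n-gram hash table


def char_ngrams(word, min_n=3, max_n=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def bucket_vector(bucket):
    """Deterministic pseudo-random vector standing in for a trained one."""
    rng = random.Random(bucket)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def word_vector(word):
    """Average the vectors of the hash buckets of the word's n-grams."""
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * DIM
    total = [0.0] * DIM
    for gram in grams:
        # CRC32 is a stand-in hash chosen for reproducibility (assumption).
        bucket = zlib.crc32(gram.encode("utf-8")) % BUCKETS
        for k, value in enumerate(bucket_vector(bucket)):
            total[k] += value
    return [value / len(grams) for value in total]


# Even a word never seen as a whole still receives a vector from its n-grams.
vector = word_vector("A’chik")
print(len(vector))
```

Because the vector is assembled from subword pieces, spelling variants that share most of their n-grams end up with related vectors, which is one reason this family of models is often chosen for low-resource languages.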
Use Cases
GaroVec is designed to be modular and adaptable. It can support:
- Semantic search and clustering
- Text classification and topic modeling
- Dialectal variation analysis
- Educational tools and civic applications
- Cross-lingual transfer for low-resource modeling
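Several of the use cases above, semantic search and clustering in particular, come down to comparing vectors by cosine similarity. The following is a minimal sketch of a nearest-neighbour lookup over a toy embedding table. The entries `word_a` through `word_d` and their 3-dimensional vectors are placeholders invented for illustration, not Garo vocabulary or real GaroVec values, and the helper names `cosine` and `nearest` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(query, table, k=2):
    """Return the k entries most similar to `query`, excluding the query itself."""
    scores = [
        (cosine(table[query], vec), word)
        for word, vec in table.items()
        if word != query
    ]
    scores.sort(reverse=True)
    return [word for _, word in scores[:k]]


# Placeholder table; a real one would map Garo words to 300-dimensional vectors.
toy_table = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.0, 1.0, 0.0],
    "word_d": [0.0, 0.0, 1.0],
}

print(nearest("word_a", toy_table, k=1))
```

With a full vocabulary, a linear scan like this becomes slow, and an approximate nearest-neighbour index would be the usual replacement.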
Inclusive Design
This model is part of a broader movement to make language technology more inclusive—especially for communities whose languages are often overlooked in mainstream NLP. GaroVec is released with permissive licensing and timestamped provenance to encourage reuse, adaptation, and collaboration.
Linguistic Field(s): Computational Linguistics; General Linguistics; Language Documentation; Semantics; Text/Corpus Linguistics
Subject Language(s): Garo (grt)
Language Family(ies): Tibeto-Burman
Page Updated: 17-Sep-2025