LINGUIST List 36.2795

Wed Sep 17 2025

Software: GaroVec: Word Embeddings for A’chik/Garo Language Technology

Editor for this issue: Daniel Swanson <daniellinguistlist.org>



Date: 16-Sep-2025
From: B Nyalang <nyalangmwirelabs.com>
Subject: GaroVec: Word Embeddings for A’chik/Garo Language Technology

GaroVec is a set of static word embeddings trained on curated monolingual corpora in Garo (A’chik), a language spoken in Meghalaya and other parts of Northeast India. Developed by MWire Labs, this resource is part of a growing effort to support inclusive, regionally grounded NLP for underrepresented languages.

Linguistic Context
Garo belongs to the Tibeto-Burman family and is widely spoken in districts such as West Garo Hills, East Garo Hills, and South Garo Hills. Despite its vitality, Garo remains digitally underserved, especially in foundational NLP infrastructure. GaroVec aims to support a range of downstream tasks while respecting the linguistic diversity and cultural depth of A’chik communities.

Model Overview
- Type: Static word embeddings (FastText-style)
- Dimensions: 300
- Training Data: Cleaned and deduplicated Garo monolingual corpora
- License: CC BY 4.0, permissive for research, civic tech, and educational use with attribution
- Hosted on: Hugging Face, with full documentation: https://huggingface.co/MWirelabs/GaroVec
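
For readers unfamiliar with the term, "FastText-style" usually means that a word is represented through its character n-grams in addition to the word itself, which helps with rare and unseen word forms. The sketch below only illustrates that n-gram decomposition. The function name char_ngrams is hypothetical, the 3 to 6 range is the FastText library default rather than a documented GaroVec setting, and whether the released GaroVec files include subword information is not stated in this announcement.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, FastText-style.

    Hypothetical helper for illustration only; min_n and max_n are the
    FastText defaults, not confirmed GaroVec hyperparameters.
    """
    # FastText wraps each word in boundary markers before slicing.
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Example: decompose a short word with a narrower n-gram range.
print(char_ngrams("garo", min_n=3, max_n=4))
```

In a subword model, the vector for a word is built from the vectors of these n-grams, so a word form missing from the training vocabulary can still receive a representation.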

Use Cases
GaroVec is designed to be modular and adaptable. It can support:
- Semantic search and clustering
- Text classification and topic modeling
- Dialectal variation analysis
- Educational tools and civic applications
- Cross-lingual transfer for low-resource modeling
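
As a rough illustration of the semantic search use case, the sketch below ranks words by cosine similarity to a query word. It is a minimal example under stated assumptions: the tokens word_a through word_d and their 3-dimensional vectors are placeholders rather than real GaroVec entries, and the helper names cosine and nearest are hypothetical. In practice the 300-dimensional vectors would first be loaded from the Hugging Face repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def nearest(query, vectors, k=2):
    """Return the k words most similar to `query`, excluding the query."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder data: NOT real GaroVec vectors or Garo words.
toy_vectors = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.5, 0.5, 0.0],
    "word_d": [0.0, 0.0, 1.0],
}

for word, score in nearest("word_a", toy_vectors, k=2):
    print(word, round(score, 3))
```

The same ranking step underlies clustering and retrieval. For a full vocabulary, a vectorised library or an approximate nearest-neighbour index would be a more practical choice than this pure-Python loop.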

Inclusive Design
This model is part of a broader movement to make language technology more inclusive—especially for communities whose languages are often overlooked in mainstream NLP. GaroVec is released with permissive licensing and timestamped provenance to encourage reuse, adaptation, and collaboration.

Linguistic Field(s): Computational Linguistics
General Linguistics
Language Documentation
Semantics
Text/Corpus Linguistics

Subject Language(s): Garo (grt)

Language Family(ies): Tibeto-Burman




Page Updated: 17-Sep-2025

