Universal Dependencies Corpus for Zomi: Promoting Digital Inclusion

Natural Language Processing (NLP) is at the heart of today’s digital world. It powers translation tools, voice assistants, and grammar checkers. While languages like English, Chinese, and Spanish benefit from decades of research and abundant digital resources, many minority languages remain digitally invisible. For example, Zomi is spoken by over 600,000 people in Myanmar, India, and diaspora communities. A recent Master’s thesis at the University of Strasbourg’s TCLoc program has taken a groundbreaking step to change this. Master’s student Tun Tun Aung created the first Universal Dependencies (UD)-compliant morphological corpus for Zomi. This work lays the foundation for future language technologies and digital inclusion.

Why Zomi Matters in the Digital Age

Zomi, also known as Tedim Chin or Paite, belongs to the Kuki-Chin subgroup of Tibeto-Burman languages. Despite its large speaker community, Zomi has little digital representation. There are no translation tools, no standardized input systems, and no NLP resources to support education or communication.

This lack of visibility in the digital space is more than a technical issue; it is a question of cultural preservation and social justice. As UNESCO highlights, languages excluded from technology risk marginalization and even endangerment.

By developing a UD-compliant corpus, this project makes it possible for Zomi to be included in global research, machine learning, and future NLP applications.

Building the Zomi UD Corpus

The Zomi corpus includes 10,583 tokens collected from:

Zomi Bible translations (formal, structured language)
Online news websites (modern journalistic language)
Social media posts (informal, user-generated content)

Each token was annotated with:

Part-of-speech tags
Morphological features (tense, aspect, case, evidentiality)
Syntactic dependencies following UD conventions

To handle Zomi’s unique linguistic features, such as agglutinative verb morphology, clause chaining, pronominal clitics, and reduplication, Tun Tun Aung developed custom annotation guidelines. Tools like UD Annotatrix, INCEpTION, and spaCy supported the process, with manual review ensuring quality.
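UD treebanks are conventionally stored in the CoNLL-U format, where each token is one line of ten tab-separated columns. Assuming the Zomi corpus follows that convention, the three annotation layers listed above correspond to the UPOS, FEATS, and HEAD/DEPREL columns. The sketch below is only an illustration: the sample sentence uses placeholder forms (SUBJ, OBJ, VERB) and illustrative feature values, not real Zomi data from the thesis, and the function names are hypothetical.

```python
# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_feats(feats):
    """Turn 'Aspect=Perf|Tense=Past' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu_sentence(text):
    """Parse one CoNLL-U sentence into a list of token dicts.

    Simplifying assumption: no multiword-token or empty-node lines,
    so every HEAD value is an integer.
    """
    tokens = []
    for line in text.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        token = dict(zip(FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens


# Placeholder sentence: the forms are NOT real Zomi words, and the
# feature values are illustrative only.
SAMPLE = "\n".join([
    "# text = SUBJ OBJ VERB",
    "\t".join(["1", "SUBJ", "SUBJ", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "OBJ", "OBJ", "NOUN", "_", "_", "3", "obj", "_", "_"]),
    "\t".join(["3", "VERB", "VERB", "VERB", "_",
               "Aspect=Perf|Evident=Fh|Tense=Past", "0", "root", "_", "_"]),
])

if __name__ == "__main__":
    for tok in parse_conllu_sentence(SAMPLE):
        print(tok["form"], tok["upos"], tok["feats"], tok["head"], tok["deprel"])
```

Keeping the features as a dictionary makes it easy to query, for example, every verb that carries an evidentiality value.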

Key Linguistic Insights

The annotated Zomi corpus revealed several distinctive features:

Verb-final morphology: Verbs stack multiple suffixes for tense, aspect, mood, and evidentiality.
Pronominal clitics: Subjects and objects are attached to verbs as bound morphemes.
SOV word order: Sentences typically follow a Subject–Object–Verb pattern.
Clause chaining: Multiple subordinate clauses precede the main verb.
Reduplication and classifiers: Used for emphasis and quantification.

These insights not only help describe Zomi more precisely but also provide valuable data for comparative research across Tibeto-Burman languages.
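An observation such as SOV word order can be checked directly against dependency annotations. The following is a minimal sketch, not the method used in the thesis: it assumes each token is a dictionary with integer "id" and "head" values plus "upos" and "deprel" labels, and the function name and sample sentences are hypothetical placeholders rather than corpus data.

```python
from collections import Counter


def constituent_order(tokens):
    """Return the order of subject, object, and verb (e.g. 'SOV') for the
    first verb that has both an nsubj and an obj dependent, else None."""
    for verb in tokens:
        if verb["upos"] != "VERB":
            continue
        # Map each dependency label to the position of the dependent.
        deps = {t["deprel"]: t["id"] for t in tokens if t["head"] == verb["id"]}
        if "nsubj" in deps and "obj" in deps:
            positions = {"S": deps["nsubj"], "O": deps["obj"], "V": verb["id"]}
            # Sort the letters S, O, V by their position in the sentence.
            return "".join(sorted(positions, key=positions.get))
    return None


# Placeholder sentences (not real corpus data).
sov_sentence = [
    {"id": 1, "upos": "NOUN", "head": 3, "deprel": "nsubj"},
    {"id": 2, "upos": "NOUN", "head": 3, "deprel": "obj"},
    {"id": 3, "upos": "VERB", "head": 0, "deprel": "root"},
]
svo_sentence = [
    {"id": 1, "upos": "NOUN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "upos": "NOUN", "head": 2, "deprel": "obj"},
]

if __name__ == "__main__":
    counts = Counter(constituent_order(s) for s in [sov_sentence, svo_sentence])
    print(counts)
```

Run over a whole treebank, the resulting counts give a rough picture of how dominant each order is, which is the kind of figure comparative studies rely on.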

From Corpus to NLP Applications

Although modest in size, the Zomi corpus has already been used in baseline NLP experiments such as part-of-speech tagging and dependency parsing. Results showed promise, though challenges remain in handling complex verb forms and clause structures.

More importantly, this project demonstrates that even small annotated datasets can:

Enable cross-linguistic transfer learning with multilingual models like XLM-R
Serve as training data for Zomi-specific NLP tools
Provide a foundation for machine translation, grammar checking, and digital education resources
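Baseline dependency parsing experiments like those mentioned above are commonly scored with unlabeled and labeled attachment scores (UAS and LAS). The sketch below shows how these two metrics are computed. It is not the evaluation code from the thesis, and the gold and predicted values are invented placeholders, not reported results.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) from parallel lists of (head, deprel) pairs.

    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the dependency label to match.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / total, las_hits / total


# Invented example: the parser finds every head but mislabels one relation.
gold = [(3, "nsubj"), (3, "obj"), (0, "root")]
predicted = [(3, "nsubj"), (3, "nsubj"), (0, "root")]

if __name__ == "__main__":
    uas, las = attachment_scores(gold, predicted)
    print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

Reporting both numbers helps separate structural errors (wrong head) from labeling errors, which is useful when complex verb forms and clause chains are the main source of difficulty.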

Toward Digital Inclusion and Language Preservation

By creating this resource, Tun Tun Aung has taken an important step toward digital inclusion for the Zomi-speaking community. The corpus supports:

Language preservation in the digital age
Educational resources for teachers and learners
Cultural visibility for Zomi speakers online
Global linguistic diversity, by bringing a marginalized language into the UD ecosystem

This work also serves as a replicable model for other under-resourced languages worldwide, proving that building small, well-annotated corpora is both feasible and impactful.

Future research aims to expand the Zomi corpus, develop parallel translation datasets, and train machine-learning models for Zomi and related languages. These efforts could make Zomi available in tools like Google Translate, voice assistants, and educational platforms, bridging the gap between minority languages and the digital world.

The creation of a Universal Dependencies morphological corpus for Zomi is more than a technical achievement; it is another milestone in linguistic equity, cultural preservation, and digital empowerment.

TCLoc Master’s projects and academic papers demonstrate how research produces tangible, real-world impact by giving a digital voice to languages at risk of being left behind. With guidance from top-notch NLP experts, Alex Yanishevski and Pablo Ruiz Fabo, the program bridges theory and practice, ensuring that innovations developed in the classroom shape the future of global communication. Renate de la Paix and Jeff Allen served on the thesis defense committee.

To request access to the full Master’s thesis, please use the contact form at https://mastertcloc.unistra.fr/contact-us/ or reach out to Tun Tun Aung.