Elevating Twi NLP through Diverse and Verified Datasets
1University of Education, Winneba
2ML Collective
This study aims to create a scalable, high-quality Twi-English parallel corpus with wide domain coverage, an update mechanism, and a human-in-the-loop evaluation pipeline.
The efficacy of LLMs like GPT largely depends on the quality and diversity of their training datasets. However, for languages such as Twi, there is a noticeable scarcity of high-quality, diverse datasets. Preliminary research showed that existing datasets are often derived from limited contexts, such as religious texts, which do not provide the breadth needed to develop robust language models.
This hinders representation in multilingual LLMs and the development of downstream tools for education, healthcare, and civic tech.
A large Twi-English corpus covering multiple domains, including literature, healthcare, social media, and news articles. This diverse dataset will enable more robust and contextually aware language models for Twi.
Unlike static datasets, our framework automatically replaces outdated translations when stronger models (e.g., DeepSeek) become available. This ensures the corpus remains current and improves over time with advances in translation technology.
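One minimal way to realise this update mechanism is to store each translation with its provenance (model name and an estimated quality score), so a newer model's output can replace an older one only when it scores higher. The sketch below is a hypothetical illustration, not the project's actual implementation; the `TranslationEntry` fields and the scalar quality score are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TranslationEntry:
    source: str        # Twi sentence
    translation: str   # English translation
    model: str         # model that produced the translation
    quality: float     # estimated quality score in [0, 1]

def maybe_update(entry: TranslationEntry, candidate: TranslationEntry) -> TranslationEntry:
    """Keep whichever translation scores higher; ties favour the existing entry."""
    return candidate if candidate.quality > entry.quality else entry
```

Keeping the superseded entry's model name in a history log (not shown) would also let reviewers audit how translations change as stronger models arrive.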
An intuitive web and mobile interface enables expert reviewers to validate translations, suggest corrections, and track quality over time. This human-in-the-loop approach ensures high-quality data while maintaining scalability and efficiency.
Check back soon for tool release!
We source raw textual data from diverse domains, including literature, health, social media, and news articles. Texts are cleaned and aligned at the sentence level. To address domain imbalance, we supplement underrepresented categories with synthetically generated sentence pairs that simulate natural use cases.
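The cleaning and sentence-level alignment step could look roughly like the following sketch. It is a simplified assumption of the pipeline: the regex-based sentence splitter and the strict 1:1 length-based alignment stand in for the language-aware tools a production system would use.

```python
import re

def clean(text: str) -> str:
    """Normalize whitespace and drop stray markup fragments."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    """Naive split on terminal punctuation; a real pipeline would use
    a language-aware sentence tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def align(twi_doc: str, eng_doc: str) -> list[tuple[str, str]]:
    """1:1 alignment kept only when both sides have the same sentence count."""
    twi = split_sentences(clean(twi_doc))
    eng = split_sentences(clean(eng_doc))
    return list(zip(twi, eng)) if len(twi) == len(eng) else []
```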
Initial translations are generated using large language models (LLMs) such as NLLB-200 and Gemini 1.5, prompted with example-rich templates to improve fluency and contextual relevance. Google Translate will also be used to improve translation quality.
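An example-rich (few-shot) template of the kind described above might be assembled as follows. The in-context example pairs and the `build_prompt` helper are illustrative assumptions, not the project's actual prompts.

```python
# Hypothetical in-context Twi-English example pairs.
FEW_SHOT = [
    ("Maakye.", "Good morning."),
    ("Wo ho te sɛn?", "How are you?"),
]

def build_prompt(twi_sentence: str) -> str:
    """Assemble a few-shot translation prompt for an LLM."""
    lines = ["Translate the following Twi sentences into English."]
    for twi, eng in FEW_SHOT:
        lines.append(f"Twi: {twi}\nEnglish: {eng}")
    lines.append(f"Twi: {twi_sentence}\nEnglish:")
    return "\n\n".join(lines)
```

The trailing `English:` leaves the completion slot open, so the model's next tokens are the translation itself.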
Selected samples undergo final review via an interactive interface where linguists and fluent speakers evaluate translations based on clarity, correctness, and cultural fit.
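The three review criteria (clarity, correctness, cultural fit) could be recorded per reviewer as a simple structured rating; the 1-5 scale, the averaging rule, and the acceptance threshold below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Review:
    reviewer: str
    clarity: int       # 1-5 rating
    correctness: int   # 1-5 rating
    cultural_fit: int  # 1-5 rating

    def accept(self, threshold: float = 4.0) -> bool:
        """Accept a translation when the mean rating meets the threshold."""
        return (self.clarity + self.correctness + self.cultural_fit) / 3 >= threshold
```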
The dataset will directly support the development of machine translation systems, voice assistants, and educational tools tailored to Twi speakers. In healthcare, it can improve patient-doctor communication through accurate translation of medical instructions and consent forms.
In governance, it can enable the delivery of public service announcements and legal information in a language that over 9 million Ghanaians understand. By grounding NLP systems in local linguistic realities, our work lays the foundation for inclusive AI in West Africa.
Our approach addresses key challenges in low-resource language development: data scarcity, limited contextual diversity, and the lack of verification pipelines. By integrating human expertise with automated translation and validation tools, we lay the groundwork for scalable, high-quality dataset creation.
@article{mireku2025,
  author = {Mireku, Jackline and Mireku, Prince},
  title  = {From Limited Contexts to Rich Data: Elevating Twi NLP},
  year   = {2025},
}
For any questions, please contact us at mirekuprince66@gmail.com.