AfricaDailyAI

New AI Corpus Bridges Scientific Knowledge Gap in African Languages

The dominance of colonial languages in African education and scientific discourse is a significant barrier for hundreds of millions of speakers of indigenous languages, limiting both their access to scientific knowledge and their ability to produce it. A core problem is that scientific terminology in these African languages is underdeveloped, which hinders effective communication and learning.

In response, researchers have introduced AfriScience-MT, a parallel corpus for machine translation that covers eleven scientific domains and six African languages: Amharic, Hausa, Luganda, Northern Sotho, Yorùbá, and isiZulu. To build it, professional translators worked alongside expert science communicators to translate plain-language scientific summaries and, crucially, to coin new scientific terms where none previously existed.

The researchers then used the corpus to benchmark machine translation systems and large language models (LLMs) in zero-shot, few-shot, and fine-tuned configurations. The results indicate that closed-source models such as GPT-5.4 and Gemini-3.1-Flash-Lite currently perform best at both the sentence and the document level. Fine-tuned open-source systems such as NLLB-1.3B also showed promising results, suggesting a pathway to accessible AI solutions.
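For readers unfamiliar with the terms, the difference between a zero-shot and a few-shot configuration can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual benchmark setup: the `Pair` class, the `build_prompt` function, the prompt wording, and the assumption that the source text is English are all hypothetical, and the example translations are placeholders rather than real corpus entries.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pair:
    """One aligned sentence pair from a parallel corpus (hypothetical layout)."""
    source: str  # source-language sentence (assumed here to be English)
    target: str  # reference translation in the African language


def build_prompt(text: str, target_lang: str, examples: Sequence[Pair] = ()) -> str:
    """Build a translation prompt for an LLM.

    With no examples the prompt is zero-shot; with one or more
    worked examples taken from the corpus it is few-shot.
    """
    lines = [f"Translate the following scientific text from English to {target_lang}."]
    for ex in examples:
        # Each demonstration shows the model a source sentence and its translation.
        lines.append(f"English: {ex.source}")
        lines.append(f"{target_lang}: {ex.target}")
    # The sentence to translate comes last, with the target side left open.
    lines.append(f"English: {text}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


# Placeholder demonstrations; a real run would draw these from the corpus.
demos = [
    Pair("Vaccines train the immune system.", "<reference translation 1>"),
    Pair("Plants absorb carbon dioxide.", "<reference translation 2>"),
]

zero_shot = build_prompt("Cells divide to form new cells.", "Hausa")
few_shot = build_prompt("Cells divide to form new cells.", "Hausa", demos)
```

The fine-tuned configuration differs in kind: rather than placing examples in the prompt, the corpus pairs are used to update the model's weights before it is evaluated.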

AfriScience-MT represents a step towards decolonizing science in Africa by enabling people to engage with scientific concepts in their native languages. By making the corpus publicly available, the project aims to encourage further research and development in scientific machine translation for African languages, with the goals of improving scientific literacy, promoting indigenous knowledge production, and contributing to a more inclusive and equitable scientific landscape across the continent.
