Evaluating Large Language Models for African Languages: Performance Gaps and Metric Reliability for Hausa and Fongbe
A new study investigates how well leading large language models (LLMs) translate from English into Hausa and Fongbe, two distinct West African languages. The research evaluates four prominent LLMs of varying scale (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, and Qwen2.5-7B) using both automatic metrics and human evaluation by native speakers. The goal is to measure LLM performance in low-resource linguistic contexts and to test how reliable standard evaluation metrics are in those contexts.
The findings reveal large disparities in translation quality. Hausa translations reached acceptable human-rated quality (4.0-4.5 out of 5), while Fongbe translations were rated poor (1.0-2.2 out of 5), and BLEU scores for Hausa were consistently about three times those for Fongbe. Model rankings also differed between the two languages, with Gemini performing best for Fongbe and GPT-4o Mini for Hausa. This suggests that an LLM's performance on one low-resource African language does not reliably predict its performance on another, underscoring the need for language-specific evaluations.
The research also found that agreement between automatic metrics and human judgment varied sharply by language. For Fongbe, the rank correlation between the two was perfect. For Hausa it was weak (rho = 0.5), with human evaluators preferring GPT-4o Mini even when automatic metrics ranked Claude higher. The study also identified embedding collapse in neural metrics such as BERTScore for both languages, which limits their ability to differentiate translation quality. These issues raise concerns about relying solely on automated scores to assess LLM efficacy in diverse linguistic environments.
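To make the rank-correlation figures concrete, here is a minimal sketch of how such a value can be computed. It assumes the reported rho is Spearman's rank correlation, which the summary does not state explicitly. The system names and scores are invented for illustration and are not data from the study.

```python
def ranks(scores):
    """Rank systems from best (1) to worst, assuming no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman_rho(metric_scores, human_scores):
    """Spearman's rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    metric_ranks = ranks(metric_scores)
    human_ranks = ranks(human_scores)
    n = len(metric_ranks)
    d_squared = sum(
        (metric_ranks[name] - human_ranks[name]) ** 2 for name in metric_ranks
    )
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores, not taken from the study.
metric = {"system_a": 30.0, "system_b": 25.0, "system_c": 20.0}
human_same_order = {"system_a": 4.5, "system_b": 4.2, "system_c": 4.0}
human_top_two_swapped = {"system_a": 4.2, "system_b": 4.5, "system_c": 4.0}

print(spearman_rho(metric, human_same_order))       # 1.0
print(spearman_rho(metric, human_top_two_swapped))  # 0.5
```

With only a handful of systems, a single disagreement about which system comes first is enough to move rho from 1.0 to 0.5, so a rank correlation over so few systems is a coarse measure.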
Based on these findings, the researchers recommend a multi-metric evaluation strategy for low-resource African languages and advise particular caution when interpreting neural metrics. They also report that at least 2,500 sentences are needed for stable system rankings, as smaller datasets produced unreliable results. The study offers practical guidance for building and evaluating AI translation tools for African languages.
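The summary does not describe how the 2,500-sentence threshold was derived. One generic way to check ranking stability is to draw repeated random subsamples of a test set and count how often the system ranking matches the ranking on the full set. The sketch below illustrates that idea only. The function names, the synthetic per-sentence scores, and the two systems are all hypothetical.

```python
import random


def ranking(per_sentence_scores, indices):
    """Order systems by mean score over the chosen sentence indices, best first."""
    means = {
        name: sum(scores[i] for i in indices) / len(indices)
        for name, scores in per_sentence_scores.items()
    }
    return tuple(sorted(means, key=means.get, reverse=True))


def ranking_agreement(per_sentence_scores, sample_size, trials=200, seed=0):
    """Fraction of random subsamples whose ranking matches the full-set ranking."""
    rng = random.Random(seed)
    total = len(next(iter(per_sentence_scores.values())))
    full_ranking = ranking(per_sentence_scores, range(total))
    hits = 0
    for _ in range(trials):
        # Sample sentence indices without replacement.
        subset = rng.sample(range(total), sample_size)
        if ranking(per_sentence_scores, subset) == full_ranking:
            hits += 1
    return hits / trials


# Synthetic per-sentence scores for two hypothetical systems of similar quality.
data_rng = random.Random(42)
synthetic = {
    "system_a": [data_rng.gauss(0.52, 0.2) for _ in range(3000)],
    "system_b": [data_rng.gauss(0.50, 0.2) for _ in range(3000)],
}

for size in (100, 500, 2500):
    print(size, ranking_agreement(synthetic, size))
```

When two systems are close in quality, small subsamples tend to reorder them more often than large ones, which is the kind of instability the sample-size recommendation is meant to avoid.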