AfricaDailyAI
Research | Jul 5, 2026 | Nigeria, Niger, Benin | 93% confidence

Evaluating Large Language Models for African Languages: Performance Gaps and Metric Reliability for Hausa and Fongbe

A new study investigates how well leading large language models (LLMs) translate from English into Hausa and Fongbe, two distinct West African languages. The research evaluates four prominent LLMs (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, and Qwen2.5-7B) across varying scales, using both automatic metrics and human evaluation by native speakers. The aim is to understand how LLMs perform in low-resource linguistic contexts and how reliable standard evaluation metrics are in those settings.

The findings reveal a large disparity in translation quality: Hausa translations received acceptable human ratings (4.0-4.5 out of 5), while Fongbe translations were rated poor (1.0-2.2 out of 5), and the study reports a consistent threefold gap in BLEU scores between the two languages. Model rankings also differed by language, with Gemini performing best for Fongbe and GPT-4o Mini for Hausa. This suggests that an LLM's performance on one low-resource African language does not reliably predict its performance on another, underscoring the need for language-specific evaluations.

The research also found that agreement between automatic metrics and human judgment varied dramatically by language. Fongbe showed perfect rank correlation, whereas Hausa showed only a weak one (rho = 0.5), with human evaluators preferring GPT-4o Mini even though automatic metrics ranked Claude higher. The study also identified embedding collapse in neural metrics such as BERTScore for both languages, which limits their ability to differentiate translation quality. These issues raise concerns about relying solely on automated scores to assess LLM translation quality across diverse languages.
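To make the rank-correlation figures concrete, here is a minimal Python sketch of Spearman's rho between an automatic metric and human ratings. It assumes the reported rho is Spearman's coefficient, which the summary does not state explicitly. The function names and all scores are invented for illustration and are not the study's data or code.

```python
def ranks(values):
    """Rank values from 1 (highest) to n, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if every value in x (or y) is tied.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for three systems (NOT taken from the study).
metric_scores = [30.0, 28.0, 20.0]  # automatic metric, one value per system
human_scores = [4.2, 4.5, 4.0]      # mean human rating, one value per system

print(spearman_rho(metric_scores, human_scores))  # 0.5
```

In this toy case the metric and the human raters disagree only about which of the top two systems is better, yet rho already drops to 0.5. With so few systems being ranked, the coefficient can take only a handful of values, so a single swap moves it a long way.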

Based on these findings, the researchers recommend a multi-metric evaluation strategy for low-resource African languages and advise particular caution when interpreting neural metrics. They also report that at least 2,500 sentences are needed for stable system rankings, since smaller datasets produced unreliable results. The work offers practical guidance for building AI translation tools for African languages and for assessing them accurately.
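One common way to check whether a test set is large enough for stable rankings is to resample it and see how often the system order changes. The sketch below illustrates that idea. It is not necessarily the procedure the researchers used, and it assumes per-sentence scores that can be averaged, which is a simplification for corpus-level metrics such as BLEU. The function names, parameters, and data are hypothetical.

```python
import random


def ranking(scores_by_system, indices):
    """Order system names by mean score over the given sentence indices."""
    means = {
        name: sum(scores[i] for i in indices) / len(indices)
        for name, scores in scores_by_system.items()
    }
    return tuple(sorted(means, key=means.get, reverse=True))


def ranking_stability(scores_by_system, sample_size, trials=200, seed=0):
    """Fraction of resampled test sets whose ranking matches the full set.

    Each trial draws `sample_size` sentence indices with replacement.
    """
    rng = random.Random(seed)
    n_sentences = len(next(iter(scores_by_system.values())))
    full_ranking = ranking(scores_by_system, range(n_sentences))
    matches = 0
    for _ in range(trials):
        sample = [rng.randrange(n_sentences) for _ in range(sample_size)]
        if ranking(scores_by_system, sample) == full_ranking:
            matches += 1
    return matches / trials


# Hypothetical per-sentence scores for two close systems (NOT study data).
data_rng = random.Random(1)
scores = {
    "system_a": [data_rng.gauss(0.52, 0.2) for _ in range(3000)],
    "system_b": [data_rng.gauss(0.50, 0.2) for _ in range(3000)],
}
for size in (100, 500, 2500):
    print(size, ranking_stability(scores, size))
```

When the systems are close in quality, small samples tend to flip the order more often, which is the kind of instability a minimum sample size is meant to guard against.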
