Building with Africa: Afrocentric AI
Africa is a remarkable continent, home to a dazzling mosaic of diverse communities and breathtaking natural habitats. Renowned for its incredible biodiversity, it boasts sweeping savannas, lush rainforests, majestic mountains, and captivating coastal regions. Throughout its long history, the people of Africa have exemplified ambition, resilience, and innovation — shaping a rich cultural heritage and leaving an indelible mark on the global stage. Among its many treasures, Africa’s linguistic landscape is particularly striking: with over 2,000 languages and dialects spoken, it stands as one of the world’s most linguistically diverse regions. This incredible breadth of communication fosters a uniquely dynamic space for innovation, where new ideas, technologies, and cultural exchanges flourish upon the foundations of rich multilingualism.
However, when it comes to technology, African languages have long remained underrepresented. My team and I set out to address this gap, joining forces with collaborators across Africa and around the globe on a journey to elevate these languages in the digital world. In this article, I offer an account of our journey thus far — told through five pivotal publications: Towards Afrocentric NLP, AfroLID, SERENGETI, Cheetah, and Toucan — highlighting our motivations, key lessons learned, and the transformative potential I envision for communities across the continent.
A Vision of Afrocentric NLP
Paper: “Towards Afrocentric NLP”. ACL 2022, link.
Our journey began with a critical realization: mainstream natural language processing (NLP) research overwhelmingly focused on a select few high-resource languages, leaving African languages severely underrepresented. This gap was not merely academic — it had profound consequences for digital access, education, health, communication, and the preservation of cultural and linguistic heritage. The same disparity extended — and largely continues — in industry settings. For instance, widely used platforms such as Twitter (now X) have historically struggled to identify (or label) text from more than a handful of African languages, typically recognizing only about a dozen. As a result, crucial NLP-driven innovations such as machine translation, speech recognition, text summarization, and sentiment analysis have remained largely inaccessible to speakers of most African languages, underscoring the urgency of our mission.
“Although it has been argued that the best way to achieve cross-linguistically useful NLP is to leverage findings of typological research (Bender, 2016), most NLP work remains Indo-Eurocentric in terms of algorithms for pre-processing, training, and evaluation. This is a mismatch to the fact that every NLP approach requires either explicit or implicit representative linguistic knowledge (O’Horan et al., 2016; Ponti et al., 2019; Bender, 2016)”. (Adebara and Abdul-Mageed, 2022)
In “Towards Afrocentric NLP,” we first illustrate why developing NLP for African languages must be deeply grounded in the linguistic realities of these languages. We detail several typological attributes characteristic of African languages — such as tone systems, vowel harmony, and complex morphological structures — that pose unique challenges and opportunities for NLP. We then call for a paradigm shift. Instead of importing one-size-fits-all approaches, we advocate building technologies rooted directly in the needs and realities of African communities, informed by their own linguistic practices. Practically, this involves accommodating everyday language use — including code-switching and diverse local scripts — and prioritizing collaboration with local researchers, linguists, and stakeholders.
In pursuing this vision, we identify several core obstacles:
- Absence of Progressive Language Policies: Indigenous African languages often receive limited literacy support because language policies, including in education, are absent or insufficient. Such policies, when they exist, rarely ensure sustained, high-quality instruction in Indigenous languages throughout the educational system, and often provide only minimal exposure. To address this, we call for robust policy reforms: integrating Indigenous languages into education and media consistently; promoting community-driven literacy initiatives; developing language-specific educational resources (textbooks, digital media); and creating inclusive NLP technologies tailored to partly literate or newly literate individuals (such as text-to-speech interfaces, simplified reading applications, and visually supported language tools).
- Linguistic Diversity: Africa’s extraordinary linguistic landscape makes it impractical to adopt uniform, monolithic NLP methodologies. Tailoring specialized approaches to each language family, cluster, or even individual languages becomes essential.
- Underrepresentation: Historically, African languages have been scarcely represented in major NLP models, benchmarks, and research initiatives. This chronic exclusion has led to significant gaps in technological support, limiting opportunities for linguistic communities across the continent.
- Data Scarcity: The absence of publicly accessible datasets severely limits NLP model development. Most African languages lack sufficient training data, and many remain entirely unrepresented, creating significant barriers to meaningful computational work.
- Data Quality: When available, automatically collected data (e.g., web crawls) frequently contains considerable noise — such as mislabeled language data or undetected machine-generated text — undermining dataset reliability. We thus advocate a meticulous approach to dataset curation, drawing upon domain-specific texts (such as government publications and religious translations) and content carefully verified by native speakers.
Together, these insights underscore the necessity of community-centered, linguistically informed, and culturally respectful approaches to NLP. This foundation became our roadmap as we embarked on our Afrocentric NLP journey.
Across several language families including Afro-Asiatic, Austronesian, Niger-Congo, Nilo-Saharan, Indo-European and Creole, notable typological features prevalent in African languages “include use of tone, open syllables, vowel harmony, splitting verbs, serial verb construction, reduplication, use of very few or no adjectives”, and “a large number of ideophones”. (Adebara and Abdul-Mageed, 2022)
We left that paper with a bold vision: to develop and deploy language technologies that genuinely speak to Africa’s linguistic wealth. It is not just about building a single model or dataset, but laying the foundation for a movement that aligns technology with local needs, from literacy and organizational policies to education and healthcare.
One of the most important recommendations we would like to emphasize is to prioritize African NLP work based on the needs of African communities. For example, we believe development for data and tools for improving health and education should be a priority. We also caution against extractive practices, and encourage creation of opportunities, contexts, and venues for work on African languages and advocacy for reclaiming African language policies. In addition, data literacy and issues around data sovereignty and privacy should remain of highest importance. We highlighted various communities and venues here that we think should continue to be supported. (Adebara and Abdul-Mageed, 2022)
Unlocking Africa’s Linguistic Landscape
Paper: “AfroLID: A Neural Language Identification Tool for African Languages”. EMNLP 2022, link.
Before developing robust NLP solutions for African languages, we needed to solve a foundational challenge: reliably identifying the language of a given text. Language identification — often straightforward for high-resource languages — becomes complex in African contexts, where languages frequently share scripts, borrow vocabulary extensively, or lack standardized orthographies.
To address this gap, we introduced AfroLID, a large-scale neural language identification model capable of distinguishing among 517 African languages and their variants. AfroLID represented a substantial leap forward, significantly surpassing the coverage of any existing tool. To achieve this, we meticulously curated diverse, multi-domain datasets from news outlets, religious texts, government documents, and community forums, aiming to authentically reflect the linguistic realities and richness of the continent. Key strengths of AfroLID include:
- Truly pan‑African coverage: AfroLID recognizes 517 languages and language varieties spoken in 50 African countries and belonging to 14 genealogical families (e.g., Niger‑Congo, Afro‑Asiatic, Nilo‑Saharan). To accommodate the continent’s orthographic diversity it handles five writing systems: 499 languages in Latin script, eight in Ethiopic, four in Arabic, one in Vai, and one in Coptic. (See the map visualizing this reach, underscoring how many micro‑languages and dialect clusters are included.)
- State‑of‑the‑art accuracy: Trained on a manually curated, multi‑domain corpus of ~2.5 million sentences, the best AfroLID model (a 200 M‑parameter Transformer with BPE vocabulary) scores 95.9 macro‑F1 and 96.0 % accuracy on a blind 51 k‑sentence test set. Over 80% of the languages achieve ≥ 95 F1, and 128 languages score a perfect 100 F1. In head‑to‑head comparisons with widely‑used tools (CLD2, CLD3, Franc, LangDetect, langid.py) AfroLID wins on “nearly every language”, often by double‑digit margins, and retains high accuracy on out‑of‑domain Twitter data — evidence that it generalizes beyond its training genres.
- Release and downstream potential: We released AfroLID publicly on GitHub, making it immediately usable for data cleaning, corpus mining, or live language detection in multilingual platforms. By reliably separating African languages in noisy web text, researchers can now build higher‑quality translation systems, sentiment analyzers, and speech tools without expensive manual filtering — lowering the entry barrier for inclusive NLP across the continent.
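To make the downstream use concrete, here is a minimal, hedged sketch of corpus filtering with a language identifier. Both the checkpoint id and the label scheme (ISO 639-3 codes) are assumptions for illustration; the released interface is documented in the AfroLID GitHub repository.

```python
# Sketch: filtering a noisy corpus by language with a text-classification
# pipeline. MODEL_ID is a placeholder -- consult the AfroLID repository for
# the released checkpoint name -- and the label scheme (ISO 639-3 codes)
# is an assumption for illustration.
from transformers import pipeline

MODEL_ID = "UBC-NLP/afrolid"  # placeholder id, not verified

identify = pipeline("text-classification", model=MODEL_ID)

def keep_language(lines, target_iso, min_score=0.9):
    """Yield only lines the identifier confidently assigns to target_iso."""
    for line in lines:
        pred = identify(line)[0]
        if pred["label"] == target_iso and pred["score"] >= min_score:
            yield line

noisy = [
    "Mo fẹ́ jẹun báyìí.",        # Yoruba
    "Ninataka kula sasa.",       # Swahili
    "Random English sentence.",  # out-of-scope noise
]
print(list(keep_language(noisy, "yor")))
```

A confidence threshold like `min_score` is the usual lever here: raising it trades corpus size for purity, which matters when the filtered text feeds a downstream translation or sentiment model.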
AfroLID opens new possibilities, enabling researchers to effectively identify and focus on specific African languages or language groups within datasets. This facilitates higher-quality data collection and more precise linguistic analysis. From crowd-sourced translations to local-language sentiment analysis, AfroLID provides an essential foundation for developing meaningful multilingual technologies tailored specifically for African communities.
Despite AfroLID’s leap forward, several challenges remain. Its current release deliberately omits high‑resource languages like English, French, and Portuguese, limiting usefulness in mixed‑language settings (though adding them is straightforward). Code‑mixing and creole varieties still trip the system — for example, it confuses Cameroonian and Nigerian Pidgin or closely related Portuguese‑based creoles. Dialect clustering is another hurdle: roughly 70% of errors for South African languages arise from misclassifying near‑neighbors that share vocabulary and spelling conventions. Real‑world robustness is also uneven: on a million “undefined” tweets, AfroLID excelled overall yet struggled with heavy code‑switching, scoring only 12% for Nigerian Pidgin and 44% for Yorùbá. Finally, limited native‑speaker validation leaves some predictions unchecked, raising questions about hidden biases in data and models. Next‑generation AfroLID will need foreign‑language classes, explicit code‑switch handling, richer dialect labels, broader domain sampling, and deeper community review to fully realize its promise.
SERENGETI: Foundation Models for 517 African Languages
Paper: “SERENGETI: Massively Multilingual Language Models for Africa”. ACL 2023 Findings, link.
Having tackled language identification, we turned to the creation of foundation models that could process African languages effectively. The question was: Could we train large-scale multilingual language models if we carefully curated enough data for hundreds of African languages?
In SERENGETI, we demonstrated that the answer is yes. We developed SERENGETI, a suite of multilingual models covering 517 African languages and language varieties. SERENGETI performed strongly across eight task families, from part-of-speech tagging and named entity recognition to sentiment analysis and question answering. Overall, we evaluated the models on 20 datasets spanning those eight task families and 32 languages with gold labels. Our best SERENGETI model sets a new state of the art, outperforming XLM‑R, mBERT, AfriBERTa, and Afro‑XLM‑R with a higher average F1 across the 20 datasets.
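For readers who want to poke at such a model, below is a minimal sketch in the Hugging Face idiom. The checkpoint id is a placeholder for whichever SERENGETI variant is released (see the UBC-NLP organization on the Hugging Face Hub), and the mask token is read off the tokenizer rather than assumed.

```python
# Sketch: probing a SERENGETI-style masked language model with the
# fill-mask pipeline. MODEL_ID is a placeholder -- the released variant
# names are listed by UBC-NLP -- and the mask token is queried from the
# tokenizer, since it differs across architectures.
from transformers import pipeline

MODEL_ID = "UBC-NLP/serengeti"  # placeholder id, not verified

fill = pipeline("fill-mask", model=MODEL_ID)
mask = fill.tokenizer.mask_token  # e.g. "<mask>" or "[MASK]", depending on variant

# A Swahili prompt: "I want to ___ now."
for cand in fill(f"Ninataka {mask} sasa."):
    print(f'{cand["token_str"]:>12}  {cand["score"]:.3f}')
```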
In addition, we exploited SERENGETI to finetune a more powerful version of AfroLID covering all 517 languages, and again observed better performance. We released this new African language identification tool, dubbing it AfroLID-v1.5 (link).
Crucially, this work also shed light on broader scientific questions and allowed us to derive a host of insights including the following:
- Genealogical clusters and language contact: We found that zero-shot transfer worked particularly well when languages were closely related or historically in frequent contact. That is, as we show in Section 6.4 of the paper, SERENGETI’s zero‑shot gains are not uniform. For example, South African languages that are genealogically close (e.g., Zulu–Xhosa or Northern Sotho–Sesotho) or in long‑standing contact share more subword inventory and lexical overlap, which we quantify via Jaccard similarity of the pre‑training corpora (Table 8). These close pairs obtain the largest zero‑shot boosts; a minimal sketch of this overlap measurement follows this list.
- Data quality: Manually curating texts from religious sources, government gazettes, and media outlets significantly boosts performance compared to naive web crawls.
- Community collaboration: Throughout model building, we collaborated with local linguists and African NLP communities, reinforcing that knowledge of the languages themselves was as vital as raw computing power.
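Here is the promised sketch of the overlap measurement. It computes Jaccard similarity over plain word vocabularies to stay self-contained; the paper's Table 8 uses subword vocabularies from the pre-training corpora.

```python
# Sketch: Jaccard similarity between the vocabularies of two corpora,
# as a proxy for lexical overlap between related languages. The paper
# computes this over subword vocabularies; whitespace tokens are used
# here to keep the example self-contained.
def vocab(corpus):
    """Collect the set of lowercased whitespace tokens in a corpus."""
    return {tok.lower() for line in corpus for tok in line.split()}

def jaccard(corpus_a, corpus_b):
    """Intersection-over-union of the two vocabularies, in [0, 1]."""
    a, b = vocab(corpus_a), vocab(corpus_b)
    return len(a & b) / len(a | b)

# Tiny toy corpora for two closely related languages.
zulu = ["umuntu ngumuntu ngabantu", "ngiyabonga kakhulu"]
xhosa = ["umntu ngumntu ngabantu", "enkosi kakhulu"]
print(f"Jaccard(zul, xho) = {jaccard(zulu, xhosa):.2f}")
```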
With SERENGETI available to the public, we aimed to inspire open research. Our hope was that these models — and the lessons learned — would accelerate the pace of African NLP, fostering new datasets, tasks, and solutions that could make a difference on the ground.
Accelerating African Language Generation
Paper: “Cheetah: Natural Language Generation for 517 African Languages”. ACL 2024, link.
When SERENGETI proved that classification‑style tasks could scale to hundreds of African languages, the natural next step was generation: summarising local news, paraphrasing health instructions, drafting stories, even translating between African tongues. That is the promise of Cheetah, our first encoder‑decoder foundation model for the continent.
Why another model?
Natural‑language generation (NLG) remains the hardest frontier for low‑resource languages: it demands far richer signal than classification, and most African languages have scarce, noisy, or non‑standardized text. Building atop SERENGETI’s approach, we added new, carefully curated data — this time focusing on generation-specific applications. The result was a family of models. A few observations:
- Purpose‑built architecture: a 12‑layer T5‑style encoder‑decoder with ~580 M parameters; small enough for academic labs yet large enough for cross‑lingual transfer.
- High‑fidelity corpus: a 42 GB, expert‑curated dataset drawn from news, health, legal, religious, and social‑media sources. We retained tone diacritics, vowel‑ and consonant‑harmony marks, reduplication, and other script‑specific cues — details usually lost in web crawls — so the data faithfully reflects the phonological and orthographic richness of African languages (see the normalization sketch after this list).
- Broad coverage: over 500 African languages, from widely spoken ones like Hausa, Swahili, and Amharic to under-served languages with only a handful of textual resources. Cheetah languages span 14 families, written in six scripts and spoken in 50 of Africa’s 54 countries.
- Robust generation: strong results in text summarization and paraphrasing tasks, often rivaling or surpassing general multilingual models that barely included African data.
- Human validation: we went beyond automatic metrics by asking native speakers of Hausa, Swahili, and Yorùbá to score Cheetah’s outputs for faithfulness (semantic accuracy) and fluency (naturalness) on a carefully designed test set that probes tricky phenomena such as negation, gender agreement, and polysemous verbs. Their ratings confirmed Cheetah’s edge over mT5, mT0, and AfriTeVa across most categories, yet also surfaced nuanced issues — e.g., occasional tonal‑mark omissions in Yorùbá and sporadic gender‑inflection slips in Swahili — that now guide our next round of refinements.
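Here is the normalization sketch referenced above: a generic illustration (not Cheetah's actual pipeline) of how aggressive "cleaning" destroys the very cues we preserved, contrasted with canonical Unicode normalization that keeps them.

```python
# Sketch: why naive text "cleaning" is destructive for tonal orthographies.
# Folding to base letters strips the combining marks that carry tone and
# vowel quality in Yoruba; NFC normalization canonicalizes the text while
# keeping them. A generic illustration, not Cheetah's actual preprocessing.
import unicodedata

text = "Ọjọ́ dára púpọ̀"  # a Yoruba phrase carrying tone marks and underdots

# Destructive: decompose, then drop every combining mark.
folded = "".join(
    ch for ch in unicodedata.normalize("NFD", text)
    if not unicodedata.combining(ch)
)

# Preserving: canonical composition keeps the diacritics intact.
nfc = unicodedata.normalize("NFC", text)

print("folded:", folded)  # "Ojo dara pupo" -- tone marks and underdots lost
print("nfc:   ", nfc)     # diacritics preserved
```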
Cheetah thus served as a milestone for exploring how generative AI could speak African languages naturally — a critical step toward building chat assistants, storytelling platforms, or localized educational content. We saw immediate possibilities: generating local-language news bulletins, summarizing health advice in rural communities, and bridging communication gaps between educators and students.
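For a sense of how such a model is driven in practice, the sketch below prompts a Cheetah-style encoder-decoder through the Hugging Face API. The checkpoint id is a placeholder, and the task prefix is an assumption for illustration rather than Cheetah's documented prompt format.

```python
# Sketch: prompting a Cheetah-style T5 encoder-decoder for generation.
# MODEL_ID is a placeholder -- check the UBC-NLP release for actual names --
# and the "paraphrase:" prefix is an assumed convention, not Cheetah's
# documented format.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "UBC-NLP/cheetah-base"  # placeholder id, not verified

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

prompt = "paraphrase: Ninataka kula sasa."  # hypothetical task prefix
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=48, num_beams=4)
print(tok.decode(out[0], skip_special_tokens=True))
```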
Weaving Africa’s Linguistic Fabric with Many-to-Many Machine Translation
Paper: “Toucan: Many-to-Many Translation for 150 African Language Pairs”. ACL 2024 Findings, link.
The final piece of our journey focused on many-to-many machine translation, a fundamental requirement for bridging Africa’s linguistic tapestry. While some progress had been made on, say, English–Swahili or English–Hausa, the continent’s languages cross many genealogical and political boundaries — and remain starved of cross-translation tools.
With Toucan, we used two newly developed “Cheetah” models (1.2B and 3.7B parameters) as a base, then further finetuned them into a many-to-many translator covering 156 language pairs — spanning 43 African languages plus Arabic, English, and French (which are widely used in Africa). This work also contributed the following:
- Rigorous evaluation at scale: To measure progress credibly, we built AfroLingu‑MT, the first many‑to‑many benchmark for the continent: 156 translation directions that pair 43 Indigenous African languages with Arabic, English, and French. Each split contains professionally aligned parallel data and balanced topical coverage, giving researchers a realistic stress‑test for cross‑African MT.
- State‑of‑the‑art results: Across the full benchmark, Toucan’s sequence‑to‑sequence models (1.2 B and 3.7 B parameters) post spBLEU1K scores up to ≈ 23 on test, eclipsing mT5, mT0, Afri‑mT5, AfriTeVa, and NLLB‑200 by sizeable margins; on the 59 language pairs shared with NLLB‑200‑1.3B, Toucan is ahead by +6.9 spBLEU1K and +8.0 AfriCOMET on average. These gains hold in zero‑shot settings and widen after full fine‑tuning, confirming that wide African coverage and data quality beat brute parameter count alone.
- spBLEU1K, a fairer metric for low‑resource scripts: Traditional BLEU and even spBLEU cover only a subset of African orthographies. We therefore released spBLEU1K, a SentencePiece‑based scorer trained on monolingual text in 1,003 languages (614 African). It neutralizes tokenization artifacts that distort scores for agglutinative or non‑Latin scripts and correlates better with AfriCOMET than legacy metrics. By standardizing evaluation across the continent’s writing systems, spBLEU1K lets the community compare models on equal footing — no matter which African language they target (a scoring sketch follows this list).
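Here is the promised scoring sketch. It uses sacrebleu's stock SentencePiece tokenizer (flores200) to illustrate the workflow; spBLEU1K itself would swap in the released 1,003-language SentencePiece model, so treat this as an approximation rather than the released scorer.

```python
# Sketch: SentencePiece-tokenized BLEU with sacrebleu. "flores200" is a
# stock spm tokenizer that only approximates spBLEU1K, which is trained
# on monolingual text in 1,003 languages (614 African).
import sacrebleu

hypotheses = ["Ninataka kula sasa."]
references = [["Nataka kula sasa hivi."]]  # one reference corpus, aligned to hypotheses

score = sacrebleu.corpus_bleu(
    hypotheses,
    references,
    tokenize="flores200",  # spm tokenization; fairer to non-Latin, agglutinative scripts
)
print(f"spBLEU (flores200) = {score.score:.1f}")
```

The point of scoring on subword pieces rather than raw words is that languages with rich morphology or non-Latin scripts are not penalized by tokenization mismatches, which is exactly the distortion spBLEU1K was built to remove.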
By providing a richly evaluated many-to-many translation model, Toucan holds immediate potential. From rural healthcare clinics needing vital information in local tongues to cross-border e-commerce and cultural exchanges, we see Toucan as a step toward uniting Africa’s languages on the global stage and facilitating communication within and beyond the continent.
Reflections and Future Directions
Our journey has shown the transformative power of combining a long-term vision, which I always strive to develop and maintain, with deep linguistic expertise, meaningful community partnerships, and carefully curated data. Reflecting on this journey, I have come to appreciate the importance of ambitious yet patient innovation — high-impact outcomes demand dedication and resilience. Equally clear is the understanding that technology thrives most when it aligns closely with community values and aspirations. Working in silos limits potential; true innovation emerges when we build together in diverse, interconnected teams and foster strong, ongoing relationships with the wider community. Genuine linguistic and cultural understanding has repeatedly proven essential, as technical advances are hollow without the nuanced insight that explains why languages behave the way they do.
High-impact outcomes demand dedication. Technology thrives when it aligns with community goals. Working in silos limits potential; true innovation is a function of building together.
Moreover, ethical AI and linguistic inclusivity lie at the heart of our mission. Our work demonstrates how careful design tailored to specific regions not only addresses local needs but also informs broader conversations on responsible data collection and equitable multilingual representation. It emphasizes the necessity of proactively identifying and mitigating biases, evaluating models through culturally aware lenses, and maintaining open, respectful dialogue about language ownership, cultural heritage, and data sovereignty.
Yet, our journey is far from complete, and our ambitions extend far beyond what we have already achieved. From initial successes in language identification, to breakthroughs in multilingual understanding, generation, and machine translation, we remain deeply conscious of the vast ground still to cover. Beyond the 517 languages we have already supported, hundreds more — many critically endangered — still lack meaningful digital presence and support. Our ultimate goal is to cultivate a vibrant, sustainable ecosystem of tools, datasets, benchmarks, and educational resources that comprehensively represent the richness and complexity of Africa’s linguistic heritage.
At a personal level, this vision is especially meaningful because language technology directly impacts people’s lives — empowering communities by enabling access to education, fostering cultural exchange, and bridging communication gaps during crises such as public health emergencies. We hope to inspire greater international and intra-African collaboration, ensuring that each African language can flourish digitally, leaving no voice unheard or forgotten.
Concluding Note
The journey from Towards Afrocentric NLP to Toucan has been both technically demanding and deeply rewarding. Each milestone has brought us closer to a vision where African languages are not merely peripheral additions to existing NLP paradigms, but rather vibrant centers of innovation, creativity, and cultural relevance. As we look ahead, we remain dedicated to open-source collaboration, continuous improvement, and strengthening our partnerships with local communities. My recent travels and enriching conversations with colleagues and organizations around the globe have reinforced my belief in the shared commitment to this vision. By continuing to build with Africa, we seek to establish a new paradigm for AI — one that authentically celebrates and elevates linguistic diversity everywhere.
Thank you for joining our journey! To learn more, contribute, or collaborate, please feel free to reach out. You can also directly explore the models from each of our publications. The future of African NLP is collaborative — let’s build it together.