Africa has over 2,000 languages, yet these languages are among the least represented in NLP research. The rise of machine learning community efforts on the African continent has led to a vibrant NLP community. This interest is manifesting in national, regional, continental and even global collaborative efforts focused on African languages, African corpora, and tasks of importance to the African context. Since its start in 2020, the AfricaNLP workshop has become a core event for the African NLP community, allowing the community to convene, showcase work and share experiences; many of the participants are active members of the Masakhane grassroots NLP community. Through the mentorship programme, many first-time authors found collaborators and published their first paper, and those mentorship relationships built trust and coherence within the community that continue to this day. We aim to continue this. The workshop has also enabled large-scale collaborative works by participants who joined through it, such as MasakhaNER (61 authors), Quality Assessment of Multilingual Datasets (51 authors), Corpora Building for Twi (25 authors) and NLP for Ghanaian Languages (25 authors). This workshop follows the previous successful editions in 2020 and 2021, co-located with ICLR and EACL respectively.
Fri 2:00 a.m. - 2:05 a.m. | Opening Remarks
Fri 2:05 a.m. - 2:50 a.m. | Morning Keynote: Convenience, Random or Purposive Sampling: African Languages and Global NLP - Túndé Adégbọlá (invited talk)
Fri 2:50 a.m. - 3:00 a.m. | Q&A for Morning Keynote (Q&A)
Fri 3:00 a.m. - 3:30 a.m. | Invited Talk: Low-resource natural language processing - Cristina España-Bonet (invited talk)
Fri 3:30 a.m. - 3:45 a.m. | Q&A for Invited Talk (Q&A)
Fri 3:45 a.m. - 3:55 a.m. | Spotlight Talk 1: Analysing the effects of transfer learning on low-resourced named entity recognition performance (Spotlight)
Transfer learning has led to large gains in performance for nearly all NLP tasks while making downstream models easier and faster to train. This has also been extended to low-resourced languages, with some success. We investigate the properties of transfer learning between 10 low-resourced languages from the perspective of a named entity recognition task: specifically, how much adaptive fine-tuning improves performance, the efficacy of zero-shot transfer, and the effect of learning on the contextual embeddings computed by the model. Our results give some insight into zero-shot performance as well as the impact of different training schemes and of data overlap between the training and testing languages. In particular, we find that models with the best generalisation to other languages suffer in individual language performance, while models that perform well on a single language often do so at the expense of generalising to others.
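The adaptive fine-tuning evaluated here amounts to continuing masked-language-model pretraining on unlabelled target-language text before the NER fine-tuning step. Below is a minimal sketch of that idea using Hugging Face Transformers; the xlm-roberta-base checkpoint and the target_lang.txt file name are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of adaptive fine-tuning, assuming Hugging Face Transformers,
# the xlm-roberta-base checkpoint, and unlabelled target-language text in
# target_lang.txt; all illustrative choices, not the paper's exact setup.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Unlabelled monolingual text in the target language, one sentence per line.
raw = load_dataset("text", data_files={"train": "target_lang.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Continue masked-language-model pretraining on the target-language text.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="adapted-xlmr", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()

# The adapted checkpoint is then fine-tuned on labelled NER data as usual.
model.save_pretrained("adapted-xlmr")
```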
Fri 3:55 a.m. - 4:05 a.m. | Spotlight Talk 2: Building Text and Speech Datasets for Low Resourced Languages: A Case of Languages in East Africa (Spotlight)
Africa has over 2,000 languages; however, these languages are not well represented in the existing Natural Language Processing ecosystem. African languages lack the essential digital resources needed to engage effectively with advancing language technologies. This growing gap has attracted researchers to build resources for African languages so that the various Natural Language Processing methods can be transferred to them. This paper discusses the process we took to create, curate and annotate text and speech datasets for low-resourced languages in East Africa. It focuses on five languages: four of them (Luganda, Runyankore-Rukiga, Acholi, and Lumasaaba) are spoken mainly in Uganda, while the fifth, Kiswahili, is widely spoken across East Africa. We ran baseline machine translation models on the English-Luganda parallel text corpus and Automatic Speech Recognition (ASR) models on the Luganda speech dataset. We recorded a BiLingual Evaluation Understudy (BLEU) score of 37 for the English-Luganda model and a BLEU score of 36.8 for the Luganda-English model. For the ASR experiments, we obtained a Word Error Rate (WER) of 33%.
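For readers unfamiliar with the reported metrics: BLEU measures n-gram overlap between system translations and references, and WER is the word-level edit distance of an ASR transcript against a reference. A hedged sketch of how such scores are typically computed with the sacrebleu and jiwer libraries; the example strings are illustrative, not the paper's data.

```python
# Hedged sketch of how the reported metrics are computed with sacrebleu and
# jiwer; the example strings are illustrative, not the paper's data.
import sacrebleu       # pip install sacrebleu
from jiwer import wer  # pip install jiwer

# Corpus-level BLEU: modified n-gram precision between system translations
# and references (here a single sentence with one reference stream).
hypotheses = ["the farmer went to the market"]
references = [["the farmer went to the market"]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # the paper reports 37 for English-Luganda

# WER: word-level edit distance between reference and ASR hypothesis.
error = wer("the farmer went to the market",  # reference transcript
            "the farmer went to market")      # ASR output
print(f"WER = {100 * error:.0f}%")            # the paper reports 33%
```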
Fri 4:05 a.m. - 4:50 a.m. | Poster Session (Gather Town)
Fri 4:50 a.m. - 5:30 a.m. | Social + Break
Fri 5:30 a.m. - 6:15 a.m. | Afternoon Keynote - Joyce Nakatumba-Nabende (invited talk)
Fri 6:15 a.m. - 6:30 a.m. | Q&A for Afternoon Keynote (Q&A)
Fri 6:30 a.m. - 6:40 a.m. | Spotlight Talk 3: Participatory Translations of Oshiwambo: Towards Sustainable Culture Preservation with Language Technology (paper presentation)
In this paper, we describe a participatory, collaborative, and cost-effective process for creating translations in Oshiwambo, the most widely spoken African language in Namibia. We aim to (1) build a resource for language technology development, (2) bridge generational gaps in cultural and language knowledge, and at the same time (3) provide socio-economic opportunities through language preservation. The created data spans diverse topics of cultural importance and comprises over 5,000 sentences written in the Oshindonga dialect and translated to English, the largest parallel corpus for Oshiwambo to date. We show that it is very effective for machine translation, especially when combined with transfer learning. In the interest of reproducibility, we publicly release our source code and models.
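A minimal sketch of what "machine translation combined with transfer learning" can look like in practice: warm-starting from a pretrained multilingual sequence-to-sequence checkpoint and fine-tuning it on the new parallel corpus. The google/mt5-small checkpoint and the placeholder sentence pairs are assumptions for illustration, not the authors' released setup.

```python
# A minimal sketch of transfer learning for MT: warm-start from a pretrained
# multilingual seq2seq checkpoint and fine-tune on the new parallel corpus.
# The google/mt5-small checkpoint and placeholder sentence pairs are
# assumptions for illustration, not the authors' released setup.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Replace the placeholders with aligned Oshindonga-English sentence pairs.
pairs = Dataset.from_dict({"src": ["<oshindonga sentence>"],
                           "tgt": ["<english translation>"]})

def encode(batch):
    # Tokenize source sentences; tokenize targets as labels.
    enc = tokenizer(batch["src"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["tgt"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

train = pairs.map(encode, batched=True, remove_columns=["src", "tgt"])
args = Seq2SeqTrainingArguments(output_dir="oshiwambo-mt", num_train_epochs=5,
                                per_device_train_batch_size=8)
Seq2SeqTrainer(model=model, args=args, train_dataset=train,
               data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)
               ).train()
```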
Fri 6:40 a.m. - 6:50 a.m. | Spotlight Talk 4: Machine Translation For African Languages: Community Creation Of Datasets And Models In Uganda (paper presentation)
Reliable machine translation systems are only available for a small proportion of the world's languages, the key limitation being a shortage of training and evaluation data. We provide a case study in the creation of such resources by NLP teams who are local to the communities in which these languages are spoken. A parallel text corpus, SALT, was created for five Ugandan languages (Luganda, Runyankole, Acholi, Lugbara and Ateso), and various methods were explored to train and evaluate translation models. The resulting models were found to be effective for practical translation applications, even for those languages with no previous NLP data available, achieving a mean BLEU score of 26.2 for translations to English and 19.9 from English. The SALT dataset and models described are publicly available at https://github.com/SunbirdAI/salt.
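Once trained, such models can be queried like any Hugging Face sequence-to-sequence checkpoint. A hypothetical usage sketch follows; the path/to/salt-model directory is a placeholder rather than an official release name (see the repository above for the actual models).

```python
# Hypothetical usage sketch: querying a trained translation model through the
# standard Transformers API. path/to/salt-model is a placeholder, not an
# official release name; see the SALT repository for the actual models.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/salt-model")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/salt-model")

inputs = tokenizer("Farmers should plant their crops early.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```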
Fri 6:50 a.m. - 7:00 a.m. | Break
Fri 7:00 a.m. - 7:45 a.m. | Poster Session + Social (Gather Town)
Fri 7:45 a.m. - 8:45 a.m. | Panel: Abeba Birhane, Mona Diab, and Audace Niyonkuru - Moderated by Perez Ogayo (discussion panel)
Fri 8:45 a.m. - 9:00 a.m. | Break
Fri 9:00 a.m. - 9:45 a.m. | Invited Talk - Timnit Gebru (invited talk)
Fri 9:45 a.m. - 10:00 a.m. | Q&A for Invited Talk (Q&A)
Fri 10:00 a.m. - 11:00 a.m. | Tutorial and Q&A: Zero-resource speech technology with wav2vec - Michael Auli (tutorial)
Fri 11:00 a.m. - 11:25 a.m. | Invited Talk and Q&A: Measuring the Representativeness of NLP Datasets - Antonios Anastasopoulos (invited talk)
Fri 11:25 a.m. - 11:30 a.m. | Closing Remarks - David Adelani