Workshop

2nd Workshop on Practical ML for Developing Countries: Learning Under Limited/low Resource Scenarios

Esube Bekele, Waheeda Saib, Timnit Gebru, Meareg Hailemariam, Vukosi Marivate, Judy Gichoya

Abstract:

The constant progress being made in artificial intelligence needs to extend across borders if we are to democratize AI in developing countries. Adapting state-of-the-art (SOTA) methods to resource-constrained environments such as developing countries is challenging in practice. Recent breakthroughs in natural language processing (NLP), for instance, rely on increasingly complex and large models (e.g. most transformer-based models such as BERT, ViLBERT, ALBERT, and GPT-2) that are pre-trained on large corpora of unlabeled data. In most developing countries, low or limited resources mean a hard path towards adoption of these breakthroughs. Methods such as transfer learning will not fully solve the problem either, because of bias in pre-training datasets that do not reflect real test cases in developing countries and because of the prohibitive cost of fine-tuning these large models. Recent progress with a focus on ML for social good has the potential to alleviate the problem in part. However, the themes in such workshops are usually application driven, such as ML for healthcare and for education, and less attention is given to the practical aspects of implementing these solutions in the low or limited resource scenarios of developing countries. This, in turn, hinders the democratization of AI in developing countries. We therefore aim to fill the gap by bringing together researchers, experts, policy makers and related stakeholders under the umbrella of practical ML for developing countries. The workshop is geared towards fostering collaborations and soliciting submissions under the broader theme of practical aspects of implementing machine learning (ML) solutions for problems in developing countries. We specifically encourage contributions that highlight challenges of learning under limited or low resource environments that are typical in developing countries.

Schedule

Fri 7:00 a.m. - 7:10 a.m.
Welcome Remarks
Fri 7:10 a.m. - 7:45 a.m.
AI in Emerging Markets: Governance, Infrastructure, Language (Invited Talk)
Mark Weber
Fri 7:45 a.m. - 8:00 a.m.
Zero-shot spoken language understanding for English-Hindi: An easy victory against word order divergence (Contributed Talk)   
Judith Gaspers
Fri 8:00 a.m. - 8:15 a.m.
Automated Detection of Food Water-Borne Parasites in Low Cost Smartphone Microscope Image (Contributed Talk)   
Bishesh Khanal
Fri 8:15 a.m. - 9:18 a.m.
Poster Session (Poster Session + Coffee Break)
Fri 8:15 a.m. - 8:22 a.m.
  

Drawing inferences from a spermatozoon (sperm cell) image based on its morphology is ubiquitous, challenging, and of substantial practical interest. In the present study, we endeavour to deconstruct and demonstrate a framework to distinguish between two binary classes, namely 'Good' (Fertile) and 'Bad' (Infertile) sperm cell images. We have selected the DenseNet121 architecture to train our model for this task, the reason for which is examined in Section 2.3. Furthermore, a Conditional Deep Convolutional Generative Adversarial Network (cDCGAN) was used to tackle the minority class imbalance problem, which was heavily prominent in the dataset chosen for this task, as seen in Section 2.2. We have hand-picked numerous statistical inferential tests and metrics to validate our model and to accentuate the reliability of the obtained results, finally formulating and delineating a table based on the respective 'Quality Scores' of the test samples provided. With the cDCGAN training data augmentation, the test-set accuracy was recorded to be 86.2%, while the model without cDCGAN scored only 24.3%.

Dipam Paul
Fri 8:22 a.m. - 8:29 a.m.
  

One of the most serious public health problems in Peru and worldwide is Tuberculosis (TB), which is caused by a bacterium known as Mycobacterium tuberculosis. The purpose of this work is to facilitate and automate the diagnosis of tuberculosis using the MODS method and lens-free microscopy, which is easier to calibrate and easier for untrained personnel to use than lens microscopy. To this end, we employed a U-Net network on our collected data set to perform automatic segmentation of cord-shaped bacterial accumulation and then predict tuberculosis. Our results show promising evidence for automatic segmentation of TB cords, and thus good accuracy for TB prediction.

Dennis Hernando Núñez Fernández
Fri 8:29 a.m. - 8:36 a.m.
  

The new coronavirus 2019 (COVID-2019), which first appeared in the city of Wuhan in China in December 2019, quickly spread around the world and became a pandemic. It has had a devastating effect on daily life, public health and the global economy. It is critical to detect positive cases as early as possible to prevent the further spread of this epidemic and to treat affected patients quickly. The need for auxiliary diagnostic tools has increased as accurate automated tool kits are not available. This paper presents a work in progress that proposes the analysis of images from ultrasound scans using a convolutional neural network; the system will be implemented on a Raspberry Pi.

Dennis Hernando Núñez Fernández
Fri 8:36 a.m. - 8:43 a.m.
  

Predictive models have become increasingly ubiquitous in our society. However, concern has been expressed about their ability to perpetuate inequality amongst subpopulations. Active feature-value acquisition has been suggested as a method of promoting both individual and group notions of fairness in a predictive model. In this work, we seek to use such an active framework to create a predictive socioeconomic model. At the same time, satellite imagery has been utilized as a method of socioeconomic estimation. Our goal is to integrate satellite imagery with an active framework to create a fair predictive socioeconomic model. This was tested on one real-world dataset. Results indicate an increase in accuracy resulting from the aggregation of the satellite imagery.

Kush R Varshney
Fri 8:43 a.m. - 8:50 a.m.
  

Distant supervision allows obtaining labeled training corpora for low-resource settings where only limited hand-annotated data exists. However, to be used effectively, the distant supervision must be easy to gather. In this work, we present ANEA, a tool to automatically annotate named entities in text based on entity lists. It spans the whole pipeline from obtaining the lists to analyzing the errors of the distant supervision. A tuning step allows users to improve the automatic annotation with their linguistic insights without labelling or checking all tokens manually. In six low-resource scenarios, we show that the F1-score can be increased by 18 points on average through distantly supervised data obtained by ANEA.

Michael Hedderich
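
The entity-list annotation idea in the abstract above can be illustrated with a minimal sketch. This is not ANEA's implementation or API: the `ENTITY_LISTS` contents, the `annotate` function, and the BIO label scheme are all assumptions chosen for illustration, and the matching is a plain longest-match lookup over whitespace tokens.

```python
from typing import Dict, List, Tuple

# Hypothetical entity lists. The abstract does not say where ANEA's lists
# come from or what they contain; these entries are made up.
ENTITY_LISTS: Dict[str, List[str]] = {
    "PER": ["Ada Lovelace"],
    "LOC": ["Lima", "Addis Ababa"],
}


def annotate(tokens: List[str],
             entity_lists: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Label tokens with BIO tags by longest match against the entity lists."""
    # Map each entity name (as a tuple of tokens) to its label.
    lookup = {
        tuple(name.split()): label
        for label, names in entity_lists.items()
        for name in names
    }
    max_len = max((len(key) for key in lookup), default=0)

    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = lookup.get(tuple(tokens[i:i + n]))
            if label is not None:
                labels[i] = "B-" + label
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return list(zip(tokens, labels))


if __name__ == "__main__":
    sentence = "Ada Lovelace visited Addis Ababa and Lima".split()
    for token, label in annotate(sentence, ENTITY_LISTS):
        print(token, label)
```

Exact matching of this kind misses inflected or misspelled names and mislabels ambiguous ones, which is the sort of error the tuning and error-analysis steps mentioned in the abstract would need to address.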
Fri 8:50 a.m. - 8:57 a.m.
  

Democratizing access to Artificial Intelligence and truly utilizing it for the common good requires multi-stakeholder AI competitions that focus on real and prevalent problems with the potential for large-scale impact, and that promote and ensure the explainability, reproducibility, contextualization and incremental enhancement of solutions. We propose a solution documentation and problem documentation template for AI/Machine Learning competitions that ensures the identification and systematic characterization of prevalent problems and the documentation of developed solutions in such a way that they can be easily utilized by anyone anywhere.

Olubayo Hamzat
Fri 8:57 a.m. - 9:04 a.m.
  

In this work, we propose a data-driven scheme to initialize the parameters of a deep neural network. This is in contrast to traditional approaches, which randomly initialize parameters by sampling from transformed standard distributions. Such methods do not use the training data to produce a more informed initialization. Our method uses a sequential layer-wise approach where each layer is initialized using its input activations. The initialization is cast as an optimization problem where we minimize a combination of encoding and decoding losses of the input activations, which is further constrained by a user-defined latent code. The optimization problem is then restructured into the well-known Sylvester equation, which has fast and efficient gradient-free solutions. Our data-driven method achieves a boost in performance compared to random initialization methods, both before the start of training and after training is over. We show that our proposed method is especially effective in few-shot and fine-tuning settings. We conclude this paper with analyses of time complexity and of the effect of different latent codes on recognition performance.

Buna Das
Fri 9:04 a.m. - 9:11 a.m.
  

In this work, we propose a method blending representation learning and molecular docking to predict protein-ligand interaction, a key building block of drug repurposing and discovery. Using Leishmaniasis as a case study, we analyze the speed-accuracy trade-off that representation learning methods provide when compared to more computationally intensive molecular docking methods. We find that while deep learning methods reduce the screening burden for molecular docking by a factor of 600, they cannot be trusted to find the top ligands binding to a given target. This suggests that current deep learning methods can be used to come up with a short list of the most promising ligands, but the final predictions should rely on molecular docking.

Hassan Kane
Fri 9:11 a.m. - 9:18 a.m.
  

For easier communication, posting, or commenting on each other's posts, people use their dialects. In Africa, various languages and dialects exist. However, they are still underrepresented and not fully exploited for analytical studies and research purposes. Approaches such as Machine Learning and Deep Learning require datasets. One of the African languages is Bambara, used by citizens in different countries. However, no previous work has produced a dataset for Sentiment Analysis in this language. In this paper, we present the first common-crawl-based Bambara dialectal dataset dedicated to Sentiment Analysis, freely available for Natural Language Processing research purposes.

Chayma Fourati
Fri 9:25 a.m. - 10:00 a.m.
Optimizing for Human-centric AI in the Global South (Invited Talk)   
Chinasa Okolo
Fri 10:00 a.m. - 10:15 a.m.
Efficient Click-Through Rate Prediction for Developing Countries via Tabular Learning (Contributed Talk)   
Buru Chang, Joonyoung Yi
Fri 10:20 a.m. - 11:20 a.m.
Research and Challenges of ML/AI against COVID-19 and Climate Change in the context of Developing Countries (Panel Discussions)
Fri 11:20 a.m. - 11:40 a.m.
Closing Remarks