Curated Big Data for Supervised Machine Learning: A Case of Malaria Indicator and Demographic Health Surveys Data

Abstract:

Quality data remains an indispensable asset for the data-driven models used to solve analytical problems with machine learning tools and techniques. However, raw data is often characterized by feature dependency, class imbalance, outliers, and missing values, among other issues, which frequently degrade the performance of machine learning models. Data curation is therefore a necessary step: it organizes the data and extracts the relevant features from big data so that machine learning models receive precise, reliable, and usable information. In malaria control and eradication programmes, precise evidence from quality data can inform accurate assessment of the impact of previous interventions, optimal allocation of resources within malaria-endemic households, formulation of informed policies, and decision-making, among others. In this paper, features extracted from the Malaria Indicator Survey (MIS) 2015 and 2021 and the Demographic Health Survey (DHS) 2018, spanning the six geopolitical zones of Nigeria, are combined and curated to support supervised machine learning. The objective is to provide high-quality data for machine learning models, supporting the generation of precise evidence for accurate decision-making and informed policy formulation to address the burden of malaria on households. The outcome of curation is a reduced feature space suitable for machine learning tasks.

Keywords: Data, Machine Learning, Curate, Feature Engineering, Malaria Endemic Households.