Daire Crawford, PhD candidate, Faculty of Applied Science & Engineering

Supervised by Dionne Aleman, Faculty of Applied Science & Engineering, and Laura Rosella, Dalla Lana School of Public Health

Project Title: Empirical Evaluation of Data Re-Balancing Methods for Predicting COVID-19 Spread

Project Summary: Data imbalance is a phenomenon in which one data class (the minority class) occurs significantly less often than another (the majority class). Imbalance can occur in any dataset but is particularly common in healthcare data, as instances of disease tend to be rare relative to the general population. This imbalance poses a challenge in Machine Learning (ML) for disease transmission, as the lack of data in the minority class makes it difficult to accurately predict cases of disease.

Numerous data re-balancing methods exist, including random re-sampling, intelligent re-sampling, and cost-sensitive learning. However, most studies investigating the effectiveness of these methods use C4.5 decision tree models, due to their ease of implementation and interpretation, and few have investigated re-balancing for other ML model architectures. Therefore, little is known about how well different re-balancing methods work with other common ML models.
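
To make two of the method families named above concrete, the following is a minimal sketch of random over-sampling and of the class weights a cost-sensitive learner might use. It is illustrative only and is not the study's implementation: the function names, the toy labels (0 for uninfected, 1 for infected), and the inverse-frequency weighting formula are all assumptions chosen for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def balanced_class_weights(y):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count). An assumed heuristic."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Hypothetical toy data: 8 uninfected (0) and 2 infected (1) individuals.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
weights = balanced_class_weights(y)
```

Over-sampling changes the training data so that both classes are equally represented, whereas the class weights leave the data untouched and instead make errors on the minority class more costly during training.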

Using COVID-19 data generated from the Medical Operations Research Lab's Pandemic Outbreak Planner agent-based simulation, this study aims to provide insight into best practices for handling imbalance in healthcare data by comparing different re-balancing methods. These methods will be evaluated on their ability to improve model accuracy across different ML architectures.