Prediction of Epidemic Disease Dynamics Using Machine Learning
Abhijit Gupta
Pune, Maharashtra
Reliable predictions of infectious disease dynamics can be valuable to public health organisations that plan interventions to decrease or prevent disease transmission. With the growth of big data in the healthcare and biomedical sectors, accurate analysis of such data can help in early disease detection and better patient care. With the huge computational power now at hand, it is very much viable to exploit 'big data' for predicting and managing an epidemic outbreak. Our idea is to analyse and determine the spread of epidemic diseases in villages and suburban areas, where healthcare might not be readily available. We will build a machine learning model that predicts epidemic disease dynamics and tells us where the next outbreak is most likely to occur.
Project status: Published/In Market
Intel Technologies: MKL, DAAL, Intel CPU
Overview / Usage
Given an area where an epidemic outbreak has occurred, our ML model should be able to identify areas prone to the next outbreak and the features that contribute significantly to the spread of the outbreak. Our approach also takes into consideration the geography, climate and population distribution of an affected area, as these are relevant features that subtly contribute to epidemic disease dynamics. Our model would benefit healthcare authorities by assisting them in taking appropriate action: ensuring that sufficient resources are available to meet the need and, if possible, curbing the occurrence of such epidemic diseases.
Topic of case study: 2015-2016 Zika virus epidemic
Why Zika?
- Initially, we wanted to study and build the model for the Nipah virus or the Ebola virus. However, owing to the paucity of available reported data and the short amount of time, we could not do so.
- The Zika Data Repository maintained by the Centers for Disease Control and Prevention (CDC) contains publicly available data for the Zika epidemic. It had enough data for building and testing our model.
- Epidemics of infectious disease are generally caused by several factors, including a change in the ecology of the host population, a change in the pathogen reservoir, or the introduction of an emerging pathogen to a host population.
- The feature vectors in our model are general enough to be adapted, with slight changes, to study any epidemic disease.
Methodology / Approach
Data Collection for designing feature vectors:
We have considered the following factors:
- Location and Date
- Population density of an area
- Weather data for up to two weeks prior to the epidemic (temperature, precipitation, wind, dew point)
- Economic profile (GDP, GDP PPP)
- Vector agent (Mosquitoes for Zika) population
- Proximity of an area to other populated regions
Data Sources:
- Zika Data Repository maintained by the Centers for Disease Control and Prevention (CDC), which contains publicly available data for the Zika epidemic (https://github.com/cdcepi/zika)
- Google Geolocation API for procuring the latitude and longitude of places associated with the outbreak
- Worldwide airport location data retrieved from Falling Rain
- Weather data scraped from Wunderground.com by nearest airport code
- Population density of different regions, extracted from a gridded map via NASA SEDAC (https://earthdata.nasa.gov/about/daacs/daac-sedac)
- Vector agent (Aedes albopictus, Aedes aegypti) occurrences from "The global compendium of Aedes aegypti and Ae. albopictus occurrence" (https://datadryad.org/resource/doi:10.5061/dryad.47v3c)
- GDP / GDP PPP data from the IMF World Economic Outlook
The evaluation outcome is the likelihood of an area having an outbreak.
All of the above data is scraped from the different sources via our web-scraping scripts. It is then cleaned and structured into a pandas DataFrame. The parent sources for the above data and the relevant program files for scraping are mentioned at the end.
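As a minimal sketch, assuming hypothetical file names and join keys (the actual scraped outputs and keys live in the repository scripts), the feature table might be assembled like this:

import pandas as pd

# Hypothetical file names; the actual scraped outputs live in the repository.
cases = pd.read_csv("zika_cases.csv")            # CDC Zika case reports (location, date, cases)
weather = pd.read_csv("weather.csv")             # Wunderground weather by nearest airport code
density = pd.read_csv("population_density.csv")  # SEDAC gridded population density

# Join the sources on shared keys to build one feature row per (location, date).
features = (cases
            .merge(weather, on=["location", "date"], how="left")
            .merge(density, on="location", how="left"))
features = features.dropna()  # drop rows with missing scraped values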
Preprocessing and adjusting class imbalance:
Data pre-processing involves the transformations applied to the data before feeding it to the algorithm. Since a few of the variables in the dataset are categorical, various techniques need to be applied to convert them to numerical variables. In particular, for Zika cases reported in the CDC database, a huge class imbalance became apparent during preliminary analysis. This was due in part to the fact that most locations did have outbreaks, and most of these outbreaks were ongoing (present at all dates) throughout the span of the available data.
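A minimal sketch of the categorical-to-numerical conversion, assuming one-hot encoding via pandas and hypothetical column names (the actual encoding choices are in the repository scripts):

import pandas as pd

# 'features' is the assembled DataFrame from the previous step.
# One-hot encode the categorical columns (hypothetical names shown);
# numerical columns pass through unchanged.
categorical_cols = ["country", "location_type"]  # assumed categorical features
features = pd.get_dummies(features, columns=categorical_cols)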
Feature Selection:
Datasets may contain irrelevant or redundant features that make the machine-learning model more complicated. In this step, we aim to remove the irrelevant features, which may increase run time, generate complex patterns, etc. The generated subset of features is used for further analysis. Feature selection can be done with either the Random Forest or the XGBoost algorithm. In our project, the XGBoost algorithm is used to select the features whose scores lie above a predefined threshold. Our findings agree with the literature on the Zika epidemic [1]: temperature, rainfall, proximity to mosquito breeding areas, population density and vicinity to other places with a large human population (measured via airport_dist_large) play a significant role in the spread of the epidemic.
Data Split:
The data is then split into train and test sets for further analysis: 70% of the data is used for training and 30% for testing. The StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0) function in scikit-learn is used for the split. Stratified splitting is required to handle the class imbalance between Zika and non-Zika cases: it maintains the ratio of positive and negative cases of the total sample in both the train and test sets. A sketch of both steps is given below.
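A minimal sketch of both steps, assuming a hypothetical label column (zika_cases) and an illustrative importance threshold; the actual threshold and column names are defined in the repository scripts:

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedShuffleSplit

X = features.drop(columns=["zika_cases"]).values  # assumed label column name
y = (features["zika_cases"] > 0).astype(int).values

# Rank features with XGBoost and keep those above a chosen importance threshold.
selector = XGBClassifier(n_estimators=100)
selector.fit(X, y)
threshold = 0.01                                  # illustrative threshold value
keep = selector.feature_importances_ > threshold
X_selected = X[:, keep]

# Stratified 70/30 split preserves the positive/negative class ratio.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(sss.split(X_selected, y))
X_train, X_test = X_selected[train_idx], X_selected[test_idx]
y_train, y_test = y[train_idx], y[test_idx]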
Model Building, Prediction and Evaluation:
We use scikit-learn with Intel DAAL.
Balancing the dataset: The dataset is highly imbalanced, with 86% of the data containing positive Zika cases. This imbalance is handled by the SMOTETomek (SMOTE + Tomek)* algorithm, which generates a new, smoted dataset that addresses the unbalanced-class problem. It artificially generates observations of the minority class using the nearest neighbours of elements of that class, balancing the training dataset by combining over- and under-sampling via SMOTE and Tomek links.
Model Building and Training: In this stage, machine-learning models are selected for training. All classifiers in scikit-learn use a fit(X, y) method to fit the model to the given train data X and train labels y. To compare the performance of various models, an ensemble of classifiers is used. Once a model is trained, it can be used for prediction. We tested AdaBoost, XGBoost, SVM, Multilayer Perceptron and Logistic Regression. A significant performance gain is observed when we use Dask with Intel TBB for tuning hyperparameters via GridSearchCV.
Prediction: During this stage, the trained model predicts the output for a given input based on its learning. That is, given an unlabelled observation X, predict(X) returns the predicted label y.
Evaluation: Various metrics are available to measure the performance of a model. We used accuracy, precision and recall as our evaluation metrics to choose the best model for the problem. A sketch of the balancing, training and evaluation stages follows.
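A minimal sketch of these stages, using XGBoost as the illustrative classifier and the SMOTETomek implementation from imbalanced-learn (fit_resample in recent versions); the hyperparameter values shown are placeholders:

from imblearn.combine import SMOTETomek
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Rebalance only the training set; the test set keeps its natural distribution.
smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X_train, y_train)

model = XGBClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_res, y_res)

y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))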
Technologies Used
Implementation Details:
We used Intel Distribution for Python* and the Python API for the Intel Data Analytics Acceleration Library (Intel DAAL), named PyDAAL, to boost machine-learning and data-analytics performance. Taking advantage of the optimized scikit-learn* (scikit-learn with Intel DAAL) that comes with the distribution, we were able to achieve good results for the prediction problem.
We used Intel Distribution for Python 3, installed via conda (conda create -c intel -n idp intelpython3_full python=3).
Python API for Intel DAAL (PyDAAL)
All the pre-processing steps up to the data split are the same in the PyDAAL implementation. Every algorithm in DAAL accepts inputs in the form of NumericTables, a generic data type for representing data in memory. Since all the converted features are of the same type, we used HomogenNumericTables for representation. The dataset obtained after feature selection in scikit-learn is a NumPy ndarray; this can easily be converted to a HomogenNumericTable using the built-in function in the DAAL data-management module, as sketched below.
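A minimal sketch of the conversion, assuming the legacy PyDAAL API in which a HomogenNumericTable can be constructed directly from a C-contiguous float64 ndarray:

import numpy as np
from daal.data_management import HomogenNumericTable

# The feature matrix from scikit-learn preprocessing is a NumPy ndarray.
# DAAL algorithms consume NumericTables; a contiguous float64 array maps
# directly onto a HomogenNumericTable.
X_train_c = np.ascontiguousarray(X_train, dtype=np.float64)
train_table = HomogenNumericTable(X_train_c)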
DAAL provides a significant speed-up compared to the non-optimized scikit-learn version.
We implemented AdaBoost, BrownBoost and SVM using PyDAAL.
Intel Distribution for Python offers Intel MKL-accelerated packages such as NumPy and scikit-learn, along with PyDAAL, to boost machine-learning and data-analytics performance.
Further, we made use of Dask. Dask.distributed is a lightweight library for distributed computing in Python. We used it in our core program to run GridSearchCV for the best estimator, XGBoost (gradient tree boosting). We chose XGBoost because it is a distributed gradient-boosting library designed to be highly efficient, flexible and portable.
The XGBoost classifier is run with the parameter n_jobs=-1 (to use the maximum number of threads).
To implement this, we launched a local Dask.distributed client on our machine:
import dask
from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV, RandomizedSearchCV

client = Client()  # no arguments: starts a local scheduler and workers on this machine
This offers a substantial improvement in performance when we have to implement a complex pipeline that applies a series of transformations (normalisation, PCA, etc.) to the input data. A sketch of the distributed search is shown below.
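A minimal sketch of the distributed search, with a hypothetical parameter grid (the tuned grid lives in the repository scripts):

from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV
from xgboost import XGBClassifier

client = Client()  # local cluster; candidate fits are scheduled across its workers

# Hypothetical grid; the actual tuned values are in the repository.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}

search = GridSearchCV(XGBClassifier(n_jobs=-1), param_grid, cv=3)
search.fit(X_train, y_train)  # each candidate fit runs as a Dask task
print(search.best_params_)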
We tried to enable threading composability between two or more thread-enabled libraries. Threading composability can accelerate programs by avoiding inefficient thread allocation (called oversubscription), which occurs when there are more software threads than available hardware resources.
A substantial improvement is observed when a task pool, such as the ThreadPool from the standard library or libraries like Dask or Joblib, executes tasks that call compute-intensive functions of NumPy/SciPy/PyDAAL and others, which are in turn parallelised using Intel MKL and/or Intel Threading Building Blocks (Intel TBB).
All scripts were run with the modifier flag -m tbb, which enables Intel Threading Building Blocks.
Example: python -m tbb /path/to/your/code
Composable parallelism was thus achieved via Dask and Intel TBB: enabling threading composability between thread-enabled libraries avoids oversubscription when there are more software threads than available hardware resources.
Repository
https://github.com/abhijitmjj/Prediction-of-epidemic-disease-dynamics-using-Machine-learning-model