DPFree Data Generation
Sayash Raaj
Obtain a differentially private dataset that mirrors the original while safeguarding individual privacy.
Project status: Under Development
oneAPI, Artificial Intelligence
Intel Technologies: oneAPI, Intel CPU, Intel Python
Overview / Usage
Data is the lifeblood of modern artificial intelligence. Getting the right data is both the most important and the most challenging part of building powerful AI. Collecting quality data from the real world is complicated, expensive and time-consuming. This is where synthetic data comes in.
Synthetic data is information that's artificially manufactured rather than generated by real-world events. Synthetic data is created algorithmically, and it is used as a stand-in for test datasets of production or operational data, to validate mathematical models and, increasingly, to train machine learning models.
The project runs on Intel hardware (an Intel i5 processor in a 2015 MacBook Pro), and there is scope to boost performance with Intel software such as Intel® Optimization for TensorFlow*, Intel® Optimization for PyTorch*, and Intel® Extension for Scikit-learn*, among others.
Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. A differentially private synthetic dataset looks like the original dataset - it has the same schema and attempts to maintain properties of the original dataset (e.g., correlations between attributes) - but it provides a provable privacy guarantee for individuals in the original dataset.
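To make the idea concrete, here is a minimal sketch of the Laplace mechanism, one of the standard building blocks of differential privacy. It is not part of the project's code: the function name `laplace_count`, the sample `ages` list, and the choice of epsilon are assumptions made purely for illustration. A counting query changes by at most 1 when one person is added or removed, so adding Laplace noise with scale 1/epsilon yields an epsilon-differentially private answer.

```python
import numpy as np


def laplace_count(values, predicate, epsilon, rng):
    """Return a noisy count of the items matching `predicate`.

    Hypothetical helper for illustration. A count has sensitivity 1
    (one individual changes it by at most 1), so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise


# Illustrative data only; not taken from the project.
rng = np.random.default_rng(0)
ages = [34, 51, 29, 62, 45, 38, 70, 55]
noisy = laplace_count(ages, lambda a: a >= 50, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 50+: {noisy:.2f}")
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate answer and a weaker guarantee.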
Using the Synthetic Observation Generation with Differential Privacy using GANs (SOGDPG) solution, we obtain shareable synthetic observations whose privacy risk to individuals in the original data is provably bounded.
Medical records, among the most sensitive data in any industry, can be shared far more freely when generated using SOGDPG, because the private information is removed and replaced with synthetically generated observations that are designed to be indistinguishable from real ones.
In India, the Covid-19 pandemic disrupted the nation with its unprecedented severity. Multiple attempts to build software solutions hit a roadblock because data was unavailable from its sources, owing to inefficient logistics, the sensitivity of the data, and a workforce prioritised towards handling the pandemic. SOGDPG could address this: the data that is already available could be used to generate multiple synthetic, indistinguishable datasets for medical software solutions to work on.
Methodology / Approach
The project runs on Intel hardware (an Intel i5 processor), with scope to boost performance using Intel® Optimization for TensorFlow*, Intel® Optimization for PyTorch*, and Intel® Extension for Scikit-learn*.
Using the Synthetic Observation Generation with Differential Privacy using GANs (SOGDPG) solution, we obtain a differentially private synthetic dataset. It has the same schema as the original dataset and attempts to maintain its properties (e.g., correlations between attributes), while providing a provable privacy guarantee for individuals in the original dataset.
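The description above does not say how SOGDPG makes GAN training private. One widely used approach for differentially private GANs is DP-SGD applied to the discriminator: clip each example's gradient to a fixed norm, then add Gaussian noise before averaging. The sketch below shows only that single step in NumPy, under the assumption that such a scheme is used; `dp_average_gradient`, `clip_norm`, and `noise_multiplier` are hypothetical names, and the gradients are made-up numbers.

```python
import numpy as np


def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip per-example gradients and add Gaussian noise (one DP-SGD step).

    Hypothetical helper for illustration; not the project's implementation.
    """
    grads = np.asarray(per_example_grads, dtype=float)
    # L2 norm of each example's gradient, shape (batch, 1).
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale down only the gradients whose norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


# Made-up gradients for a batch of two examples.
rng = np.random.default_rng(0)
batch_grads = np.array([[3.0, 4.0], [0.3, 0.4]])
update = dp_average_gradient(batch_grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping bounds how much any single record can influence the update, and the noise masks what influence remains. The overall privacy budget then depends on the noise multiplier, the sampling rate, and the number of training steps, which a privacy accountant would track in a full implementation.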
Technologies Used
Intel hardware: Intel i5 processor, with scope for boosting performance using Intel® Optimization for TensorFlow*, Intel® Optimization for PyTorch*, and Intel® Extension for Scikit-learn*
Intel machine learning frameworks: scikit-learn, TensorFlow
Intel® Distribution for Python* with highly optimized scikit-learn*
Intel® Extension for TensorFlow*