
Synthetic Data: The Savior of AI in Healthcare?

By Dr. Amit K Shah, PhD, Founder, GNS-AI LLC

Healthcare data comes at a premium for data scientists and AI researchers. HIPAA, GDPR, and other laws and regulations, in the interest of protecting patient privacy, pose a barrier to accessing data that is valuable and mission-critical to the advancement of healthcare. Even when data is available, quality is often an issue: data drawn from multiple sources can be incomplete, imbalanced, or sparse. Synthetic data is becoming an increasingly viable way to work around many of these challenges. However, not all synthetic data is created equal, and particular methodologies and controls must be in place before it can properly be used to train AI models in healthcare.

First, we must define what we mean by synthetic data. Synthetic data is data that was not collected or observed; it is generated by an algorithm or model. Several methods are most often used to generate it, and each carries its own advantages and disadvantages.

The first and most well-known approach is simulation. Simulations in healthcare can be run at varying levels of resolution, from the system level down to the molecular level, and with varying degrees of complexity. Because simulations are often process-based, they tend to model causal interactions. Some of the most popular simulations involve digital twins, which are digital representations of physical entities. Digital twins are becoming increasingly useful in clinical trials as synthetic controls to compare against an experimental group. They can receive data and inputs from external real-world entities, but they can also be used to generate synthetic, or simulated, data. Simulated data may be based on existing data, but it preserves privacy because it is an artificial construct that cannot be traced back to a particular individual. However, complex simulations can be time-consuming to create and carry additional computational overhead, often requiring high-performance computing to generate high volumes of data. The advantage is that data produced through simulation is often causally driven and transparent in terms of how and why it was generated. By extension, this data can often be trusted for labeling purposes in supervised machine learning, and simulated datapoints can be generated for hypothetical scenarios.
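As a simple illustration of process-based generation, the sketch below simulates hourly resting heart-rate traces for a cohort of hypothetical patients from a toy causal model (a patient-specific baseline, a circadian dip, and measurement noise). The model and every parameter in it are illustrative assumptions, not a validated physiological simulation or digital twin.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def simulate_heart_rate(n_patients=100, hours=24):
    """Generate synthetic resting heart-rate traces from a toy
    process-based model: a patient-specific baseline, a circadian
    overnight dip, and measurement noise. Parameters are illustrative,
    not clinically validated."""
    t = np.arange(hours)                                  # hourly timestamps
    baseline = rng.normal(70, 8, size=(n_patients, 1))    # per-patient resting baseline (bpm)
    circadian = -8 * np.cos(2 * np.pi * (t - 4) / 24)     # overnight dip, daytime rise
    noise = rng.normal(0, 2, size=(n_patients, hours))    # sensor/measurement noise
    return baseline + circadian + noise                   # shape: (n_patients, hours)

synthetic_hr = simulate_heart_rate()
print(synthetic_hr.shape)  # (100, 24)
```

Because every synthetic trace is produced by the model rather than copied from a patient record, the mechanism behind each value is transparent, which reflects the causal and labeling advantages described above.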

The second approach uses machine learning methods that identify and interpolate between existing data samples. These include the Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN). Such methods generally work by finding the neighbors of a datapoint of interest and interpolating between that point and its neighbors to construct "synthetic" samples. These methods are largely limited to tabular data. They also assume that the decision boundary does not intersect the direct path between a sample and its neighbors, i.e. that the decision boundary local to the samples of interest is approximately linear. High volumes of data can be generated quickly, but with complex decision boundaries the rate of mislabeled synthetic points rises, and the synthetic data can become increasingly unrealistic, with combinations of feature values that would never occur in reality. There is also a heavy reliance on the quality of the data used for generation: the poorer the source data, the more dangerous the synthetic datapoints become, as they are increasingly likely to be deleterious to the AI models trained on them. Finally, because these synthetic datapoints are interpolations of real records, they can be traced back to the original samples, which limits the privacy protection that synthetic data is often expected to provide.
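To make the interpolation mechanics concrete, the sketch below generates SMOTE-style synthetic rows by blending each sampled minority-class point with one of its nearest minority-class neighbors. It is a minimal illustration on random toy data, not a full implementation; in practice one would reach for a maintained library such as imbalanced-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(seed=0)

def smote_like_oversample(X_minority, n_new, k=5):
    """Create synthetic minority-class rows by interpolating between a
    sampled point and one of its k nearest minority-class neighbors,
    in the spirit of SMOTE."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)           # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))        # pick a minority sample
        j = rng.choice(idx[i, 1:])               # pick one of its neighbors
        lam = rng.random()                       # interpolation weight in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)

# toy example: 20 minority-class rows with 4 tabular features
X_min = rng.normal(size=(20, 4))
X_syn = smote_like_oversample(X_min, n_new=50)
print(X_syn.shape)  # (50, 4)
```

Note that every synthetic row lies on a line segment between two real rows, which is exactly why such points remain traceable to the source records and why a nonlinear local decision boundary can leave them mislabeled.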

Finally, we have generative deep learning models, which are increasingly gaining prominence and attention. Generative models learn the latent space of existing data and use it to produce new data. Deep learning methods include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). A VAE learns to encode a high-dimensional dataset into a smaller latent representation and then, through a second network, to reconstruct the original data; a GAN instead trains a generator to map latent vectors to realistic data while a discriminator network learns to tell real samples from synthetic ones. In either case, sampling and perturbing the latent space lets these networks generate synthetic datapoints. However, as with SMOTE and ADASYN, some of these samples can be unrealistic or incorrectly labeled, and there is again a reliance on the quality of the original dataset: the poorer the data, the greater the chance that the synthetic datapoints are unsuitable for AI modeling. These deep neural networks can also be harder to train, although once trained they can generate a high volume of data in a short amount of time. Finally, it is not immediately apparent how a given synthetic datapoint was created or which source data it emerged from.
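The sketch below shows the variational-autoencoder side of this idea in PyTorch: an encoder compresses tabular rows into a small latent space, a decoder reconstructs them, and new synthetic rows are produced by sampling that latent space. The architecture, feature count, and latent dimension are arbitrary assumptions for illustration, and the model would still need to be trained and validated on real, de-identified data before its outputs could be trusted.

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Minimal variational autoencoder sketch for tabular data.
    Layer sizes and the latent dimension are illustrative only."""

    def __init__(self, n_features=20, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)       # latent mean
        self.to_logvar = nn.Linear(32, latent_dim)   # latent log-variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

    @torch.no_grad()
    def generate(self, n_samples):
        # sample the latent space and decode to produce synthetic rows
        z = torch.randn(n_samples, self.to_mu.out_features)
        return self.decoder(z)

# After training on real, de-identified tabular data, synthetic rows
# would be drawn by sampling the learned latent space:
vae = TabularVAE()
synthetic_rows = vae.generate(100)
print(synthetic_rows.shape)  # torch.Size([100, 20])
```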

Synthetic data samples can be useful for a variety of reasons: correcting imbalanced data by synthesizing new samples for a minority or under-represented class before building a classification model, testing production systems, masking patients and protecting their privacy, exposing and probing AI model weaknesses through testing with synthetic datapoints, and generating hypothetical scenarios. However, the method of synthetic data generation matters greatly. Practitioners must understand and weigh the limitations of each framework when generating synthetic data, and they must implement proper protocols to ensure its quality, or their machine learning and artificial intelligence enterprise may be at risk.

BIO: 
Dr. Amit Shah received his Ph.D. in Biomedical Engineering from the University of Illinois at Chicago. A native New Yorker, Dr. Shah is a computer scientist, trained neuroscientist, and A.I. researcher.

He has served in industry for over five years as a data scientist, most recently as Data Science Manager in Abbott's Diagnostics Division. He has won corporate-wide hackathons; worked on digital twins, simulations, computer vision, and natural language processing models; and conducted studies on the rehabilitation of stroke and traumatic brain injury survivors using virtual reality and augmented reality devices. He has programmed haptic forces in robots and virtual reality interactions, and did the majority of his Ph.D. research at the Shirley Ryan AbilityLab (formerly the Rehabilitation Institute of Chicago).

Currently, Dr. Shah is the president and owner of GNS-AI LLC, an emerging technologies firm that specializes in accelerating A.I. and data analytics integration in corporations, healthcare institutions, and government agencies. GNS-AI's chief mission is to treat data as a strategic asset and to actualize its potential through artificial intelligence, delivering insights to the critical person at the right time.
