Month: April 2021

Using CTGAN to synthesise fake patient data

Being a member of the Computational Oncology lab with no more than my A levels (which I never sat!) and an unconditional offer to study Computer Science at Imperial College London from October 2021 has been a great opportunity, one I am extremely grateful for, albeit a bit daunting when sitting alongside colleagues with their variety of PhDs.

A major issue in the medical world is that patient data is highly confidential and private, which makes this limited resource difficult to get hold of. One potential solution is to use a GAN (Generative Adversarial Network) called CTGAN to generate realistic synthetic patient data based on real, private patient data. The goal of a project that I have been working on is to take in real tabular patient data, train a CTGAN model on it and have the model output synthetic data that preserves the correlations between the columns of the real data. The model can generate as many synthetic patients as one desires, the synthetic data can undergo the same analysis techniques that researchers would use on real patient data, and it can be made publicly available, as no private data is accessible through it.
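One simple way to check whether correlations between columns have been preserved is to compare the correlation of a pair of columns in the real data with the same pair in the synthetic data. The sketch below is only an illustration of that idea, not part of CTGAN: the column names, the toy values and the helper `correlation_gap` are all made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_gap(real, synthetic, col_a, col_b):
    """Absolute difference between the real and synthetic correlation of two columns."""
    return abs(pearson(real[col_a], real[col_b])
               - pearson(synthetic[col_a], synthetic[col_b]))

# Hypothetical toy tables (column name -> list of values).
real = {"age": [40, 50, 60, 70], "tumour_size_mm": [12, 15, 19, 22]}
good_synthetic = {"age": [45, 55, 65, 75], "tumour_size_mm": [13, 16, 18, 23]}
bad_synthetic = {"age": [45, 55, 65, 75], "tumour_size_mm": [23, 18, 16, 13]}

print(correlation_gap(real, good_synthetic, "age", "tumour_size_mm"))  # small gap
print(correlation_gap(real, bad_synthetic, "age", "tumour_size_mm"))   # large gap
```

A gap close to zero suggests the relationship between the two columns has survived; a large gap suggests it has been lost or reversed.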

Many constraints can be placed on the GAN to make the synthetic data more realistic, as some columns must obey fixed rules. As of right now, there are 4 constraints that can be placed on the model. First, there is the ‘Custom Formula’ constraint, which could be used to preserve the formula ‘years taking prescription = age – age when prescription started’. Next, the ‘Greater Than’ constraint would ensure that ‘age’ is always greater than ‘age when prescription started’. Third, the ‘Unique Combinations’ constraint could be used to ensure that a ‘City’ is only synthesised alongside its appropriate ‘Country’. Finally, there is a constraint called APII (Anonymising Personally Identifiable Information), which ensures that no private information is copied from the real data to the synthetic data. APII works with many different confidential fields, such as Name, Address, Country or Telephone Number, by replacing these fields with pre-set, non-existent data entries from a large database called Faker Providers.
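To make the first three constraints concrete, here is a minimal sketch of the rule each one enforces, written as plain Python checks on a single row. This is not the CTGAN or SDV constraint API; the function names, column names and sample rows are hypothetical and chosen only to mirror the examples above.

```python
# Plain-Python checks showing what each constraint enforces on one row.
# All names and values here are illustrative assumptions.

def satisfies_custom_formula(row):
    """Custom Formula: years on prescription = age - age when prescription started."""
    return row["years_on_prescription"] == row["age"] - row["age_at_start"]

def satisfies_greater_than(row):
    """Greater Than: age must exceed the age when the prescription started."""
    return row["age"] > row["age_at_start"]

def satisfies_unique_combinations(row, allowed_pairs):
    """Unique Combinations: only (city, country) pairs seen in the real data are allowed."""
    return (row["city"], row["country"]) in allowed_pairs

real_rows = [
    {"age": 64, "age_at_start": 60, "years_on_prescription": 4,
     "city": "London", "country": "UK"},
    {"age": 51, "age_at_start": 49, "years_on_prescription": 2,
     "city": "Paris", "country": "France"},
]
# The allowed combinations are learned from the real data.
allowed_pairs = {(r["city"], r["country"]) for r in real_rows}

def is_valid(row):
    """True if a row satisfies all three constraints."""
    return (satisfies_custom_formula(row)
            and satisfies_greater_than(row)
            and satisfies_unique_combinations(row, allowed_pairs))

good_row = {"age": 70, "age_at_start": 65, "years_on_prescription": 5,
            "city": "Paris", "country": "France"}
bad_row = {"age": 70, "age_at_start": 65, "years_on_prescription": 5,
           "city": "Paris", "country": "UK"}  # Paris paired with the wrong country

print(is_valid(good_row))
print(is_valid(bad_row))
```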

I have been working on adding a new constraint called ‘Custom ID’. This constraint applies when 2 columns are just encodings of each other, which occurs most frequently with ID columns. Without this constraint, the 100% correlation between a discrete column and its respective ID column would not be preserved. The constraint works by first comparing each column against every other column in the data. If any encodings are found, the discrete column is kept and the numeric ID column is removed. This is done because the ID column is an encoding of a discrete column but would be identified by CTGAN as continuous, so CTGAN would model it incorrectly. Once the ID column is removed, a lookup table is created which links each value in the discrete column to its respective ID. The CTGAN model is then trained on the data without the ID column. Finally, when sampling synthetic data, the ID is added back into the synthetic data using the lookup table.
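The steps above can be sketched in a few lines of plain Python. This is an illustrative reconstruction under stated assumptions, not the actual Custom ID code: tables are represented as a dictionary of column name to list of values, ID columns are assumed to hold integers and discrete columns strings, all function names are hypothetical, and CTGAN itself is replaced by a hand-written stand-in for the sampled data. Unlike the solution described here, this simple version scans every row.

```python
def is_encoding(a, b):
    """True if the values of a and b map one-to-one onto each other."""
    pairs = set(zip(a, b))
    return len(pairs) == len(set(a)) == len(set(b))

def find_id_columns(table):
    """Return {id_column: discrete_column} for integer columns that encode a string column."""
    found = {}
    for id_col, id_vals in table.items():
        if not all(isinstance(v, int) for v in id_vals):
            continue  # only integer columns are treated as candidate IDs
        for name, vals in table.items():
            if (name != id_col
                    and all(isinstance(v, str) for v in vals)
                    and is_encoding(vals, id_vals)):
                found[id_col] = name
                break
    return found

def strip_id_columns(table):
    """Remove ID columns and build a lookup table for each one."""
    id_cols = find_id_columns(table)
    lookups = {
        id_col: (disc, dict(zip(table[disc], table[id_col])))
        for id_col, disc in id_cols.items()
    }
    training = {k: v for k, v in table.items() if k not in id_cols}
    return training, lookups

def restore_id_columns(synthetic, lookups):
    """Add each ID column back to sampled data using its lookup table."""
    restored = dict(synthetic)
    for id_col, (disc, mapping) in lookups.items():
        restored[id_col] = [mapping[v] for v in synthetic[disc]]
    return restored

# Hypothetical real data: tumour_type_id is an encoding of tumour_type.
real = {
    "tumour_type": ["glioma", "meningioma", "glioma", "lymphoma"],
    "tumour_type_id": [1, 2, 1, 3],
    "age": [64, 51, 47, 70],
}
training, lookups = strip_id_columns(real)  # the model would be trained on `training`

# Stand-in for rows sampled from a trained model (no ID column present).
synthetic = {"tumour_type": ["lymphoma", "glioma"], "age": [58, 66]}
restored = restore_id_columns(synthetic, lookups)
print(restored["tumour_type_id"])
```

Because the ID is looked up rather than generated, every synthetic row is guaranteed to carry the ID that matches its discrete value.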

This solution has the advantage of running quickly, as the time complexity is not based on the number of rows in the real data. It is also easy to use, as it can be turned on and off with one input. Finally, my solution will identify all of the ID columns in the data and create a lookup table for each of them. One limitation is that my Custom ID constraint would not detect 3 columns that are encodings of one another (e.g. ‘Gender’, ‘Gender Abbreviation’ and ‘Gender ID’), although this situation occurs rarely.

As I have the long-term goal of synthesising patient data, the synthetic data must be secure, in the sense that it must not be possible to reverse engineer the real data from the synthetic data. One test, available in SDGym, is the LogisticDetection metric, which randomly mixes the real and synthetic data and passes them to a discriminator that attempts to label each incoming row as real or synthetic. This test showed that the data can be correctly identified as real or synthetic just over 50% of the time, which is barely better than guessing. However, when we are dealing with analysis of medical data, I feel there are still steps to be taken to make the synthetic data more accurate before serious analysis on it can begin.
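The idea behind a detection test can be sketched with a small logistic regression written from scratch. This is a simplified stand-in, not the SDGym implementation: it reports plain held-out accuracy, the function names are hypothetical, and the "real" and "synthetic" rows are random numbers generated purely for illustration.

```python
import math
import random

def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def detection_accuracy(real, synthetic, seed=0):
    """Mix real (label 1) and synthetic (label 0) rows, train on 75%, score on 25%."""
    rows = [(r, 1) for r in real] + [(s, 0) for s in synthetic]
    random.Random(seed).shuffle(rows)
    n = len(rows[0][0])
    # Standardise each feature so gradient descent behaves well.
    means = [sum(r[0][j] for r in rows) / len(rows) for j in range(n)]
    stds = [(sum((r[0][j] - means[j]) ** 2 for r in rows) / len(rows)) ** 0.5 or 1.0
            for j in range(n)]
    X = [[(x[j] - means[j]) / stds[j] for j in range(n)] for x, _ in rows]
    y = [label for _, label in rows]
    split = len(rows) * 3 // 4
    w, b = train_logistic(X[:split], y[:split])
    correct = sum(
        (sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1)
        for xi, yi in zip(X[split:], y[split:])
    )
    return correct / (len(rows) - split)

rng = random.Random(42)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
similar = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]      # same distribution
different = [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(200)]  # easy to spot

print("similar:", detection_accuracy(real, similar))
print("different:", detection_accuracy(real, different))
```

An accuracy near 50% means the classifier cannot tell the two sources apart, while an accuracy near 100% means the synthetic data is easy to detect.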

Dipping my toes into the world of machine learning has been extremely fascinating and I have learned many new things, about both machine learning and coding more generally. I now realise how important the complexity of my code is: if the code has a bad time complexity, the program could take days to run, which is not practical. Another lesson I learned the hard way is not to be afraid to restart. I completely rewrote my code after finishing a previous, working solution to the Custom ID constraint, because that code was too complex and was taking hours to run. This allowed me to learn from my mistakes and reach a much better solution.

I now hope to train a CTGAN model on the GlioCova dataset, which contains the medical records of over 50,000 cancer patients, and measure how well CTGAN performs on a large, relational database.





COVID and the return to Research

The 14th of March marked the end of my redeployment to a support role in Intensive Care and the second interruption of my PhD due to the pandemic. Academic papers and lines of data were replaced by disembodied voices as I endeavoured to keep two wards’ worth of family members updated on their loved ones’ progress. With strict restrictions on visitation, this daily conversation was often the only insight into how their relatives were recovering, and in many cases, how they weren’t. Whether I’ll be due a third pandemic-related sabbatical is yet to be seen. In the past few weeks, I’ve personally witnessed the steady downtick of COVID-related admissions. Beds once filled with ventilated patients are now occupied by those in need of ITU-level monitoring following delayed essential procedures. Things are no less busy, but the grip of COVID has loosened and at the very least, there is a measure of respite.

Today, I replace my ITU hat with the academic hat I hung up two months ago. My oncology hat continues to gather dust, awaiting its eventual turn. Thinking back to my time as a clinician, one of my motivations for delving into the world of code and computational solutions is their ability to capture and manipulate data that is often overlooked in day-to-day practice. Medical data is costly, both in time and manpower. Filling in request forms, going through the scanning process, having labs do bloodwork and waiting for a report to be generated are all steps taken to produce what is often a single data point, which is subsequently consigned to the medical archives. As our technology advances, so too does the information we capture from investigations, as well as our ability to store and read it on a larger, integrated scale. This could enable the discovery of complex relationships that would otherwise not have fit onto a blackboard or spreadsheet. Pairing this with the zeitgeist that is the renewed interest in artificial intelligence, we now have the technology to realise complex manipulation of large datasets at a level previously unattainable, bursting open the barriers that previously held us back.

One avenue of unused data lies in opportunistic imaging. Cross-sectional imaging, such as Magnetic Resonance Imaging (MRI) or Computed Tomography (CT), is commonly used in cancer care. These scans reconstruct “slices” through the scanned body which clinicians can scroll through to visualise internal structures. In cancer, the main reason to do this is to evaluate how the cancer is responding to a treatment. Simply put, if a cancer is in the lung, the focus of imaging and attention will be on the lung. But scanning the chest of a lung cancer patient invariably picks up other organs as well, including the heart, bones, muscle and fatty tissue present in all of us. Unless there is an obvious abnormality in these other organs (such as a grossly enlarged heart, or cancer deposits elsewhere in the chest), they are barely commented on and become unused by-products. There is information to be gained by reviewing these other organs, but until now there have not been the tools to fully realise it.

In a landmark 2009 study, Prado et al. measured muscle mass in obese cancer patients using CT scans obtained routinely during their cancer management. The muscle area was measured at the level of the third lumbar vertebra, because the area at this level correlates with total muscle mass. This was subsequently corrected for height to give a skeletal muscle index. Prado et al. found a relationship between low muscle index and survival, creating the label of sarcopenic obesity, with a diagnostic cut-off determined by optimum stratification (below 55 cm²/m² in men and 39 cm²/m² in women). Sarcopenia was a new concept in cancer care at that point, having previously been used mainly in an ageing context to define frailty associated with poor muscle mass. In the frailty literature, where there is limited access to CT or MRI imaging, assessments were usually functional (such as a defined set of exercises) or involved plain X-ray imaging of limbs for reasons of practicality, cost and radiation dose.
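As a worked example of the height correction, the skeletal muscle index is the muscle area in cm² divided by height in metres squared, giving units of cm²/m². The sketch below uses the cut-offs quoted above; the function names and the patient measurements are made up for illustration, and the check covers only the muscle index, not the obesity criterion.

```python
def skeletal_muscle_index(muscle_area_cm2, height_m):
    """Muscle area at the third lumbar vertebra, corrected for height (cm^2/m^2)."""
    return muscle_area_cm2 / height_m ** 2

# Cut-offs quoted in the text above (cm^2/m^2).
CUTOFFS = {"male": 55.0, "female": 39.0}

def below_cutoff(muscle_area_cm2, height_m, sex):
    """True if the skeletal muscle index falls below the quoted cut-off for that sex."""
    return skeletal_muscle_index(muscle_area_cm2, height_m) < CUTOFFS[sex]

# Hypothetical measurements, not real patients.
print(round(skeletal_muscle_index(150.0, 1.75), 2))  # 150 / 1.75**2
print(below_cutoff(150.0, 1.75, "male"))
print(below_cutoff(130.0, 1.60, "female"))
```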

For cancer sarcopenia, assessment of muscle index has been repeated by other groups in single-centre studies across a variety of tumour types and geographic locations. Even when correcting for sex and height, there are enough other uncorrected factors that the range of cut-offs for pathological sarcopenia is too wide to be of practical utility (29.6-41 cm²/m² in women and 36-55.4 cm²/m² in men). Another limitation is that imaging practices differ between tumour sites. For instance, in the case of brain tumours, there is less of a need to look for extracranial disease and thus no imaging is available at the level of the third lumbar vertebra for analysis.

In the current age of personalised medicine, being able to create individualised risk profiles based on the incidental information gained from necessary clinical imaging would add utility to scan results without adding clinical effort. For my PhD, the goal is to overcome this challenge using transfer learning from an existing detailed dataset. We’ve had the fortune to secure access to the UK Biobank, a medical compendium of half a million UK participants. The Biobank includes biometrics, genomics, blood tests and imaging, as well as medical history. Such a rich dataset is ripe for machine learning tasks.

I have therefore been working to integrate several high-dimensional datasets, applying a convolutional neural network to the imaging data and a deep neural network to the non-imaging data. A dimensionality reduction technique such as an autoencoder will subsequently have to be applied to generate a clinically workable model. Being primarily a clinician, I am able to bring clinical rationale to the model and intuit the origin of certain biases in my prototype pipelines. On the flip side, I have struggled to become fluent in the code necessary to tackle these problems, and often still feel like I am at the level of asking for directions to the bathroom back in my GCSE German days.

In becoming a hybrid scientist, I’ve long since acknowledged that I will not be the best coder in the room. I am still climbing the steep learning curve of computer languages and code writing, and I am grateful for this opportunity to realise my potential in this field. Machine learning is ever encroaching not just on our daily lives but also on our clinical practice, usually for the better. I imagine that as early adopters turn into the early majority, those of us who have chosen to embrace this technology will be in a position to better develop and understand the tools that will benefit our future patients. After all, soon we will not just be collaborating with each other but also with Dr Siri and Dr Alexa.