Clinical vignettes in benchmarking the performance of online symptom checkers

In the USA, over one-third of adults self-diagnose their conditions using the internet, including queries about urgent (eg, chest pain) and non-urgent (eg, headache) symptoms. The main issue with self-diagnosing through search engines such as Google and Yahoo is that users may get confusing or inaccurate information and, in the case of urgent symptoms, may not be aware of the need to seek emergency care. In recent years, various online symptom checkers (OSCs) based on algorithms or artificial intelligence (AI) have emerged to fill this gap.

Online symptom checkers are calculators that ask users to enter details about their symptoms, along with personal information such as gender and age. Using algorithms or AI, the symptom checkers propose a range of conditions that fit the symptoms the user is experiencing. Developers promote these digital tools as a way of saving time for patients, reducing anxiety and giving patients the opportunity to take control of their own health.

The diagnostic function of online symptom checkers is aimed at educating users on the range of possible conditions that may fit their symptoms. In addition to presenting possible conditions, online symptom checkers have a triage function that prioritises users' health needs by guiding them on whether they should self-care for the condition they are describing or seek professional healthcare support.3 This added functionality could vastly enhance the usefulness of online symptom checkers by alerting people to when they need to seek emergency support and when non-emergency care is appropriate for common or self-limiting conditions.

In a study published in the journal BMJ Open, we assessed the suitability of vignettes for benchmarking the performance of online symptom checkers. Our approach included providing the vignettes to an independent panel of single-blinded physicians to arrive at an alternative set of diagnostic and triage solutions. The secondary aim was to benchmark the safety of a popular online symptom checker (Healthily) by measuring the extent to which it provided the correct diagnosis and triage solutions for a standardised set of vignettes, as defined by a panel of physicians.

We found significant variability of medical opinion depending on which group of GPs considered the vignette script. Consolidating the output of two independent GP roundtables (one from RCGP and another panel of independent GPs) resulted in a more refined third iteration (the consolidated standard), which more accurately included the ‘correct’ diagnostic and triage solutions conferred by the vignette script. This was demonstrated by the significant extent to which the measured performance of online symptom checkers improved when the benchmark moved from the original standard to the final consolidated standard.
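To make the dependence on the reference standard concrete, the following is a minimal sketch of how the same symptom checker output can score differently against two standards. The vignette IDs, condition lists, and the `top_k_accuracy` function are hypothetical and chosen purely for illustration; they are not data or scoring rules from the study.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    checker_output: Dict[str, List[str]],
    standard: Dict[str, Set[str]],
    k: int = 3,
) -> float:
    """Fraction of vignettes where at least one of the checker's top-k
    suggested conditions appears among the standard's accepted diagnoses."""
    hits = 0
    for vignette_id, accepted in standard.items():
        suggestions = checker_output.get(vignette_id, [])[:k]
        if any(condition in accepted for condition in suggestions):
            hits += 1
    return hits / len(standard)


# Hypothetical ranked suggestions from one symptom checker (illustrative only).
checker_output = {
    "v1": ["migraine", "tension headache", "sinusitis"],
    "v2": ["gastroenteritis", "appendicitis"],
    "v3": ["costochondritis", "angina"],
    "v4": ["viral pharyngitis"],
}

# Hypothetical original standard: one accepted diagnosis per vignette.
original_standard = {
    "v1": {"tension headache"},
    "v2": {"appendicitis"},
    "v3": {"myocardial infarction"},
    "v4": {"tonsillitis"},
}

# Hypothetical consolidated standard: a wider range of accepted diagnoses
# after pooling the opinions of more than one physician panel.
consolidated_standard = {
    "v1": {"tension headache", "migraine"},
    "v2": {"appendicitis", "gastroenteritis"},
    "v3": {"myocardial infarction", "angina"},
    "v4": {"tonsillitis", "viral pharyngitis"},
}

print(top_k_accuracy(checker_output, original_standard, k=3))      # 0.5
print(top_k_accuracy(checker_output, consolidated_standard, k=3))  # 1.0
```

In this toy example the checker's output never changes, yet its measured top-3 accuracy doubles once the standard accepts a wider range of diagnoses. That mirrors the pattern reported above in kind only; the numbers carry no empirical meaning.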

The differences in the diagnostic and triage solutions between iterative standards suggest that vignettes are not an ideal tool for benchmarking the accuracy of online symptom checkers. Measured performance will always depend on the nature and order of the diagnostic and triage solutions, which we have shown can differ significantly depending on the approach and level of input from independent physicians. By extension, it is reasonable to propose that the consolidated standard for any vignette can always be improved by including a wider range of medical opinion until saturation is reached and a final consensus emerges.

The inherent limitations of clinical vignettes render them largely unsuitable for benchmarking the performance of popular online symptom checkers, because the diagnosis and triage solutions assigned to each vignette script are liable to change with the deliberations of an independent panel of physicians. Although online symptom checkers are already working at a safe level of probable risk, further work is recommended to cross-validate their performance against real-world test cases, using real patient stories and interactions with GPs rather than artificial vignettes alone. Reliance on artificial vignettes will always be the single most important limitation of any cross-validation study.