It seems there’s more hype about AI than ever before. Just this past week, I shared a link to a study describing a deep learning model that can determine a patient’s race from medical images. Radiologists had deemed this impossible, yet a deep learning model can do it.

Although AI is, in my opinion, a positive addition to medicine, I’ve started to notice studies and people arguing otherwise, or at least encouraging us to take a step back and evaluate. Whether it’s politics or science, it’s always good to hear a contrarian perspective.

The problem with scientific publications

As I pointed out in one of the previous issues, scientific publishing has several problems. Scientists choose research projects that are more likely to be funded, which goes hand in hand with projects that are more likely to be published.

AI in healthcare is the current buzzword (as it is in many other fields). The consequence is that scientific journals are quick to publish anything that mentions “artificial intelligence”, “decision support system”, or “deep learning”.

The result is that papers are easily published for the sake of being published. By applying AI to virtually anything, researchers can appear original, even when the work contributes little of practical use.

Nevertheless, it’s great that we see attempts at applying AI to more medical fields. At the end of the day, the point is to explore what works and what doesn’t. But too often, these ideas stay in medical journals and are never tested in a real-world clinical environment.


Sponsored by Dr Ernesto Gutierez

Masterclass: How to Build a Personal Brand As a Doctor

Dr Ernesto Gutierez is a doctor teaching physicians (and medical students) how to create their online personal brand with his course Masterclass: How to Build a Personal Brand As a Doctor. It normally costs $197, but you can use the code MEDICALNOTES to get it for free.

Get the course for free

Introduce your company/product to 210+ medical students and doctors. Become a sponsor.


What actually matters for medicine

Lauren Oakden-Rayner is a radiologist and medical AI researcher. In 2019, she published a post on her blog explaining what we’re doing wrong in medical AI safety. I’m recapping what I consider the most important points in the following paragraphs.

She argues that we do three specific things wrong:

  1. We assume good experimental performance equals good clinical performance.
  2. We assume good overall performance equals good subtask performance.
  3. We are not very careful with our studies.
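To make point 2 concrete, here’s a toy sketch (all numbers are invented for illustration, not taken from any real study) of how a model can post an impressive overall score while failing completely on a rare but clinically important subtask:

```python
# Hypothetical illustration: overall accuracy can hide poor subtask performance.

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# 95 common cases the model handles well, 5 rare cases it misses entirely.
common_labels = [0] * 95
common_preds  = [0] * 95   # all correct
rare_labels   = [1] * 5
rare_preds    = [0] * 5    # all wrong

overall   = accuracy(common_preds + rare_preds, common_labels + rare_labels)
rare_only = accuracy(rare_preds, rare_labels)

print(f"Overall accuracy: {overall:.0%}")        # → 95%
print(f"Rare-subtask accuracy: {rare_only:.0%}")  # → 0%
```

A headline figure of 95% would look publishable, yet for the patients in the rare subgroup the model is worse than useless.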

One of the most significant points is the distinction between performance testing and clinical outcomes. Performance testing is what we see in clinical research papers. We take a set of patients and decide what to measure to evaluate our model. Then we compare the model’s results to the decisions clinicians made using the same data and patients. Take literally any AI paper, and there’s a pretty good chance this is the case.

The problem is that we control the testing environment, while the actual clinical environment has countless confounding variables. The results are usually different when we apply the same AI model in a real-life clinical setting. And it’s these clinical outcomes that matter to the patients and doctors (potentially) using AI.

Performance is not outcomes.
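As a hedged illustration of that distinction (synthetic data, not from any real study), here’s how a classifier tuned on a controlled test set can degrade once a confounding variable shifts the real-world data:

```python
import random

random.seed(0)  # reproducible synthetic data

def classify(x, threshold=0.5):
    """Toy classifier: predicts disease (1) when a measurement exceeds a threshold."""
    return 1 if x > threshold else 0

def accuracy(xs, ys):
    return sum(classify(x) == y for x, y in zip(xs, ys)) / len(ys)

# Controlled "performance test": healthy (0) and diseased (1) measurements
# are well separated, as in a curated research dataset.
test_xs = [random.gauss(0.30, 0.1) for _ in range(1000)] + \
          [random.gauss(0.70, 0.1) for _ in range(1000)]
test_ys = [0] * 1000 + [1] * 1000

# "Real-world" data: a confounder (say, a different scanner) shifts
# healthy patients' measurements upward.
real_xs = [random.gauss(0.55, 0.1) for _ in range(1000)] + \
          [random.gauss(0.70, 0.1) for _ in range(1000)]
real_ys = [0] * 1000 + [1] * 1000

test_acc = accuracy(test_xs, test_ys)
real_acc = accuracy(real_xs, real_ys)

print(f"Accuracy on the controlled test set: {test_acc:.1%}")
print(f"Accuracy in the 'real world':        {real_acc:.1%}")
```

Nothing about the model changed between the two runs; only the environment did, which is exactly why experimental performance says so little about clinical outcomes.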

Further proof that performance is not outcomes comes from the USA. At one point, the government decided to pay radiologists more if they evaluated screening mammograms using what was then called computer-aided diagnosis (CAD).

Almost immediately, doctors started getting uneasy. In practice, CAD systems would highlight a lot of false positives – areas on the study for the radiologist to review that did not end up being important. It was also variable; if you ran the same study through a CAD system twice, you could get quite different results. To the radiologists, it certainly didn’t appear that these systems were very good, and using them could be frustrating.

After decades of use, the results were mediocre: CAD systems didn’t help detect more cancers, and patients weren’t doing any better. This was, in a sense, the clinical outcome of that specific CAD, and it fell far short of its experimental performance.

There are more examples in the blog post, but I thought these two were the most compelling.

What is being done?

When you read a post like this, you might think medical AI is doomed; read one praising AI, and you’ll think quite the opposite. Nevertheless, doctors and researchers are well aware of the problems with AI, as the most recent publications show.

To improve the quality of papers about AI, new standards for clinical trials involving AI (CONSORT-AI and SPIRIT-AI) were introduced back in September 2020.

But just this past week, on May 18th, a new guideline called DECIDE-AI was published in Nature Medicine, specifically for AI-based decision support systems. The first indications of such a guideline came last February, but it appears we now have the final word on this one.

The guideline aims to increase the transparency of clinical trials involving AI-based decision support systems (DSS). It was produced by a group of experts using a Delphi process. The paper’s main product is a checklist proposing how an AI-based DSS should be clinically evaluated.

💡
A Delphi process is a structured communication method, originally developed as a systematic, interactive forecasting technique that relies on a panel of experts. It’s based on the principle that forecasts (or decisions) from a structured group of individuals are more accurate than those from unstructured groups.

The authors state in the paper that the guidelines are focused “on AI systems supporting, rather than replacing, human intelligence” and evaluating the “AI-based decision support systems during their early, small-scale implementation in live clinical settings”.

This seems like the right direction for DSS clinical trials. As stated above, the main problem with AI algorithms is their application in real-life clinical environments. These guidelines don’t guarantee that this will change, but they encourage evaluating DSS in the early stages and focus on how DSS improve clinicians’ performance.

I’m sure there’s a lot more to be said about this topic, but this is what I found interesting and important. While researching this issue, I also came across a couple of other resources, which I’m sharing below: