A newly published study in JAMA Ophthalmology suggests that a large language model (LLM) artificial intelligence (AI) system may rival, and even exceed, human ophthalmologists in diagnosing and managing patients with glaucoma and retinal disease.
Let’s start with some background.
Investigators noted that the use of LLMs is on the rise, with more being integrated into medical decision-making and patient education.
The potential of LLMs to operate as medical AI catalysts in ophthalmology is no exception. “LLM chatbots have demonstrated encouraging and consistent performances on simulated Ophthalmic Knowledge Assessment Program examination questions,” the study authors wrote.
And in the case of glaucoma and retina?
Previous research has already explored the diagnostic capabilities of LLM chatbots vs three ophthalmology trainees (for glaucoma) and two retina specialists.
While those findings pointed to LLMs’ promise in delivering clinical diagnoses within specific domains, the study authors sought to extend the comparison to real-life clinical case scenarios.
Which brings us to …
Researchers from the New York Eye and Ear Infirmary (NYEE) of Mount Sinai in New York City, New York, compared an LLM chatbot’s responses with those of fellowship-trained glaucoma and retina specialists to explore the potential of LLMs in clinical ophthalmology.
How did they do this?
The investigators conducted a comparative, single-center, cross-sectional study of 15 participants from the Department of Ophthalmology at the Icahn School of Medicine at Mount Sinai:
- 12 board-certified, fellowship-trained subspecialists
  - Eight in glaucoma
  - Four in retina
- Three ophthalmology trainees
  - Two fellows
  - One senior resident
The mean and median practice durations were 11.7 (13.5) years and 6 (19.6) years, respectively.
How were the questions selected?
A total of 20 questions (10 each for glaucoma and retina) were randomly selected from the American Academy of Ophthalmology’s (AAO) list of commonly asked questions.
An additional 20 deidentified glaucoma and retinal patient cases were selected from Mount Sinai-affiliated clinics.
To note, these cases were previously pooled to be balanced in both complexity and diversity.
Now explain the LLM chatbot’s role in this.
Researchers used the May 2023 version of GPT-4 (OpenAI), an advanced LLM released in March 2023, to respond to all 40 questions and cases.
They defined its role in this study to be “a medical assistant, delivering concise answers to emulate an ophthalmologist’s response.”
And how were answers rated/compared?
Accuracy was rated on a 10-point Likert scale, with a score of 1 or 2 indicating a poor or unacceptable response, while medical completeness was rated on a 6-point scale, with a score of 1 or 2 indicating an incomplete response.
Researchers evaluated and compared answers to all questions and cases by GPT-4 and the human ophthalmologists; secondary measurements included rating differences between the trainees and attendings “to assess whether the level of training influenced the perception of the LLM’s responses.”
How was any potential bias minimized?
Questions were presented to participants in a randomized order.
Further, both human ophthalmologists and the LLM chatbot were instructed to “respond in a consistently structured bullet-point format for clarity and coherence.”
So what were the findings?
Overall, the LLM chatbot was found to either match or outperform the human ophthalmologists for both accuracy and completeness of its medical advice and assessments, according to Mount Sinai.
In fact, “AI demonstrated superior performance in response to glaucoma questions and case-management advice,” a news release stated, “while reflecting a more balanced outcome in retina questions, where AI matched humans in accuracy but exceeded them in completeness.”
Talk numbers.
For glaucoma specialists vs GPT-4, the combined question-case mean rank for accuracy was:
- 506.2 (LLM chatbot)
- 403.2 (specialists)
And for completeness:
- 528.3 (LLM chatbot)
- 398.7 (specialists)
For retina specialists vs GPT-4, the mean rank for accuracy was:
- 235.3 (LLM chatbot)
- 215.1 (specialists)
And for completeness:
- 258.3 (LLM chatbot)
- 208.7 (specialists)
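For readers unfamiliar with the statistic, the “mean rank” figures above come from rank-based comparisons (as in a Mann-Whitney-style test): all ratings from both groups are pooled and ranked from lowest to highest, ties receive the average of their ranks, and each group’s ranks are then averaged. A minimal sketch of that computation, using made-up example ratings rather than the study’s data:

```python
# Illustrative sketch (not study data): how mean ranks are computed
# when comparing ratings from two groups, as in rank-based tests.

def mean_ranks(group_a, group_b):
    """Pool both groups, rank all values (ties get the average rank),
    then return the mean rank within each group."""
    pooled = [(v, "a") for v in group_a] + [(v, "b") for v in group_b]
    pooled.sort(key=lambda t: t[0])

    # Assign ranks 1..N, averaging ranks across tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j

    a_ranks = [r for r, (_, g) in zip(ranks, pooled) if g == "a"]
    b_ranks = [r for r, (_, g) in zip(ranks, pooled) if g == "b"]
    return sum(a_ranks) / len(a_ranks), sum(b_ranks) / len(b_ranks)

# Hypothetical accuracy ratings on a 10-point Likert scale:
chatbot = [9, 8, 10, 9]
humans = [7, 8, 6, 9]
print(mean_ranks(chatbot, humans))  # → (5.875, 3.125)
```

A higher mean rank for the chatbot, as in the study’s glaucoma and retina results above, simply means its responses tended to sit higher in the pooled ranking of all ratings.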
Was there a difference between the specialist groups?
Yes, with one exception noted between specialists and trainees in their ratings of the chatbot’s completeness.
However, the study authors stated that the overall comparison of pairs indicated both groups rated the accuracy and completeness of the LLM chatbot “more favorably than those of their specialist counterparts,” with specialists noting a significant difference in the chatbot’s accuracy (z = 3.23; P = .007) and completeness (z = 5.86; P < .001).
And how about versus the chatbot?
Comparing the LLM chatbot to each specialist category, the chatbot’s performance vs retina specialists “showed a more balanced outcome, matching them in accuracy but exceeding them in completeness.”
Any limitations?
The authors noted the single-center, cross-sectional design, which evaluated the LLM’s capabilities at just one time point against one group of specialists; they suggested that a longitudinal, multicenter evaluation with a larger dataset might offer additional information.
Additionally, the chatbot’s “unclear limitations in complex decision-making,” among other factors, were not addressed in the study.
Expert input?
Per Louis R. Pasquale, MD, FARVO, the study’s senior author, the surprising findings of the chatbot’s proficiency in handling patient cases, matching both the accuracy and completeness of human ophthalmologists in a clinical note format, point to a promising future for AI in ophthalmology.
“It could serve as a reliable assistant to eye specialists by providing diagnostic support and potentially easing their workload, especially in complex cases or areas of high patient volume,” according to Dr. Pasquale, deputy chair for Ophthalmology Research for the Department of Ophthalmology.
Lastly… significance?
Although Dr. Pasquale noted that additional testing is needed, integrating AI into standard ophthalmology practice “could result in quicker access to expert advice, coupled with more informed decision-making to guide their treatment,” he stated.