A study recently published in the Canadian Journal of Ophthalmology assessed how accurately ChatGPT version 4.0 answered questions related to the management of common retinal diseases, using American Academy of Ophthalmology (AAO) Preferred Practice Pattern (PPP) guidelines as the reference standard.
Give me some background.
The AAO has issued PPP guidelines for 12 distinct domains of retinal diseases as benchmarks for standard-of-care protocols to optimize patient outcomes.
About ChatGPT: This is a generative artificial intelligence (AI)-based chatbot created by OpenAI that utilizes natural language processing and machine learning to respond to prompts.
- As we previously reported, studies have demonstrated that earlier versions of ChatGPT produced unreliable information about ophthalmic conditions and scored 46% on a test preparation program for ophthalmology board certification.
Bring it back to this study.
Consequently, a research team sought to investigate how proficiently AI recognizes and adheres to established clinical guidelines, with the aim of identifying areas for improvement and advancing AI integration in ophthalmology.
Now talk about the study.
In this cross-sectional survey study, investigators designed 130 questions covering a broad spectrum of topics within the 12 AAO PPP domains of retinal disease, such as:
- Diagnostic criteria
- Treatment guidelines
- Management strategies (including medical and surgical retinal care)
Subsequently, a panel of three retina specialists independently rated each response on a scale of 1 to 5 based on its relevance, accuracy, and adherence to AAO PPP guidelines.
- They also evaluated the readability of responses.
What did they ask ChatGPT?
Each question posed to ChatGPT began with the statement, “I want you to act as an experienced ophthalmologist. Answer the following question using the most up-to-date medical guidelines for retinal specialists.”
Example question: “What is an appropriate list of differential diagnoses for a patient exhibiting signs of nonproliferative diabetic retinopathy (NPDR)?”
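For readers curious how this question framing could be reproduced programmatically, the sketch below shows one way to send the same preamble and example question to a GPT-4-class model via the OpenAI Python SDK. This is an illustration under stated assumptions, not the authors' method: the study queried ChatGPT itself, and the model identifier and helper function here are hypothetical.

```python
# Illustrative sketch only: the study queried ChatGPT directly, not the API.
# The model identifier and helper function are assumptions for demonstration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PREAMBLE = (
    "I want you to act as an experienced ophthalmologist. "
    "Answer the following question using the most up-to-date "
    "medical guidelines for retinal specialists."
)

def ask_retina_question(question: str) -> str:
    """Send one guideline-framed question and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name; the study used ChatGPT 4.0
        messages=[{"role": "user", "content": f"{PREAMBLE} {question}"}],
    )
    return response.choices[0].message.content

print(ask_retina_question(
    "What is an appropriate list of differential diagnoses for a patient "
    "exhibiting signs of nonproliferative diabetic retinopathy (NPDR)?"
))
```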
Findings?
ChatGPT achieved an overall score of 4.9/5, suggesting high alignment with the AAO PPP guidelines.
- Scores varied across domains, with the lowest scores in the surgical management of disease.
Conversely, the following domains received perfect scores from all three raters:
- NPDR
- Proliferative diabetic retinopathy (PDR)
- Posterior vitreous detachment (PVD)
- Retinal vein occlusion (RVO)
The authors noted that this likely reflects the standardized treatment protocols for these conditions.
Anything else?
The responses had a low reading ease score, requiring college- to graduate-level reading comprehension.
The study authors also found that five responses (3.8%) contained incorrect or outdated information, while 13 responses (10%) lacked relevant information that could be useful for clinical decision-making.
Of note, identified errors were related to:
- Diagnostic criteria
- Treatment options
- Methodological procedures
Expert opinion?
“The complexity and variability of surgical conditions pose significant challenges, necessitating advancements that support nuanced, patient-specific considerations,” the study authors noted.
Take home.
These findings suggest that ChatGPT 4.0 shows potential for generating guideline-adherent responses, particularly for common medical retinal diseases.
However, its performance decreased in surgical retina care, highlighting the ongoing need for:
- Clinician input
- Further model refinement
- Improved comprehensibility