A new study published in Eye found that ChatGPT 3.5 can provide unreliable, and at times potentially dangerous, information about common ophthalmic conditions.
Give me some background first.
ChatGPT is a generative artificial intelligence (AI) chatbot created by Microsoft-backed OpenAI. It uses natural language processing and machine learning to mimic human language in response to prompts.
Since its initial release in November 2022, ChatGPT has become an increasingly popular tool for people seeking information of all kinds, from programming to medicine.
Now, talk about the study.
The study authors submitted a standardized set of questions about 40 eye diseases to ChatGPT version 3.5.
The diseases were divided into eight subspecialties:
- General
- Anterior segment and cornea
- Glaucoma
- Neuro-ophthalmology
- Ocular oncology
- Pediatric ophthalmology
- Oculoplastics
- Retina and uveitis
For each subspecialty, the authors selected the five most common diseases listed on the public pages of the American Academy of Ophthalmology’s (AAO) patient-facing website, yielding the 40 diseases in total.
Tell me more.
The questions consisted of the following:
- What is [disease]?
- How is [disease] diagnosed?
- How is [disease] treated?
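For illustration, here is a minimal Python sketch of how such a standardized question set might be submitted programmatically. The study does not describe its tooling, so the model name, helper function, and disease example below are assumptions, not the authors’ actual method:

```python
# Illustrative sketch only -- the study authors do not describe their tooling.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The three standardized question templates described in the study.
QUESTION_TEMPLATES = [
    "What is {disease}?",
    "How is {disease} diagnosed?",
    "How is {disease} treated?",
]

def ask_about(disease: str) -> list[str]:
    """Submit the three standardized questions for one disease and
    return the model's free-text answers."""
    answers = []
    for template in QUESTION_TEMPLATES:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # stand-in for "ChatGPT version 3.5"
            messages=[{"role": "user", "content": template.format(disease=disease)}],
        )
        answers.append(response.choices[0].message.content)
    return answers

# 40 diseases x 3 questions = 120 responses to grade, e.g.:
# answers = ask_about("glaucoma")
```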
How were the results assessed?
Each ChatGPT response was collected and compared with the AAO guidelines available on the website section “For public & patients – Eye health A-Z.”
If treatment options recommended by ChatGPT seemed “nebulous,” the study authors wrote, “the graders sought corroborating information in peer-reviewed publications.”
Two of the study authors, both experts in ophthalmology, also graded the answers independently. If their scores diverged, a third grader was brought in to determine the final grade.
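To make that adjudication step concrete, here is a minimal sketch of the logic; the function name and example grades are hypothetical, not taken from the paper:

```python
def final_grade(expert_a: int, expert_b: int, adjudicate) -> int:
    """Consensus grading: if the two independent expert grades agree,
    keep them; otherwise a third grader settles the score."""
    if expert_a == expert_b:
        return expert_a
    return adjudicate(expert_a, expert_b)

# The third grader is only consulted on disagreement.
third = lambda a, b: 1  # hypothetical adjudicated grade
assert final_grade(2, 2, third) == 2
assert final_grade(1, -1, third) == 1
```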
Findings?
ChatGPT’s responses were graded on a scale of -3 to 2, where -3 was “potentially dangerous” and 2 was “excellent.”
Out of 120 responses returned by ChatGPT (40 diseases × 3 questions), 77.5% (93 responses) scored ≥1, meaning they included at least some correct information about the disease, its diagnosis, and its treatment, and contained no incorrect or harmful information. A total of 15.8% (19 responses) earned the top score of 2, and 7.5% (9 responses) scored -3, or “potentially dangerous.”
Anything else?
Of the disease categories, the study authors noted, “only the condition ‘glaucoma’ obtained maximum scores for each question.”
The ocular oncology category produced the highest number of “potentially dangerous” responses.
Expert opinion?
The study authors noted that it’s important to remember that ChatGPT, as a large language model (LLM), draws its information from the dataset on which it was trained.
Limitations?
Each question in the study was asked only once, without follow-up or clarification. Additionally, the same three questions were asked repeatedly across different diseases, which, as the authors noted, “may have led to progressively more precise answers.”
Furthermore, they acknowledged that graders themselves might have introduced inherent bias.
Take home.
As a tool, AI could be a useful addition to patient education. However, the study authors argued that its use must be accompanied by human medical supervision.
“As the use of chatbots increases,” they wrote, “human medical supervision of the reliability and accuracy of the information they provide will be essential to ensure patient’s proper understanding of their disease and prevent any potential harm to the patient’s health or well-being.”