Findings from a study recently published in The Lancet Digital Health demonstrated the capability of a new platform for comparing the accuracy of multiple automated retinal image analysis systems (ARIAS) in a real-life diabetic retinopathy (DR) screening program.
Give me some background.
The global prevalence of diabetes is rising, alongside costs and workload associated with screening for diabetic eye disease.
- Moreover: The English NHS Diabetic Eye Screening Programme (DESP) produces over 12 million retinal images each year that require human grading for DR.
The study authors explained that several ARIAS, including artificial intelligence (AI) algorithms, can safely and effectively triage patients into low-, medium-, or high-risk DR categories, which could considerably reduce the number of encounters requiring human grading.
Is this the first study to compare multiple AI screening platforms?
To date, most evaluations of ARIAS have relied on a single system applied to a specific dataset, and the use of ARIAS in large-scale screening remains limited globally.
- However: ARIAS that are trained and tested using retinal images from restricted demographic and ethnic groups could amplify biases in marginalized patient populations.
Meaning: To prevent mistrust and financial disinvestment in AI technologies, algorithmic fairness must be assessed across large, real-life, diverse population data to ensure that systems meet predefined standards before they are deployed in healthcare settings.
Now to this study.
In this large-scale, multi-vendor comparison of licensed ARIAS, investigators utilized 202,886 screening encounters at the North East London DESP that occurred between Jan. 1, 2021 and Dec. 31, 2022 to curate a database of 1.2 million images and sociodemographic and grading data.
Note: Images were manually graded by up to three graders according to a standard national protocol.
And which systems were included in the evaluation?
Eight of 25 invited and potentially eligible CE-marked systems for DR detection from retinal images agreed to participate:
- DRISTi 2.0 (Artelus Ltd.)
- EyeArt v3.0.0 (EyeNuk Inc.)
- Eyecheckup AI (EyeCheckup)
- MONA (MONA.health)
- NEC (NEC Software Solutions)
- OphtAI 2.3 (Evolucare Technologies SAS)
- Remidio (Remidio Innovative Solutions Pvt. Ltd.)
- Retmarker (Retmarker SA)
What about subgroup analyses?
ARIAS performance was evaluated overall and by the following subgroups:
- Age (mean age: 60.5 years)
- Sex (47% female)
- Ethnicity (9% South Asian, 32% white, and 17% Black)
- Index of multiple deprivation (IMD)
Findings from the subgroup analyses were assessed against the reference standard, which was defined as the final human grade in the worst eye for referable DR.
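The worst-eye rule can be illustrated with a minimal sketch. The grade scale and referral threshold below are assumptions for illustration (the English NHS R0–R3 retinopathy scale, with R2 or worse treated as referable and maculopathy ignored for simplicity); the helper names are hypothetical, not the study's actual code.

```python
# Illustrative sketch of a "worst eye" reference label, assuming the
# English NHS R0-R3 retinopathy grades (maculopathy ignored here).
GRADE_ORDER = {"R0": 0, "R1": 1, "R2": 2, "R3": 3}

def worst_eye_grade(left: str, right: str) -> str:
    """Return the more severe of the two per-eye human grades."""
    return max(left, right, key=GRADE_ORDER.__getitem__)

def is_referable(left: str, right: str) -> bool:
    """Referable DR if the worst eye is graded R2 or above (assumed threshold)."""
    return GRADE_ORDER[worst_eye_grade(left, right)] >= GRADE_ORDER["R2"]

print(is_referable("R1", "R2"))  # True: the worst eye is R2
print(is_referable("R0", "R1"))  # False: neither eye reaches R2
```

In the study itself, the final human grade in the worst eye served as this reference standard; the code above only sketches how such a per-patient label is derived from per-eye grades.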
Findings?
Sensitivity across vendors ranged across DR severity as follows:
- Referable DR: 83.7–98.7%
- Moderate-to-severe non-proliferative DR (NPDR) and proliferative DR (PDR): 96.7–99.8%
- PDR: 95.8–99.5%
Plus: Sensitivity was largely consistent for moderate-to-severe NPDR and PDR across all subgroups (listed above) for each ARIAS.
Additionally: ARIAS had lower performance (compared to human graders) for the oldest age group, likely due to poor image quality in older people.
What about performance between DR screening systems?
The two highest-performing AI algorithms were EyeArt v3.0.0 and NEC, with sensitivities for referable DR of 98.2% and 98.7%, respectively.
Conversely: The vendors with the lowest sensitivity for referable DR were Retmarker (83.7%) and MONA (84.2%).
To see a table comparing the eight AI systems, click here!
Anything else to note?
The AI systems took 240 milliseconds to 45 seconds to analyze all images per patient, compared with up to 20 minutes for a trained human.
For mild-to-moderate NPDR with referable maculopathy, sensitivity across vendors ranged from 79.5% to 98.3%, with greater variability across population subgroups.
False positive rates for no observable DR ranged from 4.3% to 61.4% and varied within vendors by 0.5–44 percentage points across population subgroups.
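The sensitivity and false positive rate figures reported above follow the standard confusion-matrix definitions. The sketch below states those definitions in code; the counts in the example are hypothetical and are not taken from the study.

```python
# Standard confusion-matrix definitions used when benchmarking a screening
# system against a reference standard (here, the final human grade).
# The example counts below are hypothetical, for illustration only.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of reference-standard-positive cases the system flags."""
    return true_pos / (true_pos + false_neg)

def false_positive_rate(false_pos: int, true_neg: int) -> float:
    """Fraction of reference-standard-negative cases flagged anyway."""
    return false_pos / (false_pos + true_neg)

# Hypothetical example: 985 of 1,000 referable-DR encounters flagged,
# 430 of 10,000 no-DR encounters incorrectly flagged.
print(f"Sensitivity: {sensitivity(985, 15):.1%}")                    # 98.5%
print(f"False positive rate: {false_positive_rate(430, 9570):.1%}")  # 4.3%
```

A high sensitivity keeps missed referable cases rare, while the false positive rate determines how many unnecessary human grades a triage system generates, which is why the wide 4.3–61.4% spread between vendors matters for workload.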
Expert opinion?
The study authors noted that this new platform represents “a transferable framework for the evaluation of clinical AI, ensuring algorithms meet predefined standards for fairness and trustworthiness before being commissioned.”
- “By focusing on algorithmic fairness, we aim to promote equal opportunities for ARIAS in healthcare services, preventing monopolies and encouraging investment,” they added.
- The goal: Building trust, innovation, and cost-effective advancements.
Take home.
These findings suggest that ARIAS can deliver high sensitivity for medium- and high-risk DR in a real-world screening service, with largely consistent performance across population subgroups.
Meaning: ARIAS could offer a cost-effective way to manage the rising burden of DR screening by safely triaging encounters for human grading, potentially increasing capacity and speeding DR detection.