Can you trust AI for stroke care? Not yet, say scientists



Bar chart showing mean scores (with error bars) for ChatGPT-4o, Claude 3 Sonnet, and Gemini Ultra 1.0 in the recovery stage of stroke across five domains: accuracy (ChatGPT-4o 66.0; Gemini Ultra 1.0 62.7; Claude 3 Sonnet 61.5), hallucination (34–36), specificity & relevance (54–60), empathy & understanding (~61), and actionability (56–60). A score of 60 is the clinical competency threshold. Credit: npj Digital Medicine (2025). DOI: 10.1038/s41746-025-01830-9

Scientists have found that three language-model chatbots—even with advanced prompt-engineering tricks—often give suboptimal guidance across stroke prevention, diagnosis, treatment and recovery, highlighting the need for human oversight to ensure appropriateness and safety. Stroke remains a leading cause of death and disability worldwide, underscoring the urgency for accurate and actionable patient guidance.

In an international study conducted at National Taiwan University and Harvard T.H. Chan School of Public Health, the research team evaluated whether generative AI chatbots—ChatGPT-4o, Claude 3 Sonnet, and Gemini Ultra 1.0—are suitable for providing clinically reliable advice in stroke care. The results are published in the journal npj Digital Medicine.

To ensure clinical relevance, the research team first constructed a typical clinical presentation of a stroke patient across the care continuum. The stroke-related inquiries posed to the AI models were based on the most common patient questions encountered in clinical practice, spanning four stages of stroke care: prevention, early symptom recognition, acute treatment, and rehabilitation. These inquiries were crafted in consultation with clinical experts, reflecting realistic, patient-oriented scenarios.

Each model was tested under three prompting strategies—Zero-Shot Learning (ZSL), Chain-of-Thought (COT), and Talking Out Your Thoughts (TOT)—and four senior stroke specialists, blinded to model and prompt type, were asked to score outputs on accuracy, hallucinations (fewer hallucinations = higher score), specificity, empathy, and actionability. Success was aligned with the 60/100 cutoff of Taiwan's medical-doctor qualification exam, treating any score below this mark as potentially unsafe for independent patient use.
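The pass/fail logic described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the accuracy means are taken from the figure caption (recovery stage), while the function and variable names are placeholders of our own.

```python
# Illustrative sketch of the study's scoring rule: a mean expert score
# below 60/100 (the Taiwan medical-doctor exam pass mark) is treated as
# potentially unsafe for independent patient use.

THRESHOLD = 60.0  # clinical competency cutoff used in the study

# Accuracy means from the figure caption (recovery stage only)
recovery_accuracy = {
    "ChatGPT-4o": 66.0,
    "Gemini Ultra 1.0": 62.7,
    "Claude 3 Sonnet": 61.5,
}

def meets_threshold(score: float, threshold: float = THRESHOLD) -> bool:
    """Return True if a mean score reaches the clinical competency cutoff."""
    return score >= threshold

# All three models clear the cutoff on recovery-stage accuracy,
# but per the article, overall stage averages (48-56) do not.
results = {model: meets_threshold(s) for model, s in recovery_accuracy.items()}
print(results)
```

The point of the threshold design is that it turns five subjective expert ratings into a single binary safety judgment per model-prompt combination.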

Scores averaged between 48 and 56 across all stages—an improvement over earlier reports, but still below the clinical competency threshold. In prevention and rehabilitation scenarios, models occasionally reached or slightly exceeded 60 when paired with TOT prompts, reflecting gains in empathy and clear guidance. ZSL prompts tended to reduce hallucinations more effectively than the other strategies. However, no model-prompt combination passed consistently, and all struggled most with acute treatment questions.

"Existing evidence suggests generative AI has real potential to help close health gaps and ease the shortage of health care workers in underserved and rural areas, especially when specialist access is limited. Our results show that while generative AI is impressive for general health information, it remains unreliable when patients face high-risk medical situations like stroke," says John Tayu Lee, Associate Professor at National Taiwan University and Senior Researcher at the Health Systems Innovation Lab at Harvard T.H. Chan School of Public Health.

"While thoughtful prompts may sharpen chatbot answers, they won't make a general-purpose model doctor-smart overnight. Like mirrors, clear questions yield clear replies," said Vincent Cheng-Sheng Li, second author, National Taiwan University. "However, turning those reflections into safe bedside guidance demands AI–clinician teamwork."

Prof. Rifat Atun, senior author and Professor and Director of the Health Systems Innovation Lab at Harvard University, remarked, "Generative AI holds huge potential for enhancing global health equity, as the GenAI solutions can be disseminated readily for wide application at low cost. But these solutions must be deployed responsibly, with robust governance, rigorous clinical validation, and human oversight to ensure appropriateness and safety."

"Artificial intelligence is transforming health care worldwide. By combining advanced computer science with medical expertise, patient-centered language models can bridge cutting-edge technology with real clinical needs," said Dr. Wei Jou Duh, CEO of NTU AI Research Center. "As AI advances rapidly, newer models may perform differently—but the benchmarks and methods offer a rigorous foundation for evaluating their impact."

More information: John Tayu Lee et al, Evaluation of performance of generative large language models for stroke care, npj Digital Medicine (2025). DOI: 10.1038/s41746-025-01830-9

Journal information: npj Digital Medicine 
