AI Chatbots Raise Concerns After Study Finds Flawed Medical Responses

New research urges caution in health-related use of large language models.

A recently published peer-reviewed study has found that leading AI chatbots—including OpenAI’s GPT-4 and Google DeepMind’s Gemini—produced incorrect or potentially unsafe medical responses when prompted with subtly manipulated questions. The study, which tested multiple AI systems in a controlled environment, highlights growing concerns around the use of large language models (LLMs) in health-related contexts.

In contrast, Claude-3, developed by Anthropic, declined to answer more than half of these prompts, suggesting a more cautious design when faced with sensitive or ambiguous health queries.

While none of the models are formally approved for clinical use, the study’s authors argue that as these systems increasingly appear in consumer-facing health apps and platforms, stronger guardrails and oversight are needed.


When Polite Answers Become a Risk

The researchers used slightly altered phrasing in prompts—designed to mimic real-world user queries—to evaluate how AI systems respond. In several instances, models provided advice that conflicted with current medical guidelines or safety standards.

The issue, the researchers note, is not intentional misinformation but rather that AI models generate plausible-sounding text from probabilistic predictions instead of verified medical knowledge.


Why Regulation Is on the Table

The study calls for regulatory standards that require AI systems used in healthcare settings to reference verified medical knowledge bases, such as SNOMED CT or UMLS, to ensure factual accuracy.
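As a rough illustration of what such a requirement could look like in practice, the sketch below gates a chatbot's draft answer on a verified terminology lookup before it is shown to a user. The `VERIFIED_TERMS` vocabulary and `lookup_verified_term` function are hypothetical stand-ins, not a real SNOMED CT or UMLS API.

```python
# Illustrative sketch only: keep a health answer only if it can be traced to a
# curated medical vocabulary; otherwise fall back to a refusal message.
# VERIFIED_TERMS and lookup_verified_term are hypothetical placeholders for a
# real knowledge base such as SNOMED CT or UMLS.

from dataclasses import dataclass

# Hypothetical subset of a curated medical vocabulary.
VERIFIED_TERMS = {
    "ibuprofen": "Non-steroidal anti-inflammatory drug (NSAID)",
    "hypertension": "Persistently elevated arterial blood pressure",
}


@dataclass
class GroundedAnswer:
    text: str
    citations: list[str]   # terms matched in the verified vocabulary
    safe_to_show: bool      # False -> show a refusal instead


def lookup_verified_term(term: str) -> str | None:
    """Return a verified definition if the term is in the curated vocabulary."""
    return VERIFIED_TERMS.get(term.lower())


def ground_answer(draft_answer: str) -> GroundedAnswer:
    """Flag a draft answer for refusal unless at least one of its medical
    claims can be traced back to the verified vocabulary."""
    citations = [t for t in VERIFIED_TERMS if t in draft_answer.lower()]
    if not citations:
        return GroundedAnswer(
            text="I can't verify this medical information; please consult a clinician.",
            citations=[],
            safe_to_show=False,
        )
    return GroundedAnswer(text=draft_answer, citations=citations, safe_to_show=True)


if __name__ == "__main__":
    draft = "Ibuprofen can help with mild pain, but confirm the dose with a pharmacist."
    result = ground_answer(draft)
    print(result.safe_to_show, result.citations)
```

In a production system the lookup would of course query a licensed terminology service rather than an in-memory dictionary; the point of the sketch is simply that each response is either traceable to a verified source or replaced by a refusal.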

“This research underscores the need for structure and accountability,” said one co-author of the study. “These models are powerful, but without verified sources, they can sound authoritative while being wrong.”

🔗 Anthropic’s Claude-3


Industry Caution and Next Steps

Companies developing AI models have already issued disclaimers that their systems are not substitutes for medical professionals. Still, as adoption increases in healthcare-related tools, the pressure is mounting to build models that recognize limitations and avoid high-risk outputs.

Key recommendations from the study include:

  • Using traceable, validated sources for medical responses.
  • Expanding “refusal training” to prevent speculative answers on health.
  • Developing AI standards specific to healthcare, including oversight mechanisms.

🔗 NIH on Knowledge Graphs in Healthcare


Bottom Line

AI chatbots are becoming more accessible, more conversational—and, sometimes, more wrong. When health is on the line, experts say it’s time to slow down and ensure that helpfulness doesn’t come at the cost of safety.

