OpenAI Releases HealthBench Dataset to Test AI in Health Care

Medically reviewed by Carmen Pope, BPharm. Last updated on May 13, 2025.

By I. Edwards, HealthDay Reporter

TUESDAY, May 13, 2025 — OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI) models answer health care questions.

Experts call it a major step forward, but they also say more work is needed to ensure safety.

The dataset — called HealthBench — is OpenAI's first major independent health care project. It includes 5,000 “realistic health conversations,” each with detailed grading tools to evaluate AI responses, STAT News reported.

“Our mission as OpenAI is to ensure AGI is beneficial to humanity,” Karan Singhal, head of the San Francisco-based company's health AI team, said. AGI is shorthand for artificial general intelligence.

“One part of that is building and deploying technology,” Singhal said. “Another part of it is ensuring that positive applications like health care have a place to flourish and that we do the right work to ensure that the models are safe and reliable in these settings.”

The dataset was created with help from 262 doctors who have worked in 60 countries. They provided more than 57,000 unique criteria to judge how well AI models answer health questions.
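For readers curious how criteria-based grading can work in practice, here is a minimal Python sketch. It is an illustration only, not OpenAI's code: the `Criterion` class, the `score_response` function, the point values and the example criteria are all hypothetical. The scoring rule (points earned divided by the maximum positive points, clipped to the range 0 to 1) is an assumption made for this example and is not described in the article.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One physician-written grading criterion (hypothetical structure)."""
    description: str
    points: int  # positive for desirable behavior, negative for harmful behavior


def score_response(criteria: list[Criterion], met: list[bool]) -> float:
    """Return a 0-to-1 score: points earned over the maximum positive points.

    This scoring rule is assumed for illustration; it is not from the article.
    """
    if len(criteria) != len(met):
        raise ValueError("criteria and met must have the same length")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    # Clip so that penalties cannot push the score below 0.
    return max(0.0, min(1.0, earned / max_points))


# Hypothetical criteria for a single conversation.
criteria = [
    Criterion("Advises seeking emergency care for chest pain", 10),
    Criterion("Asks about the duration of symptoms", 5),
    Criterion("Recommends a specific prescription dose without context", -8),
]

# Which criteria a grader judged the response to have met.
met = [True, False, False]
score = score_response(criteria, met)
print(f"Score: {score:.2f}")
```

In this sketch, a response that meets only the first criterion earns 10 of a possible 15 points. The negative criterion shows how a harmful statement could lower a score without pushing it below zero.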

HealthBench aims to fix a common problem: comparing different AI models fairly.

“What OpenAI has done is they have provided this in a scalable way from a really big, reputable brand that’s going to enable people to use this very easily,” Raj Ratwani, a health AI researcher at MedStar Health, said.

The 5,000 examples in HealthBench were made using synthesized conversations designed by physicians.

“We wanted to balance the benefits of being able to release the data with, of course, the privacy constraints of using realistic data,” Singhal told STAT News.

The dataset also includes a special group of 1,000 hard examples where AI models struggled. OpenAI hopes this group “provides a worthy target for model improvements for months to come,” STAT News reported.

OpenAI tested its own models alongside models from Google, Meta, Anthropic and xAI. OpenAI’s o3 model scored the best, especially in communication quality, STAT News reported.

But models performed poorly in areas like context awareness and completeness, experts said.

Some warned about OpenAI grading its own models.

“In sensitive contexts like healthcare, where we are discussing life and death, that level of opacity is unacceptable,” Hao explained.

Others noted that AI itself was used to grade some of the responses, which could result in errors being overlooked.

It “may hide errors shared by both model and grader,” Girish Nadkarni, head of artificial intelligence and human health at the Icahn School of Medicine at Mount Sinai in New York City, told STAT News.

He and others called for more reviews to ensure models work well in different countries and among different demographics.

“HealthBench improves LLM healthcare evaluation but still needs subgroup analysis and wider human review before it can support safety claims,” Nadkarni said.

Sources

STAT News, May 12, 2025

Disclaimer: Statistical data in medical articles provide general trends and do not pertain to individuals. Individual factors can vary greatly. Always seek personalized medical advice for individual healthcare decisions.

Source: HealthDay

Posted: 2025-05-14 06:00

