Abstract Details

Title Large Language Model Performance in Neurology Board Questions

Topic Education, Research, and Methodology

Presentation(s) S33 - Innovations in Medical Education (1:00 PM-1:12 PM)

Poster/Presentation Number 001

Background

Artificial intelligence (AI) has revolutionized various industries, with notable impacts in areas like autonomous vehicles and image recognition. Despite AI's prevalence, its capability in clinical decision-making remains understudied. ChatGPT, a large language model (LLM) developed by OpenAI, has shown promise in the medical field, even passing exams such as the USMLE. However, how it compares with human test-takers, especially on the neurology board exam, has not been well characterized.

Objective Evaluate the performance of a large language model (LLM) on neurology board exam questions and compare it with that of human test-takers.

Design/Methods This study evaluated the performance of the Generative Pre-trained Transformer version 4 (GPT-4) language model on the NeuroReady®: Board Prep question bank, which is considered representative of the American Board of Psychiatry and Neurology board exam. Four hundred questions were each entered into a separate GPT-4 chat session to avoid memory retention between questions, and the model's accuracy was assessed across various factors using appropriate statistical tests, with statistical significance set at a P-value below 0.05.

Results

With an accuracy rate of 75.0% (N=400, 95% Confidence Interval (CI): 70.5-79.2%), GPT-4 exceeded both the average test-taker score of 69% and the passing score of 70%. The model's accuracy was not associated with question length (Odds Ratio (OR) = 0.999 per one-word increase, 95% CI: 0.993-1.005, P=0.693) but was lower for questions involving images (61.1% versus 78.0%, P=0.003) and those requiring higher levels of thinking (71.7% versus 81.0%, P=0.040). The model's accuracy on each question was positively associated with test-taker performance on that question (OR = 1.56 per 10% increase in test-taker accuracy, 95% CI: 1.37-1.78, P<0.001). GPT-4 excelled in specific neurology subsections, such as neuromuscular disorders, pharmacology, and cognitive and behavior disorders.
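For readers who want to see how an interval of this kind is obtained, the following is a minimal sketch that computes a 95% confidence interval for the overall accuracy. It assumes 300 correct answers out of 400, which is what 75.0% of N=400 implies. The abstract does not state which interval method was used; the exact (Clopper-Pearson) method shown here is an assumption, and the function names are illustrative only.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, iterations=60):
    """Bisection on [0, 1] for f(p) = target, where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion.

    Assumption: the abstract does not name its interval method; this is
    one common choice, not necessarily the one the authors used.
    """
    # Lower bound: the p at which P(X >= k) = alpha / 2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) = alpha / 2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    low, high = clopper_pearson(300, 400)  # 300 correct of 400 (assumed count)
    print(f"accuracy = {300 / 400:.1%}, 95% CI = {low:.1%} to {high:.1%}")
```

Other interval methods (for example Wilson or normal-approximation intervals) give slightly different bounds for the same counts, so small differences from the reported interval would not by themselves indicate an error.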

Conclusions

While AI has immense potential to assist medical education and clinical decision-making, rigorous verification, validation, and physician supervision are necessary to ensure its accuracy and reliability in the complex field of neurology.

Authors/Disclosures
Liqi Shu, MD (Brown Neurology) PRESENTER	Dr. Shu has nothing to disclose.
Daniel Mandel, MD	Dr. Mandel has nothing to disclose.
Oliver Tang	No disclosure on file
Zhaowei Jiang (Brown University)	No disclosure on file
Eric Goldstein, MD (Brown University Warren Alpert School of Medicine)	Dr. Goldstein has nothing to disclose.
Ali Mahta, MD (Brown University)	Dr. Mahta has nothing to disclose.