The remarkable performance of ChatGPT, launched in November 2022, has significantly impacted the field of natural language processing, inspiring the application of large language models as supportive tools in clinical practice and research worldwide. Although GPT-3.5 recently scored high on the United States Medical Licensing Examination, its performance on the medical licensing examinations of other nations, especially non-English-speaking nations, has not been sufficiently evaluated. This study assessed GPT’s performance on the National Medical Licensing Examination (NMLE) in Japan and compared it with the actual minimum passing rate for this exam. In particular, the performances of both the GPT-3.5 and GPT-4 models were considered for the comparative analysis. We initially applied the GPT models with several prompts to 290 questions without image data from the 116th NMLE (held in February 2022 in Japan) to maximize the performance in delivering correct answers and explanations of the questions. Thereafter, we tested the performance of the best GPT model (GPT-4) with optimized prompts on a dataset of 262 questions without images from the latest, 117th NMLE (held in February 2023). The best model with the optimized prompts scored 82.7% on the essential questions and 77.2% on the basic and clinical questions, both exceeding the minimum passing scoring rates of 80.0% and 74.6%, respectively. In an exploratory analysis of 56 incorrect answers from the model, we identified three major factors contributing to the incorrect answers: insufficient medical knowledge, insufficient information on the Japan-specific medical system and guidelines, and mathematical errors. In conclusion, GPT-4 with our optimized prompts achieved the minimum passing scoring rate on the latest, 117th NMLE in Japan.
Beyond their original design, which was not intended for answering examination questions for humans, these artificial intelligence (AI) models can serve as one of the best “sidekicks” for solving problems and addressing unmet needs in the medical and healthcare fields.
ChatGPT’s remarkable performance has inspired the use of large language models as supportive tools in clinical practice and research. Although it scored well on the US Medical Licensing Examination, its effectiveness on the relevant examinations of non-English-speaking countries remains unexplored. This study assessed the performance of the GPT-3.5 and GPT-4 models on Japan’s National Medical Licensing Examination (NMLE). Initially, we used an optimization dataset of 290 questions from the 116th NMLE, and then the GPT-4 model with optimized prompts was tested on 262 questions from the 117th NMLE. The model scored 82.7% for essential and 77.2% for basic and clinical questions, surpassing the minimum passing scoring rates. Incorrect answers were attributed to insufficient medical knowledge, missing Japan-specific medical system information, and mathematical errors. In conclusion, GPT-4 achieved a minimum passing scoring rate and can be considered a valuable tool for fulfilling the needs of the medical and healthcare fields.
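The pass/fail logic described above (per-category scoring rates compared against the NMLE’s minimum passing thresholds of 80.0% for essential and 74.6% for basic/clinical questions) can be sketched in a few lines of Python. The thresholds come from the study; the answer data below is purely illustrative, not the actual exam results, and the function names are our own.

```python
# Sketch of the pass/fail check: each question category must meet its
# minimum passing scoring rate. Thresholds follow the 117th NMLE values
# reported in the study; the correctness data here is illustrative only.

def scoring_rate(results):
    """Fraction of questions answered correctly (results: list of bools)."""
    return sum(results) / len(results)

# Minimum passing scoring rates for the 117th NMLE
THRESHOLDS = {"essential": 0.800, "basic_clinical": 0.746}

def passes(rates, thresholds=THRESHOLDS):
    """True only if every category meets or exceeds its threshold."""
    return all(rates[cat] >= t for cat, t in thresholds.items())

# Illustrative data: True = model answered the question correctly
rates = {
    "essential": scoring_rate([True] * 82 + [False] * 18),       # 0.82
    "basic_clinical": scoring_rate([True] * 77 + [False] * 23),  # 0.77
}
print(passes(rates))  # True: both categories clear their thresholds
```

Note that failing either category alone fails the check, which mirrors how the exam requires both the essential and the basic/clinical thresholds to be met simultaneously.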