      Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan


          Abstract

          The remarkable performance of ChatGPT, launched in November 2022, has significantly impacted the field of natural language processing and inspired the application of large language models as supportive tools in clinical practice and research worldwide. Although GPT-3.5 recently scored high on the United States Medical Licensing Examination, its performance on the medical licensing examinations of other nations, especially non-English-speaking nations, has not been sufficiently evaluated. This study assessed GPT’s performance on the National Medical Licensing Examination (NMLE) in Japan and compared it with the actual minimum passing rate for this exam; both the GPT-3.5 and GPT-4 models were included in the comparative analysis. We first applied the GPT models with several prompts to 290 questions without image data from the 116th NMLE (held in February 2022 in Japan) to maximize performance in delivering correct answers and explanations. We then tested the best-performing model (GPT-4) with the optimized prompts on 262 questions without images from the latest, 117th NMLE (held in February 2023). The best model with the optimized prompts scored 82.7% on the essential questions and 77.2% on the basic and clinical questions, exceeding the minimum passing rates of 80.0% and 74.6%, respectively. An exploratory analysis of the model’s 56 incorrect answers identified three major contributing factors: insufficient medical knowledge, insufficient information on the Japan-specific medical system and guidelines, and mathematical errors. In conclusion, GPT-4 with our optimized prompts achieved the minimum passing score on the latest, 117th NMLE in Japan. Beyond answering examination questions designed for humans, such artificial intelligence (AI) models can serve as capable “sidekicks” for solving problems and addressing unmet needs in the medical and healthcare fields.
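The pass/fail comparison reported in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the study's code: only the two thresholds (80.0% and 74.6%) and the reported rates come from the paper; the function names and data layout are assumptions.

```python
# Minimal sketch of the NMLE pass/fail check described in the abstract.
# Thresholds are the 117th NMLE minimum passing rates quoted in the text;
# function names and argument layout are illustrative, not from the paper.

ESSENTIAL_THRESHOLD = 80.0  # % required on essential questions
CLINICAL_THRESHOLD = 74.6   # % required on basic and clinical questions

def scoring_rate(n_correct: int, n_total: int) -> float:
    """Percentage of questions answered correctly."""
    return 100.0 * n_correct / n_total

def passes_nmle(essential_rate: float, clinical_rate: float) -> bool:
    """True when both section rates meet their minimum passing rates."""
    return (essential_rate >= ESSENTIAL_THRESHOLD
            and clinical_rate >= CLINICAL_THRESHOLD)

# GPT-4's reported section rates: 82.7% essential, 77.2% basic/clinical.
print(passes_nmle(82.7, 77.2))  # → True
```

Note that both thresholds must be met simultaneously: a model that scored 79.9% on the essential questions would fail regardless of its clinical-question rate.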

          Author summary

          ChatGPT’s remarkable performance has inspired the use of large language models as supportive tools in clinical practice and research. Although it scored well on the US Medical Licensing Examination, its effectiveness on the corresponding examinations of non-English-speaking countries remains largely unexplored. This study assessed the performance of the GPT-3.5 and GPT-4 models on Japan’s National Medical Licensing Examination (NMLE). We first used an optimization dataset of 290 questions from the 116th NMLE, and then tested the GPT-4 model with optimized prompts on 262 questions from the 117th NMLE. The model scored 82.7% on the essential questions and 77.2% on the basic and clinical questions, surpassing the minimum passing rates. Incorrect answers were attributed to insufficient medical knowledge, missing information on the Japan-specific medical system, and mathematical errors. In conclusion, GPT-4 achieved the minimum passing score and can be considered a valuable tool for addressing unmet needs in the medical and healthcare fields.

          Related collections

          Most cited references (15)


          Attention Is All You Need

          The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
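The attention mechanism summarized above can be sketched compactly. This is a minimal NumPy illustration of scaled dot-product attention, softmax(QKᵀ/√d_k)V, not the paper's full multi-head implementation; the shapes and random inputs are illustrative.

```python
# Minimal NumPy sketch of the scaled dot-product attention at the core of
# the Transformer: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))  # 3 queries, d_k = 4
K = rng.standard_normal((5, 4))  # 5 keys
V = rng.standard_normal((5, 4))  # 5 values
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # → (3, 4)
```

Each output row is a convex combination of the value rows, with mixing weights determined by how strongly the corresponding query matches each key.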

            Language Models are Few-Shot Learners

            Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
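The few-shot setting described above, with task demonstrations supplied purely as text and no gradient updates, amounts to assembling worked examples into a single prompt. A minimal sketch follows; the template and toy examples are illustrative, not taken from the paper.

```python
# Sketch of few-shot prompting "via text interaction": k worked examples
# are concatenated ahead of the new query, and the model is asked to
# continue the pattern. The Q:/A: template here is an assumption.

def build_few_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, worked examples, and the new query."""
    lines = [instruction, ""]
    for question, answer in demonstrations:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")          # the model is expected to complete this line
    return "\n".join(lines)

demos = [("Unscramble 'lpepa'.", "apple"),
         ("Unscramble 'ananab'.", "banana")]
prompt = build_few_shot_prompt("Unscramble the letters into a word.",
                               demos, "Unscramble 'rgpae'.")
print(prompt.endswith("A:"))  # → True
```

The key point the abstract makes is that the demonstrations live entirely in the prompt text: no weights change between tasks, so the same frozen model handles translation, cloze tasks, and arithmetic by swapping the examples.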

              Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

              We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.

                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Data curation, Methodology, Visualization, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Data curation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Data curation, Writing – review & editing
                Roles: Data curation, Writing – review & editing
                Roles: Data curation, Writing – review & editing
                Roles: Data curation, Writing – review & editing
                Roles: Data curation, Writing – review & editing
                Roles: Data curation, Writing – review & editing
                Roles: Validation, Writing – review & editing
                Roles: Validation, Writing – review & editing
                Roles: Supervision, Writing – review & editing
                Roles: Supervision, Writing – review & editing
                Roles: Conceptualization, Methodology, Supervision, Writing – review & editing
                Roles: Conceptualization, Methodology, Project administration, Validation, Visualization, Writing – original draft, Writing – review & editing
                Role: Editor
                Journal
                PLOS Digital Health
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 2767-3170
                23 January 2024
                January 2024
                Volume: 3
                Issue: 1
                e0000433
                Affiliations
                [1] School of Medicine, Kanazawa University, Kanazawa, Japan
                [2] Department of Health Promotion and Medicine of the Future, Kanazawa University Graduate School of Medicine, Kanazawa, Japan
                [3] Department of Molecular and Cellular Pathology, Graduate School of Medical Sciences, Kanazawa University, Kanazawa, Japan
                [4] Graduate School of Media and Governance, Keio University, Fujisawa, Japan
                [5] Advanced Research Center for Human Sciences, Waseda University, Saitama, Japan
                [6] Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Tochigi, Japan
                [7] Department of Cardiovascular Medicine, Kanazawa University Graduate School of Medical Sciences, Kanazawa, Japan
                [8] College of Transdisciplinary Sciences for Innovation, Kanazawa University, Kanazawa, Japan
                [9] MICIN, Inc., Tokyo, Japan
                [10] Frontier Institute for Tourism Science, Kanazawa University, Kanazawa, Japan
                [11] Department of Biomedical Informatics, CureApp Institute, Karuizawa, Japan
                Mayo Clinic, Arizona, United States
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0003-4101-9886
                https://orcid.org/0000-0002-8743-9184
                https://orcid.org/0000-0003-0027-2551
                https://orcid.org/0009-0008-2445-1195
                https://orcid.org/0000-0001-6281-2782
                https://orcid.org/0000-0003-0461-3018
                https://orcid.org/0000-0002-1540-4417
                https://orcid.org/0000-0002-0921-158X
                https://orcid.org/0000-0001-6647-8240
                Article
                PDIG-D-23-00146
                DOI: 10.1371/journal.pdig.0000433
                PMCID: 10805303
                PMID: 38261580
                0e62a590-f375-48b5-b648-508dfac11686
                © 2024 Tanaka et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 18 April 2023
                Accepted: 19 December 2023
                Page count
                Figures: 4, Tables: 2, Pages: 16
                Funding
                The author(s) received no specific funding for this work.
                Categories
                Research Article
                People and Places
                Geographical Locations
                Asia
                Japan
                Biology and Life Sciences
                Neuroscience
                Cognitive Science
                Cognitive Psychology
                Language
                Biology and Life Sciences
                Psychology
                Cognitive Psychology
                Language
                Social Sciences
                Psychology
                Cognitive Psychology
                Language
                Medicine and Health Sciences
                Clinical Medicine
                Computer and Information Sciences
                Artificial Intelligence
                Medicine and Health Sciences
                Medicine and Health Sciences
                Critical Care and Emergency Medicine
                Medicine and Health Sciences
                Pediatrics
                Computer and Information Sciences
                Information Technology
                Natural Language Processing
                Custom metadata
                The whole input questions and answers from the model for the 117th NMLE in Japan are listed in the Supplemental Data. The codes used in this study are accessible via GitHub ( https://github.com/yudaitanaka1026/ChatGPT_NMLE_Japan).
