
      Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias

          Significance

          The synthetic language generated by recent Large Language Models (LMs) strongly resembles the natural languages of humans. This resemblance has given rise to claims that LMs can serve as the basis of a theory of human language. Given the absence of transparency as to what drives the performance of LMs, the characteristics of their language competence remain vague. Through systematic testing, we demonstrate that LMs perform nearly at chance in some language judgment tasks, while revealing a stark absence of response stability and a bias toward yes-responses. Our results raise the question of how knowledge of language in LMs is engineered to have specific characteristics that are absent from human performance.
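
          The two effects highlighted here, response instability and a yes-bias, are straightforward to quantify. The sketch below is not the authors' analysis code; it only illustrates, on made-up data, one way to score the ten repeated yes/no judgments elicited per sentence: stability as agreement with an item's majority response, and bias as the raw proportion of yes-answers.

```python
from collections import Counter

def stability(responses):
    """Fraction of repetitions that agree with the item's majority response.

    1.0 means perfectly stable answers; values near 0.5 (for binary
    responses) mean the model flip-flops across repetitions.
    """
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)

def yes_rate(responses):
    """Proportion of 'yes' answers, regardless of accuracy.

    A rate well above 0.5 on a balanced grammatical/ungrammatical
    item set indicates a yes-response bias.
    """
    return responses.count("yes") / len(responses)

# Illustrative data: 10 repeated judgments for one ungrammatical item.
reps = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "no"]
print(f"stability = {stability(reps):.2f}")  # 0.70
print(f"yes rate  = {yes_rate(reps):.2f}")   # 0.70
```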

          Abstract

          Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition helps the models converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns stands in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.
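
          To make the design concrete, here is a minimal sketch of the elicitation loop the abstract describes. The item lists and query_model (a placeholder standing in for a real API call) are illustrative assumptions, not the authors' harness; only the counts, 8 phenomena × 10 sentences × 10 repetitions = 800 judgments per model, come from the paper.

```python
import random

# The eight phenomena tested in the paper.
PHENOMENA = [
    "plural attraction", "anaphora", "center embedding", "comparatives",
    "intrusive resumption", "negative polarity items",
    "order of adjectives", "order of adverbs",
]
SENTENCES_PER_PHENOMENON = 10  # 5 grammatical + 5 ungrammatical
REPETITIONS = 10               # each sentence is judged 10 times

def query_model(model, sentence):
    """Hypothetical stand-in for the real API call that returns a yes/no
    grammaticality judgment; replace with an actual request."""
    return random.choice(["yes", "no"])

def run_session(model, items):
    """items: (sentence, is_grammatical) pairs, 10 per phenomenon.
    Repeats each item REPETITIONS times in random order, as in the paper,
    and returns one (sentence, is_grammatical, answer) row per trial."""
    trials = [item for item in items for _ in range(REPETITIONS)]
    random.shuffle(trials)
    return [(s, g, query_model(model, s)) for (s, g) in trials]

# Design arithmetic: 8 phenomena x 10 sentences x 10 repetitions.
judgments_per_model = len(PHENOMENA) * SENTENCES_PER_PHENOMENON * REPETITIONS
print(judgments_per_model)      # 800
print(3 * judgments_per_model)  # 2,400 across the three tested LMs
```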


                Author and article information

                Journal
                Proceedings of the National Academy of Sciences of the United States of America (Proc Natl Acad Sci U S A; PNAS)
                Publisher: National Academy of Sciences
                ISSN: 0027-8424 (print); 1091-6490 (electronic)
                Published online: 13 December 2023; issue date: 19 December 2023
                Volume 120, Issue 51, e2309583120
                Affiliations
                [1] aDepartament d'Estudis Anglesos i Alemanys, Universitat Rovira i Virgili , Tarragona 43002, Spain
                [2] bInstitut für Psychologie, Humboldt-Universitat zu Berlin , Berlin 10099, Germany
                [3] cDepartament de Filologia Catalana, Universitat Autònoma de Barcelona , Barcelona 08193, Spain
                [4] dInstitució Catalana de Recerca i Estudis Avançats (ICREA) , Barcelona 08010, Spain
                Author notes
                1 To whom correspondence may be addressed. Email: vittoria.dentella@urv.cat.

                Edited by Susan Goldin-Meadow, University of Chicago, Chicago, IL; received June 7, 2023; accepted October 28, 2023

                Author information
                ORCID: https://orcid.org/0000-0001-6697-9184
                ORCID: https://orcid.org/0000-0002-9205-6786
                ORCID: https://orcid.org/0000-0003-3181-1917
                Article
                DOI: 10.1073/pnas.2309583120
                PMCID: PMC10743380
                PMID: 38091290
                Copyright © 2023 the Author(s). Published by PNAS.

                This open access article is distributed under the Creative Commons Attribution License 4.0 (CC BY).

                History
                Received: 07 June 2023
                Accepted: 28 October 2023
                Page count
                Pages: 10; Words: 7,461
                Funding
                Horizon Europe 2020 | Marie Sklodowska-Curie Actions; Award ID: 945413; Recipients: Vittoria Dentella, Evelina Leivada
                Universitat Rovira i Virgili (FundRef 501100007512); Award ID: 945413; Recipient: Vittoria Dentella
                Deutsche Forschungsgemeinschaft (DFG; FundRef 501100001659); Award ID: 459717703; Recipient: Fritz Günther
                Spanish Ministry of Science and Innovation; Award ID: PID2021-124399NA-I00; Recipients: Vittoria Dentella, Evelina Leivada
                Categories
                Research Article
                Social Sciences: Psychological and Cognitive Sciences

                Keywords
                language models, cognitive models, bias, language
