
      MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

      Preprint


          Abstract

          To solve complex tasks, large language models (LLMs) often require multiple rounds of interaction with the user, sometimes assisted by external tools. However, current evaluation paradigms often focus solely on benchmark performance with single-turn exchanges, neglecting the intricate interactions among the user, LLMs, and external tools, and creating a discrepancy between benchmark evaluation and real-world use cases. We introduce the MINT benchmark to evaluate LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework in which LLMs can access tools by executing Python code and receive natural language feedback from a user simulated with GPT-4. We repurpose a diverse set of established datasets and tasks focusing on reasoning, coding, and decision-making, and carefully curate them into a compact subset of instances for efficient evaluation. Our analysis of 20 open- and closed-source LLMs yields intriguing findings. (1) LLMs generally benefit from tool interaction and language feedback, with absolute performance gains of 1--8% per additional turn with tool use and 2--17% with natural language feedback. (2) Better single-turn performance does not guarantee better multi-turn performance. (3) Surprisingly, among the LLMs we evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We hope MINT can help measure progress and incentivize research on improving LLMs' capabilities in multi-turn interactions, especially for open-source communities, where multi-turn human evaluation has been less accessible than for commercial LLMs with a larger user base.
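          The interaction protocol described in the abstract (the model emits Python code as a tool call, a harness executes it, and a GPT-4-simulated user replies with natural language feedback) can be sketched as a simple loop. The Python sketch below is illustrative only, not the MINT implementation: the helpers run_python, llm, feedback_llm, and is_solved are hypothetical stand-ins, the <execute> tag is an assumed convention for marking tool calls, and the actual framework is available at the repository linked below.

          import io
          import contextlib
          from typing import Callable

          def run_python(code: str) -> str:
              # Execute a Python snippet and capture stdout as the tool observation.
              # A real harness would sandbox this; bare exec() is for illustration only.
              buf = io.StringIO()
              try:
                  with contextlib.redirect_stdout(buf):
                      exec(code, {})
              except Exception as exc:
                  return f"Execution error: {exc!r}"
              return buf.getvalue()

          def evaluate_instance(task_prompt: str,
                                llm: Callable[[list], str],           # model under evaluation
                                feedback_llm: Callable[[list], str],  # GPT-4-simulated user
                                is_solved: Callable[[str], bool],     # task-specific success check
                                max_turns: int = 5) -> bool:
              # Drive one multi-turn episode: each turn the model may emit Python
              # code (wrapped in <execute>...</execute>, an assumed convention
              # here), the harness runs it, and a simulated user replies in
              # natural language.
              messages = [{"role": "user", "content": task_prompt}]
              for _ in range(max_turns):
                  response = llm(messages)
                  messages.append({"role": "assistant", "content": response})
                  if is_solved(response):
                      return True
                  observation = ""
                  if "<execute>" in response:
                      code = response.split("<execute>")[1].split("</execute>")[0]
                      observation = run_python(code)
                  feedback = feedback_llm(messages)  # natural language feedback
                  messages.append({"role": "user",
                                   "content": (observation + "\n" + feedback).strip()})
              return False

          Feeding the raw execution output and the simulated user's comment back as a single user turn keeps the transcript in the standard chat format most LLM APIs expect, which is one plausible way to combine tool observations with language feedback.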


          Author and article information

          Date: 19 September 2023
          arXiv ID: 2309.10691
          Record ID: 1a3cd15b-b873-4e85-8d81-e2385ffafdcc

          License: CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/)

          Custom metadata: Code will be available at https://xingyaoww.github.io/mint-bench
          Subject classes: cs.CL, cs.AI, cs.LG

          Keywords: Theoretical computer science, Artificial intelligence
