
      Chinese RoBERTa Distillation For Emotion Classification

      The Computer Journal, Oxford University Press (OUP)


          Abstract

          Through knowledge distillation, a student model can imitate the output of a teacher model to improve its generalization ability without increasing its computational complexity. However, in existing knowledge distillation research, the efficiency of knowledge transfer is still not satisfactory, especially when distilling from pre-trained language models (PTMs) such as the Robustly optimized BERT approach (RoBERTa) into a student model with a different architecture. To address this issue, this paper proposes a prediction framework (RTLSTM) for Chinese emotion classification based on knowledge distillation. In RTLSTM, a new triple loss strategy is proposed for training a student BiLSTM, combining supervised learning, distillation and word vector losses. This strategy enables the student to learn more fully from the teacher model RoBERTa and to retain 99% of the teacher's language understanding capability. We carried out emotion classification experiments on five Chinese datasets to compare RTLSTM with baseline models. The results show that RTLSTM outperforms the RNN-group baselines in prediction performance with a similar number of parameters. Moreover, RTLSTM is superior to the PTM-group baselines in that it uses 92% fewer parameters and 83% less prediction time while achieving comparable prediction performance.
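
          As a rough illustration of the triple loss strategy described in the abstract, the sketch below combines a supervised cross-entropy term, a soft-label distillation term and a word-vector term. The function name, the loss weights, the temperature and the MSE form of the word-vector term are assumptions made for illustration; the paper's own definitions may differ.

              import torch.nn.functional as F

              def triple_loss(student_logits, teacher_logits, labels,
                              student_word_vecs, teacher_word_vecs,
                              temperature=2.0, alpha=0.4, beta=0.4, gamma=0.2):
                  # (1) Supervised loss against the gold emotion labels.
                  ce = F.cross_entropy(student_logits, labels)
                  # (2) Distillation loss: the student mimics the teacher's softened predictions.
                  kd = F.kl_div(
                      F.log_softmax(student_logits / temperature, dim=-1),
                      F.softmax(teacher_logits / temperature, dim=-1),
                      reduction="batchmean",
                  ) * (temperature ** 2)
                  # (3) Word-vector loss: pull the student's token representations toward
                  #     the teacher's (an MSE term is used here purely for illustration).
                  wv = F.mse_loss(student_word_vecs, teacher_word_vecs)
                  return alpha * ce + beta * kd + gamma * wv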


          Most cited references (30)


          Long Short-Term Memory

            Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade-correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
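
            The minimal single-step sketch below (not taken from the paper) writes out the multiplicative gates and the additive cell-state update, i.e. the constant error carousel, described above; tensor shapes and parameter handling are illustrative only.

                import torch

                def lstm_step(x, h_prev, c_prev, W, U, b):
                    # x: input vector; h_prev, c_prev: previous hidden and cell states.
                    # W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)
                    gates = W @ x + U @ h_prev + b
                    i, f, o, g = gates.chunk(4)
                    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # multiplicative gates
                    g = torch.tanh(g)               # candidate cell input
                    c = f * c_prev + i * g          # additive cell-state update (constant error carousel)
                    h = o * torch.tanh(c)           # output gate controls what the cell exposes
                    return h, c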

            Convolutional Neural Networks for Sentence Classification

            Yoon Kim (2014)

              RoBERTa: A Robustly Optimized BERT Pretraining Approach

              Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
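
              For readers who want to experiment with a released RoBERTa checkpoint, one common route today is the Hugging Face transformers library; this tooling and the "roberta-base" checkpoint name are outside the paper itself, which published its models and code separately.

                  from transformers import RobertaTokenizer, RobertaModel

                  tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
                  model = RobertaModel.from_pretrained("roberta-base")

                  inputs = tokenizer("Knowledge distillation transfers the teacher's behaviour.",
                                     return_tensors="pt")
                  outputs = model(**inputs)
                  print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)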

                Author and article information

                Journal: The Computer Journal
                Publisher: Oxford University Press (OUP)
                ISSN: 0010-4620 (print); 1460-2067 (online)
                Dates: December 2023; December 14 2023; November 10 2022
                Volume 66, Issue 12, pp. 3107-3118
                DOI: 10.1093/comjnl/bxac153
                © 2022
                Publication model: https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model
