Introduction
Large language models already perform successfully in supportive roles in economics research, for instance when tasked with proposing titles and writing abstracts for academic papers (Ash and Hansen, 2023). As they continue to improve, we must face the intriguing possibility of casting them in leading roles. In this paper, OpenAI’s GPT-4 model auditions for such a role, as it is asked to enhance standard instructional texts in economics through judicious use of humor.
This paper investigates whether entirely autonomously generated computational humor can impact learning experience. (Very) large language models have undergone enormous improvements recently and are able to perform high-level actions with such accuracy that GPT-3 has earned an IQ of 150 (Ray, 2023). Humor, which is infinitely versatile by nature, seems an adequate test, heightened by the challenges of an educational context.
We built a pipeline that prompts the model to produce adequate instructional humor to enhance three instructional texts in economics. The instructional texts summarize the contributions of three Nobel prize-winning theorists: Oliver Hart, Bengt Holmström and Paul Milgrom. The reason for this choice is that all three had agreed to participate in a comedian-presented (and therefore more light-hearted than usual) panel discussion which the authors of the study helped organize at the bicentennial meeting of the European Finance Association in August 2023 in Amsterdam. By focusing on these particular laureates, we test the effect of seeing short clips of the laureates in a humorous context on the understanding of their work.
We tested the pedagogical efficacy of the GPT-4-generated humor in a sample of 52 undergraduate students and found that simply exposing students to the version of the instructional texts enhanced by computational humor is not sufficient to induce a significant difference in performance. What matters is that the attempt at humor be successful: only the students who find the ‘humorous’ text actually ‘funny’ have significantly better results. However, these encouraging findings are driven by the sense of humor and performance of a subset of only seven out of the 26 respondents who were exposed to the humorous version of the instructional texts. Thus, we acknowledge the small-sample limitations of our empirical exercise, and welcome further scrutiny of this new and potentially very rewarding field of research, which enlists the high-level capabilities of large language models in the service of economics education.
Theoretical considerations
Humor in education
Numerous studies have attempted to determine whether humor enhances learning, with mixed results. On the one hand, instructional humor processing theory (IHPT) proposes that instructor humor increases recall and learning, provided that (a) the humor is relevant to the instructional contents; (b) students actually perceive it as funny; and (c) students are motivated and able to process the instructional message (Wanzer et al., 2010). On the other hand, several studies have exposed students to either standard or humorous examples and found that the students in the latter group perform worse. For example, Bolkan et al. (2018) conclude that, when humor is integrated in the instructional lessons, it competes for student learning with the concepts being taught. In their view, contiguous humor (not linked directly to the content) may provide experiential and motivational benefits with less risk for learning outcomes. However, Bolkan et al. (2018) do not test whether students who perceived the humorous examples as funny performed worse or better than those who did not or than those in the control group.
Given the importance of the content and quality of the humor involved, it is surprising how little of the research on instructional humor is based on humor generated by the authors (rather than on polling students about their experience in the classroom). Of the handful of studies that do generate their own humorous material, most are vague on how this was done (e.g., Celik and Gungdogdu, 2016; Buttussi and Chittaro, 2020; Erdogdu and Cakiroglu, 2021), which not only impedes replicability, but offers little practical help to instructors.
Hypotheses
Given the challenges of generating adequate and effective humor for the classroom and the replicability issues already mentioned, this paper opens the way towards investigating whether and how large language models can help. While artificial intelligence is already an established feature in higher education for such tasks as automatic question generation and grading, as well as intelligent tutoring systems designed to provide individualized feedback to students (see Crompton and Burke, 2023, for a review), to our knowledge, its ability to produce effective instructional humor has not yet been put to the test. We hypothesize that:
large language models produce adequate instructional humor, and that
instructional humor enhances learning.
We test these hypotheses in a pilot experiment.
Empirical analysis
Computational humor
Large language models can be leveraged to act as agents following predefined objectives. In this study, we use OpenAI’s GPT-4 model to create an entirely autonomous method for generating instructional humor. The model has been tasked to integrate humor into the content through a metaphor, anecdote or quip, while at the same time avoiding humor-induced ambiguity. To mitigate potential errors while preserving the autonomy of the process, we draw on the methodological contributions of Shinn et al. (2023) and Nair et al. (2023) and add an iterative ‘self-reflection’ feature to the pipeline: the model will self-evaluate and improve through several iterations. The original (input) text passes through three transformative steps before the final output is produced:
generator – enhances the text with instructional humor
evaluator – lists the pros and cons of the generated text, and
decider – selects the best option.
The generator and evaluator functions have been assigned different roles: university professor in economics with experience and achievements in humorous teaching, researcher in educational humor and comedian with a background in economics. The decider function plays the role of a professor experienced in the evaluation of educational humor. To illustrate the validity of the process, we provide the following excerpt from the input text:
In his analysis on how a CEO’s contract should be formulated, Holmström proposed the ‘multi-tasking model’, which acknowledges the complexity of a CEO’s role and the various tasks they need to perform.
This excerpt, enhanced by computational humor, becomes:
When it comes to our multitasking maestros, the CEOs, Holmström came up with the aptly named ‘multi-tasking model’. This model acknowledges that a CEO’s role is as complex as a Rubik’s cube, with various tasks that need to be tackled effectively.
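The generator, evaluator and decider steps described above can be sketched in code. This is a minimal illustration rather than the authors' implementation: the prompt wording, the number of self-reflection iterations, the way roles are paired with steps, and the `llm` callable (any function mapping a role and a prompt to a reply) are all assumptions, and `stub_llm` is a hypothetical stand-in so that the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Roles are taken from the text; everything else below is an assumption.
ROLES = [
    "university professor in economics experienced in humorous teaching",
    "researcher in educational humor",
    "comedian with a background in economics",
]
DECIDER_ROLE = "professor experienced in the evaluation of educational humor"

LLM = Callable[[str, str], str]  # (role, prompt) -> model reply


@dataclass
class Candidate:
    role: str
    text: str
    critique: str = ""


def generate(llm: LLM, role: str, source: str, feedback: str = "") -> str:
    """Generator step: enhance the source text with instructional humor."""
    prompt = ("Enhance this text with a metaphor, anecdote or quip, "
              f"avoiding humor-induced ambiguity:\n{source}")
    if feedback:
        prompt += f"\nAddress this earlier critique:\n{feedback}"
    return llm(role, prompt)


def evaluate(llm: LLM, role: str, text: str) -> str:
    """Evaluator step: list the pros and cons of a generated text."""
    return llm(role, f"List the pros and cons of this text:\n{text}")


def decide(llm: LLM, candidates: List[Candidate]) -> Candidate:
    """Decider step: pick the best candidate (falls back to the first)."""
    listing = "\n".join(f"[{i}] {c.text}\nCritique: {c.critique}"
                        for i, c in enumerate(candidates))
    reply = llm(DECIDER_ROLE,
                f"Reply with the index of the best option only:\n{listing}")
    digits = "".join(ch for ch in reply if ch.isdigit())
    index = int(digits) if digits and int(digits) < len(candidates) else 0
    return candidates[index]


def run_pipeline(llm: LLM, source: str, iterations: int = 2) -> Candidate:
    """Generate one candidate per role, refine it by self-reflection, decide."""
    candidates = []
    for role in ROLES:
        cand = Candidate(role, generate(llm, role, source))
        for _ in range(iterations):
            cand.critique = evaluate(llm, role, cand.text)
            cand.text = generate(llm, role, source, cand.critique)
        cand.critique = evaluate(llm, role, cand.text)  # critique of final text
        candidates.append(cand)
    return decide(llm, candidates)


def stub_llm(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a real model call, used only for testing."""
    if prompt.startswith("Reply with the index"):
        return "1"
    if prompt.startswith("List the pros"):
        return "pro: relevant to the content; con: slightly long"
    return f"[{role}] humorous version"
```

Passing the model in as a plain callable keeps the control flow independent of any particular API client, which is convenient when replicating the procedure with a different model.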
We find that GPT-4 is capable of producing adequate instructional humor, in line with our first hypothesis. Although only seven out of our 26 respondents who received the humorous versions of the instructional texts rated the GPT-4 humor as actually funny, they are also the ones whose performance was better. Given the vast differences in humor appreciation across individuals (see Warren et al., 2020, for a comprehensive review), we make all the instructional texts available so that readers can judge GPT-4’s comic aptitude for themselves.
Experimental design
Some 52 respondents aged 18 to 35 and currently enrolled in undergraduate programs were recruited through the Prolific survey platform. The participants were screened in order to achieve equal gender distribution and then randomly assigned to one of two groups: the control group, which received three original instructional texts (input to the GPT-4 pipeline), and the test group, which received the GPT-4 output texts, enhanced by computational humor. The texts (approximately 800 words long) were followed by a quiz consisting of 17 multiple-choice questions and a short survey designed to elicit the respondents’ perceptions (on a five-point scale), along several dimensions:
familiarity with economics – (1) ‘not at all familiar’ to (5) ‘extremely familiar’
attention – (1) ‘did not capture my attention’ to (5) ‘did capture my attention’
excitement – (1) ‘boring’ to (5) ‘exciting’
interest – (1) ‘not at all interesting’ to (5) ‘very interesting’
humor – (1) ‘serious’ to (5) ‘humorous’
fun – (1) ‘not funny’ to (5) ‘funny’.
Results
Respondents who found the texts ‘interesting’ tend to get better quiz results, with the pairwise correlation coefficient between the two variables of 0.65 (see Table 1). Familiarity with economics and the ability of the text to capture the reader’s attention also correlate positively with the quiz results (yet moderately, with correlation coefficients of 0.36 and 0.34, respectively).
Table 1 reports the pairwise correlations for our variables of interest – quiz results, familiarity with economics, attention, excitement, interest, humor and fun – for the full sample of 52 respondents. Statistical significance is denoted by *** (at 1%), ** (at 5%) and * (at 10%).
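For readers who wish to reproduce this kind of table, the sketch below computes a pairwise (Pearson) correlation and the usual t-statistic for testing whether it differs from zero. The paper does not state which significance test was used, so the t-statistic is an assumption, and the function names are illustrative.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("samples must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_stat(r: float, n: int) -> float:
    """t-statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# The reported correlation of 0.65 between 'interest' and quiz results,
# with the full sample of 52 respondents:
t_interest = correlation_t_stat(0.65, 52)
```

Under this assumed test, a coefficient of 0.65 with 52 observations gives a t-statistic of roughly 6, well beyond conventional critical values.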
A text that is perceived as ‘exciting’ appears to be more successful in capturing attention (with a full-sample correlation coefficient of 0.75 between the two variables) than a text that is labeled ‘interesting’ (where the correlation with attention is only 0.36). This difference suggests that the personal (potentially more emotionally charged) endorsement of ‘exciting’ carries more weight than the more detached (intellectual) characterization of ‘interesting’. A similar distinction in terms of the respondents’ personal experience may be inferred from the assignment of a ‘funny’ versus ‘humorous’ label to the text (labels which, just like the ‘excitement’ and ‘interest’ variables, correlate strongly, but not overwhelmingly, at 0.65). An unsuccessful attempt to amuse may still be recognized as ‘humorous’ even when it falls short of ‘funny’, as suggested by the fact that the average ratings are higher for the ‘humor’ variable than for the ‘fun’ variable within each of the groups (see Table 2). In the same vein, five of the 26 respondents in the control group (who received the input text) gave a high rating (3 or above) for the ‘humorous’ variable, while, unsurprisingly, none found it ‘funny’ (all the ratings for the ‘fun’ variable are either 1 or 2) (see Table 3). Notably, we do not find any evidence of a detrimental effect of instructional humor on learning, as the difference between the average quiz results of the control and test groups (78.51 and 75.79) is small and statistically insignificant (with a p-value of 0.60).
Table 2 reports the number of observations and average values for our variables of interest – quiz results, familiarity with economics, attention, excitement, interest, humor and fun – for the full sample as well as for the control and test groups. P-values for tests of significance for the difference in the means of the variables for the control versus test groups are reported both for the standard t-test (assuming equal population variances) and the Satterthwaite-Welch t-test, which allows unequal variances in the two populations. Statistical significance is denoted by *** (at 1%), ** (at 5%) and * (at 10%).
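The two tests named in the caption differ only in how they estimate the standard error of the difference in means. The sketch below, which uses only the Python standard library, computes both test statistics and their degrees of freedom; the function names are illustrative, and turning the statistics into p-values would additionally require the CDF of the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Standard two-sample t-test (equal variances): returns (t, df)."""
    na, nb = len(a), len(b)
    # Pooled sample variance, weighting each group by its degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, float(na + nb - 2)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Satterthwaite-Welch t-test (unequal variances): returns (t, df)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

With equal group sizes, as in the two groups of 26 respondents here, the two t-statistics coincide and only the degrees of freedom differ, so the two p-values will typically be close.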
Table 2 shows that the average rating for ‘fun’ is the lowest of all the variables measured, at only 1.96. Yet it seems there is potential in successfully striking the ‘fun’ chord, as the average quiz results of the seven respondents who gave high marks (3 or above) for the ‘fun’ component are the highest of all (87.39), slightly exceeding the results of the respondents who were very familiar with economics (86.93 in the control group and 86.63 in the test group). It is plausible that familiarity with economics plays a role in humor appreciation as well as in performing well in the quiz. However, we note that the seven respondents who gave GPT-4 high marks for humor have, on average, lower familiarity with economics (3.14) than the respondents in the test group who are familiar with economics (3.63), yet slightly higher quiz performance. Given the size of the sample, results should be interpreted with caution, but the fact that the respondents from the test group who declare themselves not amused get significantly lower quiz results, with an average of 71.52 (see Table 4), is encouraging for our second hypothesis: humor appears to make a difference in learning only when it is perceived as actually funny. The average result of the students who gave low marks to the ‘fun’ component – just like the average result for the respondents in the same group who gave low ratings for the ‘humor’ content (73.01) – is comparable (more often than not, favorably) with the average results obtained in the ‘low’ sections for ‘familiarity with economics’, ‘attention’, ‘excitement’ and ‘interest’ of both the control and the test groups.
Table 3 reports the number of respondents that give high (3 or above) versus low (1 or 2) ratings to the following variables – familiarity with economics, attention, excitement, interest, humor and fun – for the full sample as well as for the control and test groups.
Table 4 reports average quiz results for the subgroups that give high (3 or above) and low (1 or 2) ratings to the following variables – familiarity with economics, attention, excitement, interest, humor and fun for the full sample as well as for the control and test groups. P-values for tests of significance for the difference in the means of the variables for the control versus test groups are reported for both the standard t-test (assuming equal population variances) and the Satterthwaite-Welch t-test, which allows unequal variances in the two populations. Statistical significance is denoted by *** (at 1%), ** (at 5%) and * (at 10%).
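The high/low split behind Tables 3 and 4 is simple to reproduce. The sketch below groups quiz results by whether a rating is 3 or above, and then checks that the two ‘fun’ subgroup averages reported in the text for the test group recombine into that group's overall average. The function names are illustrative; the count of 19 low-‘fun’ respondents is inferred as 26 minus 7.

```python
from statistics import mean
from typing import Dict, Optional, Sequence, Tuple

Summary = Tuple[int, Optional[float]]  # (number of respondents, average score)


def split_means(ratings: Sequence[int], scores: Sequence[float],
                threshold: int = 3) -> Dict[str, Summary]:
    """Average quiz score for high (>= threshold) and low rating subgroups."""
    high = [s for r, s in zip(ratings, scores) if r >= threshold]
    low = [s for r, s in zip(ratings, scores) if r < threshold]
    return {
        "high": (len(high), mean(high) if high else None),
        "low": (len(low), mean(low) if low else None),
    }


def recombine(n_high: int, mean_high: float,
              n_low: int, mean_low: float) -> float:
    """Weighted average of two subgroup means."""
    return (n_high * mean_high + n_low * mean_low) / (n_high + n_low)


# Test group: 7 respondents with high 'fun' ratings (average 87.39) and the
# remaining 19 with low ratings (average 71.52).
test_group_mean = recombine(7, 87.39, 19, 71.52)
```

The recombined value rounds to 75.79, which matches the test-group average quiz result reported in the text.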
Therefore, while we may conclude that being amused correlates with significantly better results, not being amused is not associated with worse performance than, for instance, not being attentive or not finding the topic interesting. Moreover, we do not find any support for the claim that humor distracted the respondents, since the average levels of attention (3.12) and the number of students in the ‘high attention’ versus ‘low attention’ groups (18 and 8) are exactly the same for the control and the test group. Finally, we note that the worst performances belong to the eight students in the test group who gave low marks to the ‘attention’ variable (with an average quiz result of 62.50) and the five students in the control group who gave high marks to the ‘humor’ variable (with an average quiz result of 64.71). In conclusion, and consistent with IHPT, we add a qualification to our second hypothesis: instructional humor may enhance learning if learners are genuinely amused. Our results are suggestive, but as they rely on the performance of a subset of only seven respondents, further research (at a larger scale) is needed if our second hypothesis is to be confirmed.
Concluding remarks and further work
This paper has explored the potential of large language models, in particular OpenAI’s GPT-4, to contribute to the field of economics education by incorporating computational humor into instructional texts. The results suggest that GPT-4 can be successful in producing adequate instructional humor in an entirely autonomous fashion and that respondents who find the instructional text amusing achieve significantly higher quiz results. However, our prima-facie findings are small-sample results, and merely open avenues for further testing and potential confirmation.
If confirmed, these results point toward at-scale, AI-driven personalization of instructional humor. The education sector – comprising both longer established institutions and edtechs – is actively looking for guidance on harnessing the power of AI to benefit each learner. Research going beyond the present study could be invaluable in this regard. Given a learner’s characteristics, at what points is it best to inject AI-generated instructional humor? Are some humor types (analogy, hyperbole, irony, word play) better than others in particular contexts? In the future, we can expect virtual instructors delivering educational content with human-like realism. How does humor impact learning in such a setting? How will the effectiveness of AI-generated instructional humor change if it is founded on humor algorithms from an experienced comedy writer (Toplyn, 2023)? Further technological developments will make it possible to examine the impact of AI-generated humor on learning in far greater detail. For example, non-invasive and easy-to-use electroencephalography technology has already been used to obtain experimental subjects’ focus dynamics at high temporal resolution, and can also be used to study enjoyment, anxiety, ‘flow’ state, memory formation etc. (Haruvi et al., 2022).
It is worth addressing an important objection sometimes encountered to the very premise of instructional humor: learners should be motivated enough that ‘sugar-coating’ with humor is not necessary. We agree that there are some learners who are fully committed and do not require such help. Others, however, should not be neglected – especially as they are likely to include disproportionate numbers of the educationally underprivileged, uninspired by the notion of learning for its own sake. Further, even the most committed learner in a main field may be less committed when it comes to other subjects. Our paper argues that, for these learners, AI-generated humor may be helpful (or at least not harmful). Such ‘sugar-coating’ is a form of temptation-bundling, which has increasingly been shown to be beneficial in a variety of contexts; for example, Milkman et al. (2014) show gym attendance to increase if going to the gym is bundled with listening to engaging audiobooks. (But note that, while listening to audiobooks makes a workout more enjoyable without making it more impactful, instructional humor may not only make the learning process more enjoyable, but also provide additional insights into the material and/or make it more memorable.) Thus, judicious use of AI-generated instructional humor can point the way toward making previously forbidding subject matter accessible and enjoyable to many learners who would otherwise be left behind.