Artificial Intelligence

Do AI Models like GPT Really Get the Joke?

Award-winning research probes AI's ability to understand humor.

Posted August 21, 2023 | Reviewed by Kaja Perina

“Humor is the ability to see three sides to one coin.” - Ned Rorem, American Composer (1923-2022)


A winner of the “Best Paper Award” at the 61st Annual Meeting of the Association for Computational Linguistics (ACL’23), held in July in Toronto, Canada, takes a scientific approach to probing the ability of artificial intelligence (AI) to comprehend humor.

“Large neural networks can now generate jokes, but do they really ‘understand’ humor?” asked lead author Jack Hessel, PhD, a research scientist at the Allen Institute for AI (AI2), along with his co-authors: Ana Marasović, PhD, an Assistant Professor in the Kahlert School of Computing at the University of Utah; Jena Hwang, PhD, a Research Engineer at AI2; Lillian Lee, PhD, the Charles Roy Davis Professor at Cornell University; Jeff Da of Amazon; Rowan Zellers, PhD, a researcher at OpenAI; renowned humorist Robert Mankoff, President of Cartoon Collections, Cartoon Editor of the digital weekly magazine Air Mail, and long-time Cartoon Editor at The New Yorker magazine; and Yejin Choi, PhD, Associate Professor at the University of Washington and Senior Research Manager at AI2.

The data for this study come from more than 700 weekly caption contests run by The New Yorker over 14 years. In each contest, readers are asked to submit funny captions for a cartoon. The editors select the top three captions from what can be thousands of submissions, and readers then vote for the winner. The researchers also used crowdsourced quality estimates for some contests.

“These tasks are difficult because the connection between a winning caption and image can be quite subtle, and the caption can make playful allusions to human experience, culture, and imagination,” wrote the scientists.

Using this data, the researchers tested various AI models on three tasks: pairing cartoons with jokes, spotting the winning caption, and explaining why a caption paired with an image is humorous. They ran these tests in two settings: a pixel approach, in which computer vision models have access to the cartoon images, and a description approach, in which models instead receive human-authored text summaries of the cartoons.

“We find that both types of models struggle at all three tasks,” the researchers reported.

The researchers found that AI has much room for improvement before it comes close to a human-level understanding of humor. For the pixel approach, they evaluated CLIP ViT-L/14 @ 336 px, a fine-tuned image and text model, and OFA Huge, a pretrained model that unifies modalities (such as vision and language) and tasks in a simple sequence-to-sequence learning framework.

The best-performing AI model for the pixel approach, CLIP ViT-L/14, reached only 62% accuracy on pairing captions to cartoons, far below the 94% achieved by humans.

In the description approach, GPT-4 (5-shot) achieved the highest accuracy, 84.5%, on the task of pairing captions to cartoons, outperforming T5-Large, T5-11B, fine-tuned GPT-3 175B, and GPT-3.5 (5-shot).

For the task of predicting The New Yorker editors’ top three captions, fine-tuned GPT-3 achieved 69.8% accuracy and GPT-4 achieved 68.2% accuracy, both only slightly above the 64.6% accuracy of the human estimate. For predicting crowd picks, the human estimate performed best with 83.7% accuracy, followed by GPT-4 with only 73.3% accuracy.

Moreover, when it came to explaining jokes, the researchers found that even the best-performing AI model, GPT-4, fell short of human-written explanations.

“We demonstrate that today’s vision and language models still cannot recognize caption relevance, evaluate (at least in the sense of reproducing crowd sourced rankings), or explain The New Yorker Caption Contest as effectively as humans can,” the researchers reported.