Study: Large language models still lack general reasoning skills

Large language models like GPT-4, the model behind ChatGPT, train on vast stores of data to complete one task: produce a convincing sequence of words in response to a user’s written request. The tools seem to do more than that, however. Researchers have reported extensively on the models’ apparent abilities to complete tasks that require reasoning, like predicting the next letter in a sequence or solving logic puzzles after being given the rules. But whether those models demonstrate actual reasoning or rely on clever shortcuts, such as finding similar text in their training data, remains an open question.

Research by a pair of SFI researchers, published in February in Transactions on Machine Learning Research, challenges the notion that GPT-4 achieves robust humanlike reasoning. In the work, SFI Professor Melanie Mitchell and former SFI Research Fellow Martha Lewis (University of Amsterdam) tested GPT-4’s ability to work through a variety of analogy puzzles. But they added a twist: they altered the puzzles from their original form to create variant versions, inventing new alphabets out of symbols, for example. Then they compared the solving ability of GPT-4 and other models to that of humans.

“Humans were essentially able to solve the analogies,” said Mitchell. “GPT models found that a lot more difficult.” 

The work suggests that ChatGPT may not be performing broad reasoning when it solves such puzzles. Recognizing the limitations of these tools is critical to knowing when and to what extent they can be trusted, said Mitchell, whose work focuses on analogical reasoning, which is the ability to draw conclusions and inferences from one situation and apply them to another. 

The duo tested two GPT models on three kinds of analogies, which can reveal how people connect one abstract idea to another. This kind of reasoning is fundamental to human cognition, Mitchell said, and should be a central focus of studying AI. Lewis came to SFI to work with Mitchell, and their collaboration led to these studies.

The first type of analogy they used involved simple puzzles with letters. “I might say, if ABC goes to ABD, what does IJK go to?” asked Lewis. The answer, which both humans and GPT models could find, was IJL. But, said Lewis, when they varied the problem by positing “fictional” alphabets with different letter orderings, humans could still solve it. GPT models faltered. “You need general reasoning abilities to solve these problems,” Lewis said. 
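To make the letter-string task concrete, here is a minimal sketch, not drawn from the paper’s code, of the rule at play: the transformation from ABC to ABD replaces the final letter with its successor, and what counts as the successor depends entirely on which alphabet ordering is in force. The scrambled “fictional” alphabet below is an invented example for illustration; the study’s actual permuted alphabets may differ.

```python
# Illustrative sketch of the letter-string analogy described above (not the authors' code).
# The rule "ABC goes to ABD" increments the last letter to its successor; the successor
# depends on the alphabet ordering being used.

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Hypothetical "fictional" alphabet: the same 26 letters in a scrambled order.
FICTIONAL = "XLGCFWBQVORAESZJMUDITKPHNY"

def successor(letter: str, alphabet: str) -> str:
    """Return the letter that follows `letter` in the given alphabet ordering."""
    return alphabet[alphabet.index(letter) + 1]

def apply_rule(target: str, alphabet: str) -> str:
    """Apply the 'increment the last letter' rule to the target string."""
    return target[:-1] + successor(target[-1], alphabet)

print(apply_rule("IJK", STANDARD))   # IJL: the answer both humans and GPT models found
print(apply_rule("IJK", FICTIONAL))  # IJP under this particular scrambled ordering
```

Under the standard alphabet the rule yields the familiar answer; under a scrambled ordering, applying the same abstract rule requires tracking a new successor relation rather than recalling a well-worn pattern, which is the kind of adaptation the study probed.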

Another type of problem asked participants to fill in the blanks in numerical grids; the third asked them to recognize similarities among short stories, each only a few sentences long. In most cases, modifying the puzzles from their original format stymied the models. 

The work suggests that claims of general reasoning abilities in large language models may be premature. Even though the models performed well on particular tasks in early studies, “if you poke them more, they don’t hold up,” said Mitchell. “Just because one of these models does well on a particular set of tasks doesn’t necessarily mean it’s going to be robust.” 

Both Lewis and Mitchell said it’s important for researchers to develop and implement tests for robustness, which describes the ability to adapt to small changes in problems or situations. New tests for robustness should include benchmarks, or a set of agreed-upon standard tasks, that can assess how well AI systems, as well as humans, can adapt to novel situations. Robustness tests, they said, would give users a way to gauge the trustworthiness of large language models. Robustness is important, said Lewis, because in practice, these models will regularly encounter new circumstances and challenges unlike the data they’ve already analyzed. 

“The world is so much richer than even their training data,” Lewis said.

Read the paper “Evaluating the Robustness of Analogical Reasoning in GPT Models” in Transactions on Machine Learning Research, February 2025.
