AI is rubbish at answering a simple question which is easily solvable by children, say researchers.
Scientists slammed the likes of OpenAI's GPT for giving nonsensical responses while being "overconfident in their wrong solutions."
Scientists at the AI research nonprofit LAION conducted the research, with their findings published in a paper that has yet to be peer-reviewed.
The testing hinged on a so-called "Alice in Wonderland problem."
The aim was to check straightforward reasoning using basic maths.
Various artificial intelligence models were asked to solve this question: "Alice has [X] brothers and she also has [Y] sisters. How many sisters does Alice's brother have?"
"Though the problem requires a bit of thought, it's not exactly bridge troll riddle-level hard," said science and tech news site Futurism.
"The answer, naturally, is however many sisters Alice has, plus Alice herself. So if Alice had three brothers and one sister, each brother would have two sisters."
The researchers tried the question on OpenAI's GPT-3, GPT-4, and GPT-4o models, Anthropic's Claude 3 Opus, Google's Gemini, and Meta's Llama models, as well as Mistral AI's Mixtral, Mosaic's DBRX, and Cohere's Command R+.
"Only one model, the brand new GPT-4o, received a success rate that, by standardized school grades, was technically passing," said Futurism.
ALICE IN WONDERLAND
The LAION researchers' paper, published last week, is titled: "Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models."
Large Language Models (LLMs) are a type of artificial intelligence model trained on vast amounts of text data.
They are designed to understand and generate human-like text, and are used in a variety of applications such as translation services and chatbots.
"ChatGPT, developed by OpenAI, is a prime example of a SOTA (state-of-the-art) LLM," said ChatGPT Guide.
LAION-affiliated scientists from across the globe, including the UK and Germany, probed claims that artificial intelligence excels in tricky tasks.
But what they found was "a dramatic breakdown of function and reasoning capabilities of state-of-the-art models," the researchers said.
AI LIED ABOUT RESULT
The models were given the so-called Alice in Wonderland question - "a simple, short, common sense problem formulated in concise natural language, easily solvable by humans."
However, even though they mucked up their answers, the AIs "expressed strong overconfidence in their wrong solutions, while providing nonsensical reasoning-like explanations," the paper added.
What's more, they fibbed in an attempt to "justify and back the validity of their clearly failed responses, making them sound plausible."
The team has urged the scientific and technological community to "urgently reassess the claimed capabilities" of the current generation of machine learning models that can comprehend and generate human language text.
"Such reassessment also requires action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks."
The team said it had given the AI models varying versions of the simple Alice in Wonderland question.
"The problem has a light quiz style and is arguably no challenge for most adults, and probably... not hard to solve via common sense reasoning if posed to children above a certain age," the paper added.
As for the models' attempts to talk people into accepting their incorrect responses, the scientists warned of a "dramatic breakdown."
"Explanations may mislead readers into thinking that there might be sound reasoning behind the wrong answers, or at least stir confusion.
"The breakdown appears dramatic because when attempting to fix the failures... the models keep producing more nonsense, often in lengthier and sometimes more entertaining form, leading stubbornly to the same wrong final answers.
"We conclude that the capabilities of the current generation of state-of-the-art large language models [such as ChatGPT] to perform even simple reasoning on common sense tasks are heavily compromised.
"Current language model benchmarks, especially those aiming on measuring reasoning capabilities, do not properly reflect such weaknesses."