A research team from the Polytechnic University of Valencia in Spain has discovered that as large language models (LLMs) become larger and more sophisticated, they tend to be less forthcoming with users about their inability to provide answers.
In a study published in the journal Nature, the team examined the latest versions of three widely used LLM families, evaluating their responses, their accuracy, and how readily users could detect incorrect answers.
To test the accuracy of the three LLM families—BLOOM, LLaMA, and GPT—the research team posed thousands of questions and compared each model's answers with those given by its earlier versions for the same questions.
They also varied the task types, covering mathematics, science, word puzzles, and geography, as well as text generation and operations such as sorting lists.
The research findings revealed several notable trends.
Overall, the accuracy of the chatbots improved with each new version, yet it declined when faced with more challenging questions.
Surprisingly, as LLMs become larger and more sophisticated, they become less likely to admit when they cannot provide an accurate answer.
In earlier versions, most LLMs would candidly inform users when they could not find an answer or needed additional information.
In contrast, newer versions tend to guess more, resulting in a greater number of responses overall, including both correct and incorrect answers.
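This shift from abstaining to guessing can be illustrated with a toy tally of response categories. The three labels and the sample data below are hypothetical examples chosen for illustration, not figures from the study:

```python
from collections import Counter

def tally(responses):
    """Count how a model's answers fall into three categories:
    'correct', 'incorrect', or 'avoidant' (the model declines to answer)."""
    return Counter(responses)

# Hypothetical labels for the same 10 questions, answered by two versions.
older = ["correct", "avoidant", "avoidant", "correct", "avoidant",
         "incorrect", "avoidant", "correct", "avoidant", "correct"]
newer = ["correct", "correct", "incorrect", "correct", "incorrect",
         "incorrect", "avoidant", "correct", "correct", "incorrect"]

for name, responses in [("older", older), ("newer", newer)]:
    counts = tally(responses)
    print(name, dict(counts))
```

In this made-up example the newer version attempts more questions (one avoidant answer instead of five), so its counts of both correct and incorrect answers rise, mirroring the pattern the article describes.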
Even more concerning, the study found that all of the LLMs occasionally gave inaccurate answers even to easy questions, indicating that their reliability remains an open problem.
These findings highlight a paradox in the development of AI: despite becoming more powerful, models may also become less transparent about their limitations.
This presents new challenges in using and trusting AI systems, requiring users to be more cautious and developers to focus on improving not only accuracy but also the models’ “self-awareness.”