A team from the Universitat Politècnica de València (UPV), part of the Valencian University Research Institute for Artificial Intelligence (VRAIN) and ValgrAI, has participated in the development of ADeLe, a new methodology that explains and predicts whether large language models (LLMs) will succeed or fail at specific new tasks they have not yet performed. The methodology also precisely identifies the limits of any given model's reasoning capacity.
The findings of this study, published today in the journal Nature, represent a major breakthrough: current methodologies only indicate how an AI model behaves on a specific test, whereas ADeLe, with its more cognitive evaluation, explains and predicts the behaviour of models a priori. This makes it possible to anticipate where a model will fail before it is launched, rather than discovering its errors in deployment.
With this more cognitive evaluation, "for the first time, we can predict with around 90% accuracy whether an AI model will solve a new task or not, before deploying it. For industry, this means detecting faults in time and avoiding the high costs of launching a system that does not perform as expected," explains Fernando Martínez-Plumed, a researcher at VRAIN at the UPV.
Breakthrough in the rigorous evaluation of AI capabilities
Given the current pace and spread of AI, this is a highly significant breakthrough for researchers, companies, external evaluators, policymakers and regulators, who have been calling for rigorous, scalable and standardised evaluation of AI capabilities, including in safety audits.
As stated in the article, "to date, AI evaluation has not met the demands of a rapidly evolving and increasingly diverse AI ecosystem. Understanding and anticipating performance has become an urgent requirement for a wide range of general-purpose AI systems". This new methodology is comprehensive and scalable, addressing the shortcomings of conventional AI evaluation, including the lack of explanatory and predictive power.
18 cognitive dimensions
The study was jointly conducted by José Hernández-Orallo, Professor of Computer Science, VRAIN researcher at the UPV and member of the ValgrAI UMI; Fernando Martínez-Plumed, senior lecturer in Computer Science and VRAIN researcher at the UPV; Yael Moros-Daval and Kexin Jiang-Chen, PhD students and VRAIN researchers at the UPV; and Behzad Mehrbakhsh, a PhD student at ValgrAI and VRAIN at the UPV.
The key to the new research is that it goes beyond measuring aggregate accuracy: it extracts a set of broad capability dimensions, enabling predictions that transfer to previously unseen tasks.
The new system organises the wide range of cognitive tasks faced by large AI language models into just 18 key dimensions, including attention, reasoning and the degree of task uniqueness. It then scores any real-world task on each of these dimensions, according to how much it demands of that specific capability. By having a model attempt a sufficient number of these scored tasks, across their levels of difficulty, a capability profile for the model is obtained.
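To make this concrete, below is a minimal sketch, in Python, of how a capability profile of this kind could drive predictions on an unseen task. Everything in it is illustrative: the three dimension names (a subset of the 18), the 0-5 demand scale, the logistic characteristic curves and the "weakest dimension" combination rule are assumptions made for exposition, not the paper's actual implementation.

```python
# Illustrative sketch of an ADeLe-style prediction (assumed, simplified).
# Tasks are annotated with a demand level (0-5) on each cognitive
# dimension; a per-dimension logistic "characteristic curve" links
# demand to the model's probability of success.

import math

# Hypothetical subset of the 18 dimensions (names are illustrative).
DIMENSIONS = ["attention", "reasoning", "uniqueness"]

# Toy capability profile: the demand level at which this model's success
# probability falls to 50% on each dimension (assumed values).
capability = {"attention": 3.2, "reasoning": 2.5, "uniqueness": 4.0}
SLOPE = 1.5  # steepness of the characteristic curve (assumed)

def p_success_on_dimension(demand: float, ability: float) -> float:
    """Logistic curve: success is likely while demand stays below ability."""
    return 1.0 / (1.0 + math.exp(SLOPE * (demand - ability)))

def predict_success(task_demands: dict) -> float:
    """Combine per-dimension probabilities by taking the minimum,
    treating the most demanding dimension as the bottleneck."""
    return min(
        p_success_on_dimension(task_demands[d], capability[d])
        for d in DIMENSIONS
    )

# A new, unseen task scored on each dimension (0 = trivial, 5 = extreme).
new_task = {"attention": 2.0, "reasoning": 4.0, "uniqueness": 1.0}
print(f"Predicted probability of success: {predict_success(new_task):.2f}")
# The reasoning demand (4.0) exceeds the model's ability (2.5), so the
# prediction is low: the task is expected to fail on that dimension.
```

The intuition mirrors the article's description: success is likely while every dimension's demand stays below the model's ability on that dimension, and failure is predicted as soon as one dimension exceeds it.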
Key findings
Using ADeLe, the research team evaluated numerous AI performance benchmarks and drew four key conclusions. First, current AI benchmarks do not measure what they are intended to measure, as they often draw on capabilities for which they were not designed. Second, AI models exhibit distinct patterns of strengths and weaknesses across capabilities, depending on their size, reasoning methodology and model family. Third, the new ADeLe system provides accurate explanations and predictions of whether AI systems will succeed or fail at a specific new task. And finally, the seemingly conflicting strands of research on whether AI models are capable of reasoning are each partly correct, because they refer to different levels of difficulty: some current AI performance tests require only basic problem-solving, whilst others demand advanced logic, abstraction and deep domain knowledge.
The authors state in a summary of the findings that "the clearest picture offered by ADeLe is as follows: reasoning models (such as OpenAI's o1) show real and quantifiable improvements over standard models, not only in logic and mathematics, but also in surprising areas such as understanding what a user is actually asking".
The study, entitled "General Scales Unlock AI Evaluation with Explanatory and Predictive Power", was conducted jointly by researchers from the University of Cambridge, the Universitat Politècnica de València, Princeton, Carnegie Mellon and William & Mary, together with professionals from Microsoft Research and the Centre for Automation and Robotics (CAR, CSIC-UPM), amongst other institutions.