Chemical language models don't need to understand chemistry

15/10/2025 Universität Bonn

A study by the University of Bonn shows that transformer models used in chemistry learn only statistical correlations

Language models are now also being used in the natural sciences. In chemistry, they are employed, for instance, to predict new biologically active compounds. Chemical language models (CLMs) must be extensively trained. However, they do not necessarily acquire knowledge of biochemical relationships during training. Instead, they draw conclusions based on similarities and statistical correlations, as a recent study by the University of Bonn demonstrates. The results have now been published in the journal Patterns.

Large language models are often astonishingly good at what they do, whether that's proving mathematical theorems, composing music, or drafting advertising slogans. But how do they arrive at their results? Do they actually understand what constitutes a symphony or a good joke? It is not so easy to answer that question. “All language models are a black box,” emphasises Prof. Dr Jürgen Bajorath. “It's difficult to look inside their heads, metaphorically speaking.”

Nevertheless, Jürgen Bajorath, a cheminformatics scientist at the Lamarr Institute for Machine Learning and Artificial Intelligence at the University of Bonn, has attempted to do just that. Specifically, he and his team have focused on a special form of AI algorithm: transformer CLMs. These models work in a similar way to ChatGPT, Google Gemini and Elon Musk's Grok, which are trained on vast quantities of text, enabling them to generate sentences independently. CLMs, on the other hand, are usually trained on significantly less data. They acquire their knowledge from molecular representations and relationships, such as so-called SMILES strings: character strings that represent a molecule and its structure as a sequence of letters and symbols.
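
To give a concrete sense of what a SMILES string looks like, here is a minimal Python sketch using the open-source RDKit toolkit. The example molecule (aspirin) and the choice of toolkit are illustrative assumptions, not taken from the study itself.

```python
# Minimal SMILES illustration, assuming the open-source RDKit toolkit
# is installed (e.g. via `pip install rdkit`). The molecule shown here
# is aspirin, chosen purely for illustration.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # letters and symbols encode atoms, bonds and rings
mol = Chem.MolFromSmiles(smiles)  # parse the string into a molecule object

print(mol.GetNumAtoms())          # 13 heavy atoms
print(Chem.MolToSmiles(mol))      # the same molecule, rewritten in canonical form
```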

Systematic manipulation of training data

In pharmaceutical research, scientists often attempt to identify substances that can inhibit certain enzymes or block receptors. CLMs can be used to predict active molecules based on the amino acid sequences of target proteins. “We used sequence-based molecular design as a test system to better understand how transformers arrive at their predictions,” explains Jannik Roth, a doctoral student working with Bajorath. “After the training phase, if you introduce a new enzyme to such a model, it may produce a compound that can inhibit it. But does that mean that the AI has learned the biochemical principles behind such inhibition?”

CLMs are trained on pairs of amino acid sequences of target proteins and their respective known active compounds. To address their research question, the scientists systematically manipulated the training data. “For example, we initially only fed the model specific families of enzymes and their inhibitors,” explains Bajorath. “When we then used a new enzyme from the same family for testing purposes, the algorithm actually suggested a plausible inhibitor.” However, the situation was different when the researchers tested with an enzyme from a different family, i.e. one that performs a different function in the body. In this case, the CLM failed to predict active compounds correctly.
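
The study's actual data pipeline is not reproduced here, but the kind of manipulation described above can be pictured with a short, purely hypothetical Python sketch: train only on one enzyme family, then test on enzymes from the same family and from a different one. All field and variable names below are invented for illustration.

```python
# Hypothetical sketch of the family-holdout experiment described above.
# All names (records, "family", "sequence", "inhibitor_smiles") are
# invented for illustration; the study's data and model are not reproduced.

def split_by_family(records, train_families):
    """Split (protein sequence, inhibitor) pairs by enzyme family."""
    train = [(r["sequence"], r["inhibitor_smiles"])
             for r in records if r["family"] in train_families]
    test_same = [r for r in records if r["family"] in train_families]
    test_other = [r for r in records if r["family"] not in train_families]
    return train, test_same, test_other

records = [
    {"family": "kinase",   "sequence": "MKVL...", "inhibitor_smiles": "CC(=O)N..."},
    {"family": "protease", "sequence": "MALW...", "inhibitor_smiles": "O=C(N)..."},
]
train, test_same, test_other = split_by_family(records, {"kinase"})
# In the study, a CLM trained on `train` suggested plausible inhibitors
# for new enzymes like those in `test_same`, but failed on `test_other`.
```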

Statistical rule of thumb

“This suggests that the model has not learned generally applicable chemical principles, i.e. how enzyme inhibition usually works chemically,” says the scientist. Instead, the suggestions are based solely on statistical correlations, i.e. patterns in the data: if a new enzyme resembles a training sequence, a similar inhibitor will probably be active. In other words, similar enzymes tend to interact with similar compounds. “Such a rule of thumb based on statistically detectable similarity is not necessarily a bad thing,” emphasises Bajorath, who heads the “AI in Life Sciences and Health” area at the Lamarr Institute. “After all, it can also help to identify new applications for existing active substances.”

However, the models used in the study lacked biochemical knowledge when estimating similarities. They considered enzymes (or receptors and other proteins) to be similar if 50–60 percent of their amino acid sequences matched, and suggested similar inhibitors accordingly. The researchers could randomise and scramble the sequences at will; as long as enough of the original amino acids were retained, the predictions barely changed. Yet often only very specific parts of an enzyme are necessary for it to perform its task: a single amino acid change in such a region can render an enzyme dysfunctional, while other regions matter more for structural integrity and less for specific functions. “During their training, the models did not learn to distinguish between functionally important and unimportant sequence parts,” emphasises Bajorath.
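
The 50–60 percent threshold and the scrambling experiment can be pictured with a simple sketch. The naive position-wise identity measure below is an assumption made for illustration only; the study's own similarity analysis is not reproduced here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid letters

def percent_identity(a: str, b: str) -> float:
    """Naive position-wise identity for equal-length sequences.
    Illustration only; real analyses use proper sequence alignment."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))

def scramble(seq: str, keep_fraction: float, rng: random.Random) -> str:
    """Randomly substitute residues while retaining a given fraction,
    mimicking the kind of manipulation described above (not the study's code)."""
    return "".join(
        residue if rng.random() < keep_fraction else rng.choice(AMINO_ACIDS)
        for residue in seq
    )

rng = random.Random(0)
seq = "MKVLAAGIVALLAEWAGHKYD" * 5              # invented example sequence
mutant = scramble(seq, keep_fraction=0.55, rng=rng)
print(round(percent_identity(seq, mutant), 1))  # roughly 55-60 percent identity
```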

Models simply repeat what they have read before

The results of the study therefore show that the transformer CLMs trained for sequence-based compound design lack any deeper chemical understanding, at least for this test system. In other words, they merely recapitulate, with minor variations, what they have already picked up in a similar context at some point. “This does not mean that they are unsuitable for drug research,” emphasises Bajorath, who is also a member of the Transdisciplinary Research Area (TRA) “Modelling” at the University of Bonn. “It is quite possible that they suggest drugs that actually block certain receptors or inhibit enzymes.” However, this is certainly not because they understand chemistry so well, but because they recognise similarities in text-based molecular representations and statistical correlations that remain hidden from us. This does not discredit their results, but they should not be overinterpreted either.

Participating institutions and funding

The work was financially supported by the German Academic Scholarship Foundation.

Publication: Jannik P. Roth, Jürgen Bajorath: Unraveling learning characteristics of transformer models for molecular design, Patterns, https://doi.org/10.1016/j.patter.2025.101392, URL: https://www.cell.com/patterns/fulltext/S2666-3899(25)00240-5
Attachments
  • Bajorath-Roth_13-10-2025_gh_02.jpg: Prof. Dr. Jürgen Bajorath and doctoral student Jannik P. Roth from Life Science Informatics at the University of Bonn. Photo: Gregor Hübl/University of Bonn
  • transformer.jpg: Schematic representation of a transformer model for predicting new compounds from protein sequence data. Graphic: J. P. Roth and J. Bajorath
