In the field of biomedicine and public health, continuous viral mutation and evolution may enable viruses to cross species barriers, infect non-natural hosts, and subsequently trigger human-to-human transmission or even global pandemics. Historically, multiple major outbreaks, such as COVID-19 and influenza pandemics, have been caused by zoonotic viruses. Therefore, in the face of potential threats from unknown viruses, developing intelligent models capable of rapidly assessing their adaptability and transmission risks at the genotypic level has become a forefront challenge in infectious disease prevention and control.
Traditional experimental methods for viral risk identification, while reliable, are time-consuming and low-throughput, making it difficult to perform real-time and prospective risk assessments on large-scale viral sequence data. In recent years, artificial intelligence technologies have demonstrated potential in predicting phenotypes such as receptor binding, host adaptation, and evolutionary escape based on viral gene or protein sequences. However, existing models are primarily designed for specific viruses or genes, with strict limitations on sequence type and length, thereby lacking generalizability for diverse unknown viruses. Moreover, the scarcity of phenotypic labels and the presence of annotation noise in public databases significantly constrain the performance of supervised learning models. Thus, under conditions of incomplete labeling, constructing an accurate, highly generalizable, and directly applicable intelligent framework for predicting the adaptation risk of unknown viruses represents a critical challenge in the field.
In summary, a general-purpose artificial intelligence approach that can predict adaptation risk without relying entirely on predefined labels would provide essential technical support for early warning and control decisions on emerging infectious diseases, and has both theoretical value and practical application prospects.
Research Progress
A research team led by Prof. Tao Jiang and Prof. Jing Li from the State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Science, developed a viral risk prediction framework named GIVAL, based on the pre-trained viral protein language model vBERT. The work was carried out in collaboration with Prof. Shi-Shun Zhao from the College of Mathematics, Jilin University, and Prof. Jianwei Wang from the NHC Key Laboratory of Systems Biology of Pathogens and Christophe Merieux Laboratory, National Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College. Prof. Jing Li, Prof. Tao Jiang, and Prof. Jianwei Wang are the corresponding authors of the paper, and the first author is Shu-Yang Jiang, a Ph.D. candidate at the College of Mathematics, Jilin University. The study proposes a general intelligent prediction method for assessing the adaptation risks of unknown viruses (Fig. 1).
First, viral protein sequences were tokenized dynamically using a hidden Markov model (HMM). The viral protein language model vBERT, trained with statistical sampling of viral genome sequences and HMM-based dynamic tokenization, outperformed mainstream pre-trained models such as DNABERT-2, proteinBERT, and ESM-2 in benchmark tests (Fig. 2A-E).
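To make the idea of HMM-based dynamic tokenization concrete, the following is a minimal sketch and not the tokenizer used for vBERT, whose states and parameters are not described here. It assumes a hypothetical two-state HMM in which state "B" opens a new token and state "I" extends the current one, with made-up transition and emission probabilities; Viterbi decoding then picks the most likely boundary placement, so token lengths vary with the sequence content rather than being fixed in advance.

```python
import math

# Hypothetical two-state HMM (an assumption for illustration only):
# "B" begins a new token, "I" continues the current token.
STATES = ("B", "I")
START = {"B": 1.0, "I": 0.0}  # a sequence always opens with a token start
TRANS = {"B": {"B": 0.3, "I": 0.7}, "I": {"B": 0.4, "I": 0.6}}
HYDROPHOBIC = set("AILMFVWY")  # toy residue grouping, not a trained model


def emission(state, residue):
    """Toy emission probabilities; each state sums to 1 over 20 amino acids."""
    if state == "B":
        return 0.08 if residue in HYDROPHOBIC else 0.03
    return 0.02 if residue in HYDROPHOBIC else 0.07


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most likely state path for a non-empty sequence."""
    scores = [{s: _log(START[s]) + _log(emission(s, seq[0])) for s in STATES}]
    back = []
    for residue in seq[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + _log(TRANS[p][s]))
            cur[s] = prev[best] + _log(TRANS[best][s]) + _log(emission(s, residue))
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def tokenize(seq):
    """Split a protein sequence into variable-length tokens at each 'B' state."""
    if not seq:
        return []
    tokens = []
    for residue, state in zip(seq, viterbi(seq)):
        if state == "B" or not tokens:
            tokens.append(residue)
        else:
            tokens[-1] += residue
    return tokens


example = "MKTIIALSYIFCLVFA"  # arbitrary example string
tokens = tokenize(example)
print(tokens)
```

In a real setting the HMM parameters would be estimated from viral sequence data rather than set by hand, and the resulting tokens would be mapped to vocabulary IDs before being fed to the language model.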
Second, based on vBERT embeddings, a semi-supervised general AI framework named GIVAL was established for predicting adaptation risks of unknown viruses, and the full pipeline was systematically evaluated. The semi-supervised learning approach endowed GIVAL with higher prediction accuracy and greater tolerance to labeling errors, enabling reliable modeling and accurate prediction for unknown input sequences under label-deficient conditions (Fig. 2F-K).
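How semi-supervised learning can exploit unlabeled sequences may be easier to see in a small example. The sketch below is a generic self-training loop with a nearest-centroid classifier, not GIVAL's actual algorithm, which is not detailed here. The 2-D points stand in for vBERT embeddings, and the labels, the `min_margin` threshold, and all function names are hypothetical. It illustrates only the label-scarce setting: unlabeled points are pseudo-labeled when the classifier is confident, and ambiguous points are left out.

```python
import math


def centroid(points):
    """Mean vector of a non-empty list of equal-length points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def predict(centroids, x):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    ranked = sorted((math.dist(x, c), label) for label, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, unlabeled, min_margin=1.0, rounds=5):
    """Grow the labeled set with confident pseudo-labels, then return final centroids."""
    labeled = {label: list(points) for label, points in labeled.items()}
    pool = list(unlabeled)
    for _ in range(rounds):
        centroids = {label: centroid(pts) for label, pts in labeled.items()}
        remaining = []
        for x in pool:
            label, margin = predict(centroids, x)
            if margin >= min_margin:
                labeled[label].append(x)  # confident: accept the pseudo-label
            else:
                remaining.append(x)  # ambiguous: keep unlabeled
        if len(remaining) == len(pool):
            break  # nothing new was labeled in this round
        pool = remaining
    return {label: centroid(pts) for label, pts in labeled.items()}


# Toy stand-ins for sequence embeddings: one labeled example per class.
labeled = {"adapted": [(0.0, 0.0)], "non_adapted": [(4.0, 4.0)]}
unlabeled = [(0.5, 0.2), (0.3, 0.8), (3.6, 4.1), (4.2, 3.5), (2.0, 2.1)]

model = self_train(labeled, unlabeled)
label, margin = predict(model, (0.4, 0.4))
print(label, round(margin, 2))
```

The confidence threshold is the main design choice in this kind of loop: a stricter threshold admits fewer pseudo-labels but limits the spread of mistakes, which is one common way such methods limit the influence of unreliable labels.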
Finally, GIVAL identified the reported shift in receptor binding of two Middle East respiratory syndrome coronavirus (MERS-CoV)-related strains, discerned adaptation differences between canine and equine H3N8 influenza viruses, inferred high-risk mutations in H5N1 influenza viruses (Fig. 3), and assessed the recent adaptation shift in monkeypox viruses.
Future Prospects
The general-purpose artificial intelligence framework for viral risk prediction proposed in this study enables assessment of the potential risks posed by future unknown viruses. Even when viral sequences are incomplete and annotated data are scarce, it can deliver accurate and robust risk evaluation, providing decision support for early warning and proactive prevention and control of viral infectious diseases.
The complete study is accessible via DOI:10.34133/research.0871