
Verified by Psychology Today

Can AI and Genomics Predict the Next COVID Variant?

A machine learning algorithm was trained to predict new COVID-19 variants.

Key points

  • A new study shows how AI and genomics can predict future mutations of the SARS-CoV-2 virus.
  • The scientists partitioned the genetic samples into clusters, then analyzed the fitness of each cluster.
  • According to the researchers, their algorithm can be applied to different viral phenotypes as well as any viral genomic dataset.
Source: Peter-Gamal/Pixabay

The predictive capability of machine learning, a branch of artificial intelligence (AI), is accelerating discoveries in the life sciences.

A new study shows how AI and genomics can predict future mutations of SARS-CoV-2, the virus that causes COVID-19.

“The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been characterized by waves of transmission initiated by new variants replacing older ones,” wrote the Broad Institute of MIT and Harvard research team, with their co-authors from the University of Massachusetts Medical School and other affiliations. “Given this pattern of emergence, there is an obvious need for the early detection of novel variants to prevent excess deaths.”

The research team developed a hierarchical Bayesian regression model called PyR0 that can provide scalable analytics across the complete set of publicly available SARS-CoV-2 genomes. The model predicts which viral lineages are emerging.

The algorithm is fully Bayesian. Unlike frequentist linear regression, which returns a single point estimate for each model parameter, Bayesian linear regression treats the parameters as probability distributions, and the output is typically modeled as drawn from a normal (Gaussian) distribution. The goal is to find the posterior distribution of the model parameters rather than a single optimal value.
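
The contrast between a point estimate and a posterior distribution can be illustrated with a small sketch. This is not the PyR0 model: it is a toy two-parameter linear regression on synthetic data, and it assumes a zero-mean Gaussian prior and a known noise variance so that the posterior has a closed form. The function names and all numeric values are hypothetical.

```python
import numpy as np

def ordinary_least_squares(X, y):
    """Frequentist fit: one point estimate per weight."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bayesian_linear_regression(X, y, noise_var, prior_var):
    """Posterior over weights for y = X @ w + Gaussian noise.

    Assumes a prior w ~ Normal(0, prior_var * I) and a known noise
    variance, which makes the posterior Gaussian with the mean and
    covariance returned here.
    """
    n_features = X.shape[1]
    # Posterior precision = prior precision + precision contributed by the data.
    precision = np.eye(n_features) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

# Synthetic data (illustrative only): intercept 1.5, slope -0.7, noise sd 0.5.
rng = np.random.default_rng(0)
true_w = np.array([1.5, -0.7])
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ true_w + rng.normal(scale=0.5, size=50)

w_ols = ordinary_least_squares(X, y)
post_mean, post_cov = bayesian_linear_regression(X, y, noise_var=0.25, prior_var=10.0)
post_sd = np.sqrt(np.diag(post_cov))

print("point estimate:    ", w_ols)
print("posterior mean:    ", post_mean)
print("posterior std dev: ", post_sd)
```

With a weak prior like this one, the posterior mean lands close to the least-squares estimate. What the Bayesian fit adds is the standard deviation, a statement of how uncertain each parameter remains after seeing the data.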

“Through systematic backtesting, we found that the model would have provided early warning and aided in the identification of VoCs [variants of concern] had it been routinely applied to SARS-CoV-2 samples, confirming its utility for public health and underscoring the value of rapid sharing of genomic data,” the researchers wrote.

The AI model was fit to 6,466,300 SARS-CoV-2 genomes from GISAID (Global Initiative on Sharing All Influenza Data). The team used stochastic variational inference to fit the large model. Even with this approach, the task required solving an optimization problem with over 75 million dimensions.
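
To give a feel for what stochastic variational inference does, here is a minimal sketch on a one-parameter toy problem: estimating the unknown mean of a set of numbers. It uses the classic conjugate-model form of the technique, in which an approximate posterior is nudged toward a noisy target computed from a small random batch of data at each step. Large models like the one in the study are typically fit with gradient-based optimizers instead, so this shows the idea and not the study's implementation. All names and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 10,000 observations x_i ~ Normal(mu, 1) with unknown mean mu.
N, noise_var, prior_var = 10_000, 1.0, 100.0
data = rng.normal(loc=2.0, scale=np.sqrt(noise_var), size=N)

# Exact posterior, for comparison (available only because the toy model is conjugate).
exact_precision = 1.0 / prior_var + N / noise_var
exact_mean = (data.sum() / noise_var) / exact_precision

# Approximate posterior q(mu) = Normal, stored as its natural parameters:
# the precision and the precision-weighted mean.
q_precision, q_pw_mean = 1.0, 0.0

batch_size, n_steps = 100, 2_000
for t in range(n_steps):
    # Look at a small random subset of the data instead of all of it.
    batch = rng.choice(data, size=batch_size, replace=False)
    # Noisy estimate of the optimal parameters: the batch is rescaled by
    # N / batch_size so that it stands in for the full dataset.
    est_precision = 1.0 / prior_var + N / noise_var
    est_pw_mean = (N / batch_size) * batch.sum() / noise_var
    # Decaying step size, so the noise from subsampling averages out over time.
    rho = (t + 1.0) ** -0.7
    q_precision = (1 - rho) * q_precision + rho * est_precision
    q_pw_mean = (1 - rho) * q_pw_mean + rho * est_pw_mean

q_mean = q_pw_mean / q_precision
q_sd = q_precision ** -0.5

print("approximate posterior mean:", q_mean, " exact:", exact_mean)
print("approximate posterior sd:  ", q_sd, " exact:", exact_precision ** -0.5)
```

Each step touches only 100 of the 10,000 observations, which is what makes this family of methods practical for datasets with millions of genomes.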

The scientists partitioned the genetic samples into clusters, then analyzed the fitness of each cluster. Specifically, the team created 3,000 clusters from 1,544 PANGO lineages and modeled the fitness of lineages separately across 1,560 geographic regions. The study authors reported:
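
The article does not spell out how fitness is computed. One common way to model competing lineages, assumed here purely for illustration, is multinomial logistic growth: each lineage has a growth rate relative to a baseline, and its share of cases over time follows from those rates. In the sketch below, the three lineages, their starting levels, their growth rates, and the 5.5-day generation time are all hypothetical values, not figures from the study.

```python
import numpy as np

def lineage_proportions(intercepts, growth_rates, days):
    """Share of each lineage on each day under multinomial logistic growth.

    Returns an array of shape (len(days), n_lineages) whose rows sum to 1.
    """
    logits = intercepts[None, :] + growth_rates[None, :] * days[:, None]
    # Subtract the row maximum before exponentiating, for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical lineages: a baseline and two later arrivals that start rare.
intercepts = np.array([0.0, -4.0, -8.0])     # log starting level relative to baseline
growth_rates = np.array([0.0, 0.05, 0.12])   # per-day growth relative to baseline
days = np.arange(0, 181, 30)

props = lineage_proportions(intercepts, growth_rates, days)
for day, row in zip(days, props):
    print(f"day {day:3d}: " + "  ".join(f"{p:.3f}" for p in row))

# Convert per-day growth into fold change in fitness per generation.
generation_time = 5.5  # days; an assumed value for this sketch
relative_fitness = np.exp(growth_rates * generation_time)
print("fitness relative to baseline:", relative_fitness)
```

The lineage with the highest growth rate starts out rarest yet dominates by the end of the window, which is the pattern of variant replacement the model is meant to flag early. A statement such as "8.9 times higher fitness than the original lineage" is a ratio of this kind.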

“The model correctly infers World Health Organization classification variant Omicron (PANGO BA.2) to have the highest fitness to date: 8.9 times [95 percent confidence interval (CI) 8.6 to 9.2] higher than the original A lineage, accurately foreshadowing its rise in regions where it is circulating.”

According to the researchers, their algorithm can be applied to different viral phenotypes as well as any viral genomic dataset.

“Using this model, emerging lineages can be spotted together with the mutations that contribute toward transmissibility, not only in Spike but also in other viral proteins,” the authors reported. “The model can prioritize lineages as they emerge for public health concern.”

Copyright © 2022 Cami Rosso. All rights reserved.
