
Verified by Psychology Today

Artificial Intelligence

State-of-the-Art AI Predicts Gene Activity in Human Cells

AI foundation model may help genetics, cancer, and complex disease research.

Source: Maxwell_joe/Pixabay

Human health may be getting a big boost from the bits and bytes of computer science. In particular, artificial intelligence (AI) machine learning models are helping to unravel the mysteries of the human genome for potentially life-saving treatment of genetic and complex diseases. This week, Columbia University scientists and their colleagues published a peer-reviewed study in Nature that unveils an AI foundation model capable of predicting gene activity across many different human cell types.

Gene expression is the essential process by which cells turn the genetic information encoded in DNA into usable products, such as proteins, that are important for the development, structure, and function of organisms. DNA is first transcribed into RNA, and the RNA is then translated into a chain of amino acids that forms a protein.
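As a rough illustration of those two steps, here is a minimal Python sketch. It is a toy, not anything from the study: the function names, the example DNA sequence, and the tiny codon table (which holds only the codons this example needs) are all assumptions made for illustration, and the sketch ignores the regulation that decides whether and how strongly a gene is expressed.

```python
# Toy codon table (an illustrative subset, not the full genetic code).
CODON_TABLE = {
    "AUG": "M",  # methionine, the usual start codon
    "GCC": "A",  # alanine
    "UUU": "F",  # phenylalanine
    "UAA": "*",  # stop codon
    "UAG": "*",  # stop codon
    "UGA": "*",  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Transcription: write the DNA coding strand as mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read the mRNA three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 12-letter sequence chosen only for this example.
mrna = transcribe("ATGGCCTTTTAA")
protein = translate(mrna)
print(mrna)     # AUGGCCUUUUAA
print(protein)  # MAF
```

In a real cell, whether this sequence is read at all depends on transcriptional regulation, which is the part a model like GET aims to predict.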

To predict gene expression, it is critical to account for transcriptional regulation. When transcriptional regulation fails, inappropriate patterns of gene expression can arise, which can result in disease. For example, a separate study by Princeton University researchers Ell and Kang shows how transcriptional regulation plays a key part in tumor progression and metastasis.

“In this study, we introduce GET, a state-of-the-art foundation model specifically engineered to decipher mechanisms governing transcriptional regulation across a wide range of human cell types,” wrote senior author Raul Rabadan, PhD, a professor in the Departments of Systems Biology, Biomedical Informatics, and Surgery and director of both the Program for Mathematical Genomics and the Center for Topology of Cancer Evolution and Heterogeneity at Columbia University, along with a team of research partners.

In molecular genetics and genomics, the ability to predict transcriptional regulation matters because that regulation plays a vital role in controlling gene expression. However, existing AI models of transcription lack robustness, according to the Columbia University researchers and their colleagues.

“Computational models of transcription lack generalizability to accurately extrapolate to unseen cell types and conditions,” the researchers wrote.

In machine learning, the term “generalizability” refers to the ability of an AI algorithm to make predictions on entirely new data that it has not previously been exposed to. The more robust an AI algorithm is, the better it can make predictions on novel, previously unseen data.
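One common way to probe generalizability is to hold out an entire group of data during training and score the model only on that group. The Python sketch below does this on synthetic data. Everything in it is an assumption made for illustration: the cell-type labels, the toy relationship between an "accessibility" score and "expression," and the simple straight-line model are hypothetical stand-ins and are not GET or its data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def make_toy_data(seed=0):
    """Synthetic (accessibility, expression) pairs for four made-up cell types."""
    rng = random.Random(seed)
    data = {}
    for cell_type in ["A", "B", "C", "D"]:
        rows = []
        for _ in range(50):
            accessibility = rng.random()
            expression = 2.0 * accessibility + 0.5 + rng.gauss(0, 0.1)
            rows.append((accessibility, expression))
        data[cell_type] = rows
    return data


def leave_one_cell_type_out(data):
    """Train on all cell types but one, then score on the unseen one."""
    scores = {}
    for held_out in data:
        train = [row for ct, rows in data.items() if ct != held_out for row in rows]
        slope, intercept = fit_line([r[0] for r in train], [r[1] for r in train])
        test = data[held_out]
        predictions = [slope * x + intercept for x, _ in test]
        scores[held_out] = pearson(predictions, [y for _, y in test])
    return scores


scores = leave_one_cell_type_out(make_toy_data())
for cell_type, r in scores.items():
    print(f"held-out cell type {cell_type}: Pearson r = {r:.3f}")
```

Because the held-out cell type never contributes to training, a high score here reflects prediction on unseen data rather than recall of the training set. In this toy every cell type follows the same rule, so the scores are high by construction; real cell types differ, which is what makes generalization hard.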

The Columbia University paper points out that the AI transformer model Enformer, as well as the deep convolutional neural network models Basenji2 and Expecto, make their predictions on the training cell types after fine-tuning, so by design they are limited in their use and in their ability to generalize.

How to tackle this challenge? The scientists look to the recent AI breakthroughs with state-of-the-art foundation models.

“With extensive pretraining on broad and diverse datasets, foundation models provide a generalized understanding of their training data, upon which specialized adaptations can be built to address specific tasks or challenges,” the researchers wrote.

In computer science, AI foundation models are large deep learning neural networks, often generative, that are pretrained on massive amounts of broad, unlabeled data and can then be used for a variety of tasks, not just a single purpose.

“Recently, foundation models such as GPT-4 and ESM-2 have emerged as a transformative approach,” wrote the study authors.

OpenAI’s GPT-4 is a transformer-style AI model that accepts both images and text as prompts (making it multimodal) and generates text output. Evolutionary Scale Model (ESM-2), created by Meta Fundamental AI Research Protein Team (FAIR) researchers, is a pretrained large language model for proteins.

The scientists highlight other genomic research studies that use AI foundation models, such as scGPT, a generative transformer for single-cell multi-omics that was pretrained on single-cell sequencing data from over 33 million cells; scFoundation (also known as xTrimoscFoundationα), a transformer for single-cell analysis pretrained on more than 50 million human single-cell transcriptomic profiles; and Geneformer, a transformer model pretrained on roughly 30 million single-cell transcriptomes.

What sets this study apart from the others is that the Columbia University scientists and their research partners deliberately trained their AI transformer model using data from normal tissue, instead of diseased human cells. The GET algorithm learned features relevant to predicting gene expression from training data covering more than 1.3 million human cells.

According to the researchers, no AI foundation model had yet been created to understand the effect of chromatin dynamics on transcription. Chromatin consists of the DNA and proteins that make up chromosomes, the gene-containing structures located in the cell nucleus of plants, animals, and people, according to the National Human Genome Research Institute. A typical human cell contains 46 chromosomes arranged in 23 pairs; half are inherited from the father and the other half from the mother. Pairs 1 through 22 are the autosomal chromosomes. The 23rd pair consists of the sex chromosomes, which determine whether a human is male (XY) or female (XX) at birth. Chromosomes are important because they carry hereditary information from one cell generation to the next.

“Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types,” the researchers reported.

The scientists created a more robust AI model for transcription that can predict gene activity with high accuracy in new cell types it has not previously seen. Using GET, they created a public catalog of transcription factor interactions and gene regulation with cell-type specificity.

In the lab, they experimentally verified GET’s in silico predictions for PAX5, a gene encoding a transcription factor that is involved in B lymphocyte (B cell) development and is frequently mutated in B cell precursor acute lymphoblastic leukemia (B-ALL), a common pediatric cancer. B cells create antibodies, proteins that bind to pathogens such as viruses, parasites, and bacteria, or to other foreign substances, in order to neutralize them.

“Using the PAX5 gene as a case study, we illustrated the utility of the catalogue in identifying functional variants in disordered protein domains that were previously difficult to study,” concluded the scientists.

With this breakthrough, researchers have a new AI tool to help predict gene activity across a wide variety of human cell types. In the not-so-distant future, it may expedite research on genetic disorders and complex diseases such as neurological disorders, developmental disorders, syndromes, autoimmunity, metabolic diseases, cardiovascular diseases, and cancer.

Copyright © 2025 Cami Rosso All rights reserved.
