New AI Machine Learning Gains a Toehold on Synthetic Biology
Harvard and MIT's AI for synthetic RNA-based tools is tested on the coronavirus.
Posted Oct 14, 2020
In a landmark achievement, two teams of scientists from the Wyss Institute at Harvard University and the Massachusetts Institute of Technology (MIT) unveiled novel machine learning solutions for an RNA-based synthetic biology tool and published their findings in two research papers last week in Nature Communications.
Life is messy and inherently complex. Biology, the scientific study of life and living organisms, is a complicated endeavor, as is the engineering field of synthetic biology. Synthetic biology is the science of creating new biological systems or redesigning those found in nature. It draws on many fields and technologies, including biophysics, chemistry, molecular biology, genomics, engineering, CRISPR, DNA sequencing, microfluidics, BioBricks, and materials science. Recent advances in the pattern recognition and prediction capabilities of machine learning, a branch of artificial intelligence (AI), and of deep learning in particular, may help tease discoveries out of that complexity.
Toehold switches are a synthetic RNA (ribonucleic acid) tool that can detect and respond when triggered by the presence of specific RNA molecules. RNA is a single-stranded nucleic acid that is involved in many regulatory cell functions. The genetic material of viruses consists of either RNA or DNA (deoxyribonucleic acid), according to Molecular Cell Biology. A gene is the basic unit of hereditary information and occupies a fixed position on a chromosome, which consists of DNA. Examples of viruses with DNA include herpes, human papillomaviruses, pox viruses, and adenoviruses. In contrast, coronaviruses, such as SARS-CoV-2, which causes COVID-19, have RNA as their genetic material, according to the Johns Hopkins POC-IT ABX Guide.
In the first study, scientists James Collins, Luis Soenksen, George Church, Alexander Garruss, and Nicolaas Angenent-Mari created a custom dataset of over 91,500 toehold switches to enable deep learning. Deep neural networks typically require large datasets in order for the algorithm to “learn” and recognize patterns with a high degree of accuracy. However, no large toehold datasets existed. To solve this data availability problem, the scientists created a toehold switch data repository based on 240,000 sequences across the entire genomes of 23 disease-causing viruses and 906 human transcription factors.
The researchers generated toehold data libraries for “on” and “off” states from a synthesized oligo pool and then sorted them using a fluorescence-activated cell sorter. After sorting, the toehold switches were quantified using next-generation sequencing and then validated in vitro, the setting commonly used in synthetic biology for cell-free protein synthesis.
Before training a deep learning algorithm on the newly created and validated dataset, the team tested traditional methods for analyzing synthetic RNA modules to see how well they predict toehold switch behavior. They used k-mer searches of biological sequence data, analytic software (ViennaRNA and NUPACK), and a sophisticated thermodynamic model that uses a ribosome-binding site (RBS) calculator. The team found that existing methods were suboptimal predictors, and some were not very practical for computer-aided synthetic biology design of this type.
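For readers unfamiliar with the term, a k-mer is simply a substring of length k, and a k-mer search looks for such substrings in sequence data. The short Python sketch below counts overlapping k-mers in a sequence. It is only an illustration of the general idea, not the researchers' pipeline; the function name `count_kmers` and the example sequence are made up for this purpose.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty range, so no k-mers are counted.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Hypothetical RNA fragment, invented for illustration only.
example = "AUGGCAUGGC"
counts = count_kmers(example, 3)
print(counts["AUG"], counts["GCA"])  # prints: 2 1
```

Counts like these can be compared between well-performing and poorly performing sequences to look for over-represented motifs, which is one common way k-mer statistics are used.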
So the team set out to use deep learning with the goal of developing models with higher predictive capabilities than the current state-of-the-art solutions. Specifically, the team used a feed-forward neural network known as a multilayer perceptron (MLP). A three-layer neural network trained on 30 previously calculated thermodynamic rational features achieved higher predictive capabilities than state-of-the-art analytics for RNA synthetic biology tools. The team then trained a neural network using sequence representations of the toehold switches, rather than precalculated features, to boost predictive power and reduce bias. The sequence representations improved performance further. Then the scientists validated the multilayer perceptron.
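To make the idea of a "sequence representation" concrete, the sketch below shows one common approach, one-hot encoding, followed by a forward pass through a small feed-forward network with three dense layers. Everything here is an assumption for illustration: the article does not say which encoding, layer sizes, or activation functions the researchers used, the example sequence is invented, and the weights are random rather than trained, so the printed score carries no biological meaning.

```python
import math
import random

NUCLEOTIDES = "ACGU"

def one_hot(sequence):
    """Flatten a sequence into a vector with four entries per nucleotide."""
    vector = []
    for base in sequence.upper().replace("T", "U"):
        if base not in NUCLEOTIDES:
            raise ValueError("unexpected base: " + repr(base))
        vector.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vector

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b) for each output unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def random_layer(rng, n_in, n_out):
    """Untrained placeholder weights; a real model would learn these from data."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict(sequence, layers):
    """Run the encoded sequence through each layer; the last layer uses a sigmoid."""
    activations = one_hot(sequence)
    for index, (weights, biases) in enumerate(layers):
        is_last = index == len(layers) - 1
        activations = dense(activations, weights, biases, sigmoid if is_last else relu)
    return activations[0]

rng = random.Random(0)
sequence = "GGGAUCUAUUACUACCUAUG"  # invented 20-nucleotide example
sizes = [4 * len(sequence), 16, 8, 1]  # layer widths chosen arbitrarily
layers = [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
score = predict(sequence, layers)
print(round(score, 3))
```

The point of the sketch is the input: the network sees only which nucleotide sits at which position, with no precalculated thermodynamic features.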
The results were favorable. “The improved performance observed when training the models directly on nucleotide sequence rather than thermodynamic features, even for an external dataset, suggest a competent degree of biological generalization and supports the value of modeling RNA synthetic biology tools using deep-learning and high-throughput datasets, removing the current assumptions of mechanistic rational parameters,” wrote the researchers.
The researchers then tested the toehold dataset on convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent neural networks to see whether these higher-capacity machine learning models would yield improved predictive capabilities. Interestingly, the researchers found that these neural network architectures “did not lead to superior predictive models, as compared to the sequence-based, three-layer MLP described previously.” Instead, the team found that the increased model capacity brought issues of either under-fitting or over-fitting.
The scientists then trained a convolutional neural network on a two-dimensional nucleotide complementarity map representation in an effort to visualize RNA secondary structures, and called the approach VIS4Map (Visualizing Secondary Structure Saliency Map).
When applied to the entire toehold switch dataset, the VIS4Map convolutional neural network “significantly outperformed an MLP trained on rational thermodynamic features” and could “identify both equilibrium and kinetically stable RNA secondary structures.”
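A two-dimensional complementarity map can be pictured as a grid in which each cell records whether the nucleotides at two positions of the same sequence could pair with each other. The sketch below builds such a grid. It is a simplified guess at the general idea rather than the encoding used in the study: the pairing rules (Watson-Crick pairs plus the G-U wobble pair), the 0/1 values, and the function name `complementarity_map` are all assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair; including the wobble pair is an assumption.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def complementarity_map(sequence):
    """Return a square grid where cell [i][j] is 1 if bases i and j can pair."""
    seq = sequence.upper().replace("T", "U")
    return [[1 if (a, b) in PAIRS else 0 for b in seq] for a in seq]

# Invented six-nucleotide example; its two halves are reverse complements.
grid = complementarity_map("GGAUCC")
for row in grid:
    print(" ".join(str(value) for value in row))
```

Because the grid is image-like, a convolutional network can scan it for diagonal runs of pairable positions, which correspond to potential stems in a folded RNA.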
Thus, the researchers demonstrated that deep learning models trained on sequences alone are useful for analyzing RNA synthetic biology tools. The findings also suggest that this deep learning tool could be extended beyond toehold switches and other synthetic RNA to discover unknown equilibrium or kinetically stable structures in RNA more generally.
Building on the first study and the toehold dataset, a second study was conducted by the scientific team of Diogo Camacho, Timothy Lu, Bianca Lepe, Miguel Alcantar, Pradeep Ramesh, Katherine Collins, and Jacqueline Valeri. The researchers created two different but complementary deep learning models for toehold selection and design called STORM (Sequence-based Toehold Optimization and Redesign Model) and NuSpeak (Nucleic-Acid Speech).
NuSpeak is aptly named, as it is based on natural language processing (NLP) combined with a convolutional neural network. NuSpeak enables the redesign of the last nine nucleotides of a toehold switch without impacting the other 21 nucleotides. The researchers reported a roughly 160 percent improvement in sensor performance using NuSpeak when optimizing toehold switches to sense the SARS-CoV-2 virus genome.
STORM enables the complete redesign of the toehold. The team combined a multilayer perceptron with a convolutional neural network to create a model that processes toehold sequences as one-dimensional images, or lines of nucleotide bases, with predictive capabilities. The researchers used STORM to optimize four SARS-CoV-2 viral RNA sensors that were predicted to perform poorly, which increased toehold performance.
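Treating a sequence as a one-dimensional image means sliding small filters along it, much as a conventional CNN slides filters across the pixels of a picture. The sketch below shows a single convolution step over a one-hot encoded sequence. It is not STORM itself: the filter is hand-written to respond to the motif "GGA", whereas a trained model would learn its filters from data, and the names `one_hot_rows` and `conv1d` and the example sequence are invented for illustration.

```python
NUCLEOTIDES = "ACGU"

def one_hot_rows(sequence):
    """Return one row per position, each row holding four channel values."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in sequence.upper()]

def conv1d(rows, kernel):
    """Slide a (width x 4) kernel along the sequence with stride 1 and no padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(rows) - width + 1):
        window = rows[start:start + width]
        outputs.append(sum(k * x
                           for kernel_row, input_row in zip(kernel, window)
                           for k, x in zip(kernel_row, input_row)))
    return outputs

# Hand-written filter for the motif "GGA" (columns are A, C, G, U).
motif_filter = [
    [0.0, 0.0, 1.0, 0.0],  # G
    [0.0, 0.0, 1.0, 0.0],  # G
    [1.0, 0.0, 0.0, 0.0],  # A
]
activations = conv1d(one_hot_rows("AUGGAGGA"), motif_filter)
print(activations)  # prints: [0.0, 1.0, 3.0, 1.0, 1.0, 3.0]
```

The two peaks of 3.0 mark the two places where "GGA" occurs; in a full model, many such filter outputs would be passed on to further layers that produce the final prediction.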
This recent advancement may help accelerate the research and development of rapid diagnostics, as well as new treatments for diseases with de novo drugs and therapeutics. Artificial intelligence and synthetic biology are two innovative technologies that, when combined, may help improve the human condition.
Copyright © 2020 Cami Rosso All rights reserved.