New Open-Source AI Machine Learning Tools to Fight Cancer
IBM Research in Switzerland debuts novel anticancer AI deep learning solutions.
Posted July 25, 2019

In Basel, Switzerland, at this week’s 18th European Conference on Computational Biology (ECCB) and 27th Conference on Intelligent Systems for Molecular Biology (ISMB), IBM will share three novel artificial intelligence (AI) machine learning tools, PaccMann, INtERAcT, and PIMKL, that are designed to assist cancer researchers.
Bringing a drug to market is costly in both time and money: the process can span more than a decade, with no guarantee of FDA approval. AI machine learning has the potential to greatly accelerate pharmaceutical drug discovery. The IBM Research Zurich team of Matteo Manica, Ali Oskooei, Jannis Born, and María Rodríguez Martínez, along with colleagues Vigneshwari Subramanian and Julio Sáez-Rodríguez from RWTH Aachen University in Germany, created PaccMann (Prediction of anticancer compound sensitivity with multi-modal attention-based neural networks) and published their work on arXiv in July 2019.
“There have been a plethora of works focused on prediction of drug sensitivity in cancer cells, however, the majority of them have focused on the analysis of unimodal datasets such as genomic or transcriptomic profiles of cancer cells,” wrote the IBM researchers in their study. “To the best of our knowledge, there have not been any multi-modal deep learning solutions for anticancer drug sensitivity prediction that combine a molecular structure of compounds, the genetic profile of cells and prior knowledge of protein interactions.”
PaccMann’s deep learning solution takes a three-pronged approach to data, incorporating transcriptomic profiles of cancer cells, protein interactions within cells, and the molecular structure of compounds in order to predict how sensitive cancer cells are to a drug. The study tested PaccMann’s ability to predict drug sensitivity on over 200,000 drug-cell line pairs in the Genomics of Drug Sensitivity in Cancer (GDSC) database, which characterizes human cancer cell lines against a wide range of anticancer drugs and correlates the sensitivity patterns of cell lines with genomic data to identify biomarkers that are predictive of sensitivity. The GDSC database is one of the largest public resources of drug sensitivity data, covering nearly 75,000 experiments and 138 anticancer drugs across nearly 700 cancer cell lines.
The PaccMann model was fed drug-cell pairs, with compound information encoded using the simplified molecular-input line-entry system (SMILES), a line notation that encodes molecular structures as ASCII strings. ASCII is a computer character set consisting of 128 seven-bit combinations. The cancer cell’s gene expression profile is also provided as input. From these inputs, the model predicts the half-maximal inhibitory concentration (IC50), a sensitivity value used to evaluate drug potency.
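To make the idea of SMILES as ASCII text concrete, here is a minimal Python sketch that checks a SMILES string is plain ASCII and turns it into a fixed-length list of integers, the kind of numeric input a neural network can consume. The function names, the padding scheme, and the character-level tokenization are assumptions made for illustration; the article does not describe how PaccMann itself tokenizes SMILES. Character-level splitting is also a simplification, since it breaks two-letter atoms such as Cl or Br into separate symbols.

```python
def is_ascii(text):
    """Return True if every character is 7-bit ASCII (code points 0-127)."""
    return all(ord(ch) < 128 for ch in text)


def build_vocab(smiles_list):
    """Map each distinct character to an integer index; 0 is reserved for padding."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: index + 1 for index, ch in enumerate(chars)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a list of indices, truncated or zero-padded to max_len."""
    ids = [vocab[ch] for ch in smiles][:max_len]
    return ids + [0] * (max_len - len(ids))


# Two well-known molecules written in SMILES notation.
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
ethanol = "CCO"

vocab = build_vocab([aspirin, ethanol])
encoded = encode(ethanol, vocab, max_len=5)

print(is_ascii(aspirin))  # True
print(encoded)            # [5, 5, 6, 0, 0]
```

In a setup like this, the encoded sequence would stand in for the compound, and a separate numeric vector would carry the cell line’s gene expression profile.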
“With the rise of deep learning methods and their proven ability to learn the most informative features from raw data, it seems imperative to approach chemical problems from a similar standpoint,” wrote the researchers.
According to a July 22, 2019 IBM Research article written by Matteo Manica and Joris Cadow, “PaccMann not only predicted sensitivity for the drug-cell line pairs more accurately than alternative tools, it also offered explainability, highlighting which specific genes and which portions of the compound’s molecular structure it paid the most attention to while performing the predictions.”
A wealth of cancer research is locked in scientific publications, and mining it manually is extremely time-consuming and limited in reach. For example, today there are more than 1,585,900 papers on cancer on PubMed Central (PMC), a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). Tens of thousands more cancer-related research studies are published each year. That volume makes the literature a natural candidate for AI, which can sift through massive amounts of text and extract the specific information researchers need.
The IBM Research team of Matteo Manica, Roland Mathis, Joris Cadow, and María Rodríguez Martínez, located in Zurich, Switzerland, created INtERAcT (Interaction Network infErence from vectoR representATions of words), an AI method that uses unsupervised machine learning to extract interactions from a large repository of biomedical publications for cancer research.
INtERAcT automatically extracts specific information from biomedical publications. The method can be applied to any knowledge domain. In their study, it was used to identify and extract protein-protein interactions (PPIs) related to prostate cancer.
The researchers used word embedding, a set of language modeling and feature learning methods that map the words in a vocabulary to vectors, which is a more robust approach than frequency-based methods. “Word vector representations have gained broad recognition thanks to the recent work of Mikolov et al., who demonstrated that word embeddings can facilitate very efficient estimations of continuous-space word representations from large datasets (~1.6 billion words),” the IBM researchers wrote in their study.
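As a rough illustration of what mapping words to vectors makes possible, the Python sketch below scores pairs of terms by how closely their vectors point in the same direction. Everything in it is an assumption made for the example: the three-dimensional vectors are made-up toy values rather than trained embeddings, the protein names are placeholders, and cosine similarity is used only as a simple stand-in. It is not the interaction metric that INtERAcT implements.

```python
import math

# Hypothetical toy embeddings; real word vectors are learned from a large
# text corpus and typically have hundreds of dimensions.
embeddings = {
    "protein_a": [0.9, 0.1, 0.3],
    "protein_b": [0.8, 0.2, 0.4],
    "protein_c": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_pairs(emb):
    """Score every unordered pair of terms and sort from most to least similar."""
    names = sorted(emb)
    pairs = [
        (first, second, cosine_similarity(emb[first], emb[second]))
        for i, first in enumerate(names)
        for second in names[i + 1:]
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


ranked = rank_pairs(embeddings)
for first, second, score in ranked:
    print(f"{first} - {second}: {score:.3f}")
```

With these toy values, protein_a and protein_b rank as the closest pair because their vectors are nearly parallel. In a text-mining setting, high-scoring pairs would be candidates for further checking against a curated resource.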
Their approach leverages prior research published in 2013 on arXiv titled, “Efficient Estimation of Word Representations in Vector Space,” by Google’s Jeffrey Dean, Greg Corrado, Kai Chen, and Tomas Mikolov. In the Google study, the researchers demonstrated that word vectors can be “successfully applied to automatic extension of facts in Knowledge Bases, and also for verification of correctness of existing facts,” and that “it is possible to train high-quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent).” In effect, the study showed that it is possible to “compute very accurate high dimensional word vectors from a much larger data set” due to reduced computational complexity.
“INtERAcT exploits a vector representation of words, computed on a corpus of domain-specific knowledge, and implements a new metric that estimates an interaction score between two molecules in the space where the corresponding words are embedded,” wrote the research team in their study published in Nature Machine Intelligence in April 2019. “We use INtERAcT to reconstruct the molecular pathways of 10 different cancer types using corpora of disease-specific articles, considering the STRING database as a benchmark.”
Existing solutions require some tuning by human experts in order to perform well, so those systems are not fully automated. INtERAcT, by contrast, is designed to work without that manual step.
“Our metric outperforms currently adopted approaches and it is highly robust to parameter choices, leading to the identification of known molecular interactions in all studied cancer types,” reported the researchers. “Furthermore, our approach does not require text annotation, manual curation or the definition of semantic rules based on expert knowledge, and can therefore be efficiently applied to different scientific domains.”
Another complex task for cancer researchers is creating predictive models to use for patient stratification and biomarker discovery. The same team of scientists in Zurich is also the creator of PIMKL (Pathway-Induced Multiple Kernel Learning), a novel supervised machine learning algorithm for classification that aims to predict phenotypes from molecular data with both strong predictive performance and interpretability. Specifically, PIMKL is a method to predict phenotype from multi-omic data such as mRNA (messenger RNA) and CNA (copy number alteration) by optimizing a blend of pathway-induced kernels. PIMKL was deployed on IBM Cloud and is open-source software.
In their study published in NPJ Systems Biology and Applications in March 2019, the Zurich-based IBM Research team of Manica, Mathis, Cadow, and Rodríguez Martínez focused on finding a way to predict the likelihood of breast cancer relapse within five years after the first treatment.
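The phrase “a blend of kernels” can be pictured with a small Python sketch. A kernel here is a table of similarity scores between every pair of samples, and one such table is computed per group of genes. The tables are then combined as a weighted average. All names and numbers below are hypothetical: the gene groups stand in for pathways, the similarity is a plain dot product rather than the pathway-induced kernels PIMKL uses, and the weights are fixed by hand, whereas the actual method optimizes them.

```python
def linear_kernel(samples, feature_idx):
    """Pairwise dot products between samples, using only the selected features."""
    sub = [[row[j] for j in feature_idx] for row in samples]
    return [[sum(a * b for a, b in zip(u, v)) for v in sub] for u in sub]


def blend_kernels(kernels, weights):
    """Weighted average of kernel matrices; weights are normalized to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(norm, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Three samples (rows) measured on four genes (columns); toy values.
samples = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [1.0, 1.0, 0.0, 2.0],
]

# Hypothetical gene groups standing in for biological pathways.
pathways = {"pathway_a": [0, 1], "pathway_b": [2, 3]}

kernels = [linear_kernel(samples, idx) for idx in pathways.values()]
weights = [0.75, 0.25]  # fixed for illustration; a learning step would fit these
blended = blend_kernels(kernels, weights)

for row in blended:
    print(row)
```

The appeal of this design is that each weight is tied to a named gene group, so a large learned weight points to a pathway that matters for the prediction. The blended matrix could then be passed to any classifier that accepts a precomputed kernel.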
“We have demonstrated that the resulting weighted combination of kernels can be interpreted as a phenotypic molecular signature and provides insights into the underlying molecular mechanisms,” reported the researchers. “The quality and the stability of the obtained signatures has been thoroughly investigated, and we have shown that PIMKL outperforms other methods and finds stable molecular signatures across different breast cancer cohorts.”
PaccMann, INtERAcT, and PIMKL are all open-source software: they are available at no charge and may be modified and redistributed. Making these novel AI deep learning tools open to the public will enable the creation of future models that accelerate scientists’ work in many areas, including drug discovery at pharmaceutical, life sciences, and biotech companies, in the quest for better, more targeted cancer treatments with fewer side effects.
Copyright © 2019 Cami Rosso All rights reserved.
References
Manica, Matteo, Cadow, Joris. “Novel AI tools to accelerate cancer research.” IBM Research. July 22, 2019. Retrieved from https://www.ibm.com/blogs/research/2019/07/ai-tools-for-cancer-research/