Researchers at Los Alamos National Laboratory have developed a groundbreaking deep learning model designed to analyze the intricate relationship between transcription factors and gene activity.
A new AI model leverages deep learning to understand the binding of transcription factors to DNA, focusing on the process of DNA breathing.
This approach has led to a 9.6% improvement in one key metric for predicting transcription factor binding, offering insights that could benefit drug development and genomic research.
Revolutionary AI Model for Disease Research
To better understand DNA’s role in disease, scientists at Los Alamos National Laboratory have developed EPBDxDNABERT-2, a multimodal deep learning model. This model is designed to precisely identify interactions between transcription factors (proteins that regulate gene activity) and DNA. EPBDxDNABERT-2 draws on a process known as “DNA breathing,” in which the DNA double helix spontaneously opens and closes, and uses it as an input so the model can capture these subtle dynamics. This capability has the potential to enhance drug design for diseases rooted in gene activity.
“There are many types of transcription factors, and the human genome is incomprehensibly large,” explained Anowarul Kabir, a researcher at Los Alamos and lead author of the study. “So, it is necessary to find out which transcription factor binds to which location on the incredibly long DNA structure. We tried to solve that problem with artificial intelligence, particularly deep-learning algorithms.”
Enhancing Drug Development With DNA Dynamics
DNA, which holds the equivalent of 3 billion English letters in each human cell, acts as a blueprint for growth and function. Transcription factors bind to DNA regions and regulate gene expression, the way genes guide cell development and function. This regulation plays a role in diseases such as cancer, so accurately predicting transcription factor binding locations could have a significant impact on drug development.
The foundational model used by the research team was trained on DNA sequences. The team built a DNA simulation program that captures numerous DNA dynamics and integrated it with the genomic foundation model. The result, EPBDxDNABERT-2, can process genome sequences across chromosomes and take the corresponding DNA dynamics as input. One such input, DNA breathing (the local and spontaneous opening and closing of the DNA double-helix structure), correlates with transcriptional activity, such as transcription factor binding.
“The integration of the DNA breathing features with the DNABERT-2 foundational model greatly enhanced transcription factor-binding predictions,” said Los Alamos researcher Manish Bhattarai. “We give sections of DNA code as input to the model and ask the model whether it binds to a transcription factor, or not, across many cell lines. The results improved the predictive probability of binding specific gene locations with many transcription factors.”
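The setup Bhattarai describes, a DNA segment plus its breathing features going in and a bind-or-not answer coming out for each transcription factor and cell line, can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the team's implementation: the one-hot encoding, the `fuse_features` and `predict_binding` helpers, the single linear layer, and all numbers are hypothetical stand-ins for the far larger DNABERT-2 network and the simulated breathing data.

```python
import math

# Hypothetical one-hot encoding for the four DNA bases (an assumption;
# the real model uses learned embeddings from DNABERT-2).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def fuse_features(sequence, breathing):
    """Concatenate each base's one-hot code with its breathing score."""
    if len(sequence) != len(breathing):
        raise ValueError("need one breathing value per base")
    fused = []
    for base, opening in zip(sequence, breathing):
        fused.extend(ONE_HOT[base])
        fused.append(opening)
    return fused


def sigmoid(x):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_binding(features, weights, biases):
    """Return one independent binding probability per label.

    Each label stands for a (transcription factor, cell line) pair,
    so several labels can be 'bound' for the same DNA segment.
    """
    probs = []
    for label_weights, bias in zip(weights, biases):
        score = sum(w * f for w, f in zip(label_weights, features)) + bias
        probs.append(sigmoid(score))
    return probs


sequence = "ACGT"
breathing = [0.1, 0.4, 0.3, 0.2]  # made-up per-base opening scores
features = fuse_features(sequence, breathing)

# Two hypothetical labels with arbitrary, untrained weights.
weights = [[0.0] * len(features), [0.5] * len(features)]
biases = [0.0, -1.0]
probs = predict_binding(features, weights, biases)
print(probs)
```

Using a separate sigmoid per label, rather than one softmax over all labels, reflects the fact that a DNA segment may bind several transcription factors at once or none at all.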
Leveraging Supercomputers for Genomic Analysis
The team ran their deep-learning model on the Laboratory’s newest supercomputer, Venado, which combines a central processing unit with a graphics processing unit to drive artificial intelligence capabilities. A deep-learning model works in ways similar to the brain’s neural networks, incorporating images and text and uncovering complex patterns to generate predictions and insights.
To train the model, the team used gene sequencing data from 690 experimental results, encompassing 161 distinct transcription factors and 91 human cell types. They found that EPBDxDNABERT-2 significantly improves, by 9.6% in one key metric, the prediction of the binding of over 660 transcription factors. Further experiments on in vitro datasets, drawn from experiments in a controlled environment, complemented the in vivo datasets, or the data drawn directly from research with living organisms, such as mice.
The Promise of Multimodal Computational Genomics
The team found that while DNA breathing alone can estimate transcriptional activity reasonably accurately, the multimodal model can also extract binding motifs, the specific DNA sequences to which transcription factors bind, which are a crucial element for explaining transcription processes.
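To make the idea of a binding motif concrete, the sketch below scans a made-up DNA string for a short consensus pattern written with standard IUPAC nucleotide codes. It is not how EPBDxDNABERT-2 extracts motifs; the `find_motif` helper and the example sequence are hypothetical, and the pattern `TATAWAW` is used only as a familiar textbook consensus (the TATA box).

```python
# Subset of IUPAC nucleotide codes: each code lists the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT",
}


def find_motif(sequence, motif):
    """Return the 0-based start positions where the motif matches."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if all(base in IUPAC[code] for base, code in zip(window, motif)):
            hits.append(start)
    return hits


# Made-up sequence containing two matches to the consensus pattern.
hits = find_motif("GGTATAAAAGGCTATATATGC", "TATAWAW")
print(hits)  # [2, 12]
```

A learned model can capture softer preferences than an exact pattern like this, which is one reason motifs recovered from a trained network are useful for explaining its predictions.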
“As demonstrated by its performance across multiple, diverse datasets, our multimodal foundational model exhibits versatility, robustness, and efficacy,” Bhattarai said. “This model signifies a substantial advancement in computational genomics, providing a sophisticated tool for analyzing complex biological mechanisms.”
Reference: “DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors” by Anowarul Kabir, Manish Bhattarai, Selma Peterson, Yonatan Najman-Licht, Kim Ø Rasmussen, Amarda Shehu, Alan R Bishop, Boian Alexandrov and Anny Usheva, 13 September 2024, Nucleic Acids Research.
DOI: 10.1093/nar/gkae783
The work was supported by the National Institutes of Health and the National Science Foundation.