Cracking the Code of Life: New AI Model Learns DNA's Hidden Language
Published: 28 Sep. 2024    Source: Technische Universität Dresden
DNA contains the foundational information needed to sustain life. Understanding how this information is stored and organized has been one of the greatest scientific challenges of the last century. With GROVER, a new large language model trained on human DNA, researchers can now attempt to decode the complex information hidden in our genome. Developed by a team at the Biotechnology Center (BIOTEC) of Dresden University of Technology, GROVER treats human DNA as a text, learning its rules and context to extract functional information from DNA sequences.
 
Large language models, like GPT, have transformed our understanding of language. Trained exclusively on text, these models developed the ability to use language in many contexts. The Poetsch team applied the same approach to DNA, training a large language model on a reference human genome. The resulting tool, named GROVER, or "Genome Rules Obtained via Extracted Representations," learned the rules of DNA and can be used to extract biological meaning from it. The team showed that GROVER can not only accurately predict the DNA sequence that follows a given stretch but can also be used to extract contextual information that has biological meaning, e.g., to identify gene promoters or protein binding sites on DNA. GROVER also learns processes that are generally considered to be "epigenetic," i.e., regulatory processes that happen on top of the DNA rather than being encoded in it.
 
To train GROVER, the team first had to create a DNA dictionary, because DNA, unlike written text, has no predefined words. To build one, they borrowed a trick from compression algorithms. "This step is crucial and sets our DNA language model apart from the previous attempts," says Dr. Poetsch. GROVER promises to unlock the different layers of the genetic code. DNA holds key information on what makes us human, our disease predispositions, and our responses to treatments. "We believe that understanding the rules of DNA through a language model is going to help us uncover the depths of biological meaning hidden in the DNA, advancing both genomics and personalized medicine," says Dr. Poetsch.
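
For readers curious how a "DNA dictionary" might be built with a trick from compression, here is a minimal sketch. The article does not name the algorithm, so the sketch assumes byte-pair encoding, a widely used compression-derived tokenization scheme that repeatedly merges the most frequent pair of adjacent tokens into a new "word". The function names, the toy sequence, and the number of merges are all hypothetical illustrations and are not taken from GROVER itself.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common pair of adjacent tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_dna_vocabulary(sequence, num_merges):
    """Grow a vocabulary from single letters by repeated pair merging.

    This is an assumed, byte-pair-encoding-style procedure, not the
    published GROVER pipeline.
    """
    tokens = list(sequence)            # start from single nucleotides
    vocabulary = sorted(set(tokens))   # e.g. ["A", "C", "G", "T"]
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:               # nothing left to merge
            break
        tokens = merge_pair(tokens, pair)
        vocabulary.append(pair[0] + pair[1])
    return vocabulary, tokens


# Hypothetical toy sequence; a real run would use a whole reference genome.
toy_sequence = "ATATATGCGCATAT"
vocabulary, tokens = build_dna_vocabulary(toy_sequence, num_merges=3)
print(vocabulary)
print(tokens)
```

Each merge adds one multi-letter "word" to the vocabulary, so the number of merges sets the dictionary size. Joining the resulting tokens always gives back the original sequence, which is why a scheme like this can serve as a lossless way of splitting DNA into words.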