Can Language Models Read the Genome? This One Decoded mRNA to Make Better Vaccines
Published: 11 Jun. 2024 Source: Princeton University, Engineering School
Artificial intelligence (AI), known for coding software and passing the bar exam, has now learned to read the genetic code. This code, which contains the instructions for all of life's functions, follows rules similar to those of human languages: each genomic sequence adheres to a complex grammar and syntax that shapes its encoded meaning. Small variations in these sequences can significantly affect biological outcomes, just as altering words in a sentence can change its impact.
A team of researchers led by machine learning expert Mengdi Wang at Princeton University is employing language models to analyze partial genome sequences and optimize them for biological research and medical advancements. Their work, detailed in a paper published in Nature Machine Intelligence, showcases a language model that used semantic representation to design a more effective mRNA vaccine of the kind used against COVID-19. The researchers concentrated on the untranslated region of mRNA, which controls how efficiently proteins are produced, a key factor in how well mRNA vaccines work. After training the model on a limited range of species, they generated hundreds of optimized sequences and validated them experimentally. The top-performing sequences exhibited a 33% increase in protein production efficiency, underscoring the potential benefits for emerging therapeutics beyond COVID-19.
Professor Wang's team, collaborating with researchers from RVAC Medicines and the Stanford University School of Medicine, developed a language model that differs in scale, rather than in principle, from those powering AI chatbots. Trained on a smaller dataset of a few hundred thousand mRNA sequences and incorporating additional knowledge about protein production, the model created a library of 211 novel sequences optimized for enhanced translation efficiency. It is the first language model to concentrate on the untranslated region of mRNA, and it both improved overall efficiency and predicted sequence performance across various tasks. Despite the challenges of compiling a comprehensive, multifaceted dataset from disparate sources, the model's success suggests new avenues for exploring gene regulation, a fundamental aspect of life's functioning that is linked to the origins of disease.