Kiin Bio Weekly
Kiin Bio Weekly Podcast - AI Updates for Life Sciences
NatureLM: Deciphering the Language of Nature for Scientific Discovery

Microsoft Research AI for Science has introduced NatureLM (Nature Language Model), a unified foundation model designed to advance scientific discovery by integrating multiple domains through sequence-based representations. NatureLM processes diverse scientific entities—including small molecules, proteins, DNA, RNA, materials, and text—as sequences, enabling cross-domain applications and generative tasks.

Key features of NatureLM include:

Cross-domain integration:

  • Unifies biology, chemistry, and materials science by representing entities (e.g., SMILES for molecules, FASTA for proteins) as sequences, termed the "language of nature"

  • Enables cross-domain tasks like protein-to-molecule design, RNA generation guided by DNA templates, and material composition optimization
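As a rough illustration of the sequence-based representation described above, the sketch below wraps entities from different domains in domain tags and joins them with a text instruction into one flat string. The tag names (`<mol>`, `<protein>`, and so on), the `Entity` class, and the `build_prompt` helper are hypothetical and chosen for illustration only; this summary does not specify how NatureLM actually tokenizes its inputs.

```python
from dataclasses import dataclass

# Hypothetical domain tags (an assumption for this sketch, not NatureLM's vocabulary).
DOMAIN_TAGS = {
    "molecule": "mol",       # e.g. SMILES strings
    "protein": "protein",    # e.g. amino-acid sequences from FASTA records
    "dna": "dna",
    "rna": "rna",
    "material": "material",  # e.g. composition strings
}


@dataclass(frozen=True)
class Entity:
    """A scientific entity stored as a plain sequence plus its domain."""
    domain: str
    sequence: str


def serialize(entity: Entity) -> str:
    """Wrap the raw sequence in its (hypothetical) domain tag."""
    if entity.domain not in DOMAIN_TAGS:
        raise ValueError(f"unknown domain: {entity.domain}")
    tag = DOMAIN_TAGS[entity.domain]
    return f"<{tag}>{entity.sequence}</{tag}>"


def build_prompt(instruction: str, *entities: Entity) -> str:
    """Join a text instruction and tagged entities into one input string."""
    parts = [instruction.strip(), *(serialize(e) for e in entities)]
    return " ".join(parts)


aspirin = Entity("molecule", "CC(=O)OC1=CC=CC=C1C(=O)O")  # SMILES for aspirin
peptide = Entity("protein", "MKTAYIAKQR")                 # toy amino-acid sequence

# A protein-to-molecule style request expressed as a single sequence.
prompt = build_prompt("Design a small molecule that binds this protein:", peptide)
print(prompt)
```

Because every domain ends up in the same flat string format, one sequence model can in principle take a protein as input and emit a molecule as output without a separate architecture per domain.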

Generative capabilities:

  • Generates and optimizes scientific entities (e.g., drug candidates, stable proteins, CRISPR guide RNAs) using text instructions

  • Achieves state-of-the-art performance in tasks such as retrosynthesis prediction (71.9% top-1 accuracy) and SMILES-IUPAC translation, outperforming specialist models like STOUT and general LLMs like GPT-4
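For readers unfamiliar with the metric, top-1 accuracy in retrosynthesis is the share of test reactions for which the model's highest-ranked prediction matches the reference reactants. The sketch below is a minimal version on made-up data; the function name `top_k_accuracy` is hypothetical, and it assumes exact string matching, whereas real evaluations typically canonicalize SMILES before comparing.

```python
def top_k_accuracy(predictions, references, k=1):
    """Fraction of examples whose reference appears in the top-k ranked predictions.

    predictions: list of ranked candidate lists (best candidate first)
    references:  list of ground-truth strings, one per example
    Assumption: strings are compared exactly, with no SMILES canonicalization.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(ref in ranked[:k] for ranked, ref in zip(predictions, references))
    return hits / len(references)


# Toy data, invented for illustration (not taken from the NatureLM evaluation).
predictions = [
    ["CCO.CC(=O)O", "CCO.CC(=O)Cl"],
    ["c1ccccc1Br.OB(O)c1ccccc1", "c1ccccc1I.OB(O)c1ccccc1"],
    ["CN.CC(=O)Cl", "CN.CC(=O)O"],
    ["CCBr.CO", "CCI.CO"],
]
references = [
    "CCO.CC(=O)O",              # matches the rank-1 prediction
    "c1ccccc1I.OB(O)c1ccccc1",  # matches only the rank-2 prediction
    "CN.CC(=O)Cl",              # matches the rank-1 prediction
    "CCCl.CO",                  # not predicted at all
]

top1 = top_k_accuracy(predictions, references, k=1)  # 2 of 4 correct
top2 = top_k_accuracy(predictions, references, k=2)  # 3 of 4 correct
print(f"top-1: {top1:.1%}, top-2: {top2:.1%}")
```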

Scalability:

  • Trained in three sizes (1B, 8B, 46.7B parameters), with larger models showing clear performance gains; the 46.7B MoE model excels in 19/22 evaluated tasks

Training and adaptability:

  • Pre-trained on 143B tokens of scientific data (10% text, 90% domain-specific sequences) and fine-tuned with 5.1M instruction-response pairs

  • Incorporates reinforcement learning and task-specific fine-tuning (e.g., retrosynthesis, material property prediction) to enhance performance
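To put the pre-training mix in absolute terms, the short calculation below converts the stated 10%/90% split of 143B tokens into token counts. It assumes the percentages are exact, although they are likely rounded figures, so the results should be read as approximate.

```python
TOTAL_TOKENS = 143_000_000_000  # 143B pre-training tokens, as stated above
TEXT_PERCENT = 10               # share of natural-language text (assumed exact)

# Integer arithmetic avoids floating-point rounding in the split.
text_tokens = TOTAL_TOKENS * TEXT_PERCENT // 100
sequence_tokens = TOTAL_TOKENS - text_tokens

print(f"text: {text_tokens / 1e9:.1f}B tokens")
print(f"domain-specific sequences: {sequence_tokens / 1e9:.1f}B tokens")
```

Under that assumption, roughly 14.3B tokens are text and roughly 128.7B are domain-specific sequences.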

Applications:

  • Drug discovery: Hit generation, ADMET optimization, synthesis route prediction

  • Protein engineering: Designing antigen-binding antibodies, heme-binding proteins

  • Material design: Generating novel compositions (e.g., ultra-high bulk modulus materials validated by DFT)

  • CRISPR systems: Guide RNA design with >95% validity

Limitations and future work:

  • Language capabilities lag behind general-purpose LLMs (31.8% win rate on AlpacaEval vs. Mixtral)

  • Plans to improve few-shot learning and integrate 3D structural data for enhanced accuracy

NatureLM marks a move toward generalist AI for scientific research, using unified sequence modelling to support discovery across biology, chemistry, and materials science.
