Microsoft Research AI for Science has introduced NatureLM (Nature Language Model), a unified foundation model designed to advance scientific discovery by integrating multiple domains through sequence-based representations. NatureLM processes diverse scientific entities—including small molecules, proteins, DNA, RNA, materials, and text—as sequences, enabling cross-domain applications and generative tasks.
Key features of NatureLM include:
Cross-domain integration:
Unifies biology, chemistry, and materials science by representing entities (e.g., SMILES for molecules, FASTA for proteins) as sequences, termed the "language of nature"
Enables cross-domain tasks like protein-to-molecule design, RNA generation guided by DNA templates, and material composition optimization
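As a rough illustration of the sequence-based representation described above, the following Python sketch wraps entities from different domains in tags and combines them with a text instruction. The tag names, the `wrap_entity` and `build_prompt` helpers, and the example sequences are assumptions made for illustration; they are not taken from NatureLM's actual tokenizer or prompt format.

```python
# Hypothetical tag names: the summary only states that entities are
# represented as sequences, not how they are delimited.
DOMAIN_TAGS = {
    "molecule": "mol",      # e.g. a SMILES string
    "protein": "protein",   # e.g. a FASTA amino-acid sequence
    "dna": "dna",
    "rna": "rna",
    "material": "material",
}


def wrap_entity(domain: str, sequence: str) -> str:
    """Wrap a raw sequence in start/end tags for its domain."""
    tag = DOMAIN_TAGS[domain]
    return f"<{tag}>{sequence}</{tag}>"


def build_prompt(instruction: str, domain: str, sequence: str) -> str:
    """Combine a text instruction with one tagged entity."""
    return f"{instruction} {wrap_entity(domain, sequence)}"


# A protein-to-molecule style request with a toy (made-up) protein fragment.
prompt = build_prompt(
    "Design a small molecule that binds to this protein:",
    "protein",
    "MKTAYIAKQR",
)
print(prompt)
```

Because every entity ends up as plain text, a single sequence model can in principle read a protein and emit a molecule in the same pass, which is what makes the cross-domain tasks listed above expressible as ordinary text generation.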
Generative capabilities:
Generates and optimizes scientific entities (e.g., drug candidates, stable proteins, CRISPR guide RNAs) using text instructions
Achieves state-of-the-art performance in tasks such as retrosynthesis prediction (71.9% top-1 accuracy) and SMILES-IUPAC translation, outperforming specialist models like STOUT and general LLMs like GPT-4
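Top-1 accuracy, the metric quoted for retrosynthesis above, counts a prediction as correct only when the highest-ranked candidate matches the reference. The sketch below is a minimal, generic implementation on toy data. The `top_k_accuracy` helper and the example SMILES strings are illustrative assumptions, and matching is by exact string comparison; real evaluations usually canonicalize SMILES first, which is omitted here.

```python
def top_k_accuracy(predictions, references, k=1):
    """Fraction of examples whose reference appears in the top-k candidates.

    predictions: list of ranked candidate lists (best candidate first)
    references:  list of ground-truth strings, one per example
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(
        1
        for ranked, truth in zip(predictions, references)
        if truth in ranked[:k]
    )
    return hits / len(references)


# Toy data (made up): ranked reactant predictions for three products.
predictions = [
    ["CCO.CC(=O)O", "CCO"],
    ["c1ccccc1Br", "c1ccccc1"],
    ["CC(=O)Cl", "CCN"],
]
references = ["CCO.CC(=O)O", "c1ccccc1", "CCN"]

top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

In this toy set only the first example is correct at rank 1, while all three references appear within the top two candidates, which shows why top-1 is the stricter figure.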
Scalability:
Trained in three sizes (1B, 8B, and 46.7B parameters), with larger models showing clear performance gains; the 46.7B mixture-of-experts (MoE) model excels in 19 of 22 evaluated tasks
Training and adaptability:
Pre-trained on 143B tokens of scientific data (10% text, 90% domain-specific sequences) and fine-tuned with 5.1M instruction-response pairs
Incorporates reinforcement learning and task-specific fine-tuning (e.g., retrosynthesis, material property prediction) to enhance performance
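Taking the rounded figures above at face value, the pre-training mix can be worked out directly. The short sketch below does only that arithmetic; the variable names are illustrative, and the exact token counts in the underlying work may differ from these rounded values.

```python
# Figures as stated in the summary (rounded).
TOTAL_TOKENS = 143_000_000_000   # 143B pre-training tokens
TEXT_PERCENT = 10                # share of general text, in percent

# Integer arithmetic avoids floating-point rounding noise.
text_tokens = TOTAL_TOKENS * TEXT_PERCENT // 100
domain_tokens = TOTAL_TOKENS - text_tokens

print(f"text tokens:   {text_tokens / 1e9:.1f}B")
print(f"domain tokens: {domain_tokens / 1e9:.1f}B")
```

Under these assumptions, roughly 14.3B tokens are text and about 128.7B are domain-specific sequences.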
Applications:
Drug discovery: Hit generation, ADMET optimization, synthesis route prediction
Protein engineering: Designing antigen-binding antibodies, heme-binding proteins
Material design: Generating novel compositions (e.g., ultra-high bulk modulus materials validated by DFT)
CRISPR systems: Guide RNA design with >95% validity
Limitations and future work:
Language capabilities lag behind general-purpose LLMs (31.8% win rate on AlpacaEval vs. Mixtral)
Plans to improve few-shot learning and integrate 3D structural data for enhanced accuracy
NatureLM represents a shift toward generalist AI for scientific research, enabling interdisciplinary discovery through unified sequence modeling.