Kiin Bio Weekly
Kiin Bio Weekly Podcast - AI Updates for Life Sciences
Evo 2: Genome-Scale DNA Language Modeling

A collaboration between the Arc Institute, NVIDIA, and academic partners, Evo 2 is a DNA language model trained on 8.8 trillion nucleotides from diverse organisms and capable of processing DNA sequences up to 1 million base pairs in length [1]. That context window is orders of magnitude longer than those of earlier models such as DNABERT.

Key Innovations

  1. Long-Context Modeling: Predicts exon-intron boundaries, transcription factor binding sites, and epigenetic modifications across megabase-scale genomic regions.

  2. Variant Effect Scoring: Quantifies the functional impact of single-nucleotide polymorphisms (SNPs) by comparing reference and altered sequence likelihoods, achieving AUC-ROC scores above 0.85 in validation studies [1].

  3. Conditional Sequence Generation: Designs synthetic DNA constructs with user-defined promoters, enhancers, and coding regions, validated in yeast and human cell lines.
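
The likelihood comparison behind variant effect scoring (item 2) can be illustrated with a small sketch. This is not the Evo 2 API: `ToyDNAModel` is a hypothetical stand-in (a simple order-k Markov model with add-one smoothing), and `apply_snp` and `variant_score` are illustrative names. Only the scoring step, the log-likelihood of the altered sequence minus that of the reference, reflects the idea described above.

```python
import math
from collections import defaultdict


class ToyDNAModel:
    """Order-k Markov model over A/C/G/T.

    Hypothetical stand-in for a real DNA language model; it only needs
    to expose a log_likelihood(sequence) method.
    """

    def __init__(self, k=2, alphabet="ACGT", pseudocount=1.0):
        self.k = k
        self.alphabet = alphabet
        self.pseudocount = pseudocount
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        # Count how often each base follows each k-base context.
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq):
        # Sum of smoothed log-probabilities of each base given its context.
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx_counts = self.counts.get(seq[i - self.k:i], {})
            num = ctx_counts.get(seq[i], 0) + self.pseudocount
            den = sum(ctx_counts.values()) + self.pseudocount * len(self.alphabet)
            total += math.log(num / den)
        return total


def apply_snp(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    if seq[pos] == alt:
        raise ValueError("alt allele equals the reference base")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_score(model, ref_seq, pos, alt):
    """Log-likelihood of the altered sequence minus that of the reference.

    Negative scores mean the model finds the variant less plausible
    than the reference sequence.
    """
    return model.log_likelihood(apply_snp(ref_seq, pos, alt)) - model.log_likelihood(ref_seq)


# Toy usage: the model is fitted on a repetitive pattern, so a SNP that
# breaks the pattern receives a negative score.
model = ToyDNAModel(k=2).fit(["ACGT" * 50])
reference = "ACGTACGTACGT"
print(variant_score(model, reference, pos=5, alt="T"))
```

In practice the toy model would be replaced by a trained sequence model, and scores would be compared against labeled benign and pathogenic variants to obtain a metric such as AUC-ROC.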

Importance

Evo 2’s ability to analyze non-coding regions and structural variants fills a critical gap in target identification. For instance, it can prioritize regulatory elements driving oncogene expression in cancer or predict off-target effects of CRISPR edits. The model’s scalability also makes whole-genome analyses practical where earlier work was restricted to focused loci [1].
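
One way to picture a whole-genome analysis with a model limited to 1 million base pairs of context is to tile each chromosome into overlapping windows and process them one at a time. The sketch below is an assumption about how such tiling might be done, not a description of the Evo 2 pipeline; the function name `tile_windows` and the 100,000 bp overlap are illustrative choices.

```python
from typing import Iterator, Tuple


def tile_windows(seq_len: int,
                 window: int = 1_000_000,
                 overlap: int = 100_000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates covering [0, seq_len).

    Consecutive windows share `overlap` bases so that features near a
    window edge are seen with context on both sides in at least one window.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        yield start, end
        if end == seq_len:
            break  # the last window reached the end of the sequence
        start += step


# Toy usage: a 2.5 Mb sequence is covered by three overlapping windows.
for start, end in tile_windows(2_500_000):
    print(start, end)
```

The overlap size trades redundant computation against edge effects; a real analysis would choose it based on the length scale of the features of interest.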
