A collaboration between the Arc Institute, NVIDIA, and academic partners, Evo 2 is a genomic foundation model built on the StripedHyena 2 architecture, trained on 8.8 trillion nucleotides from diverse organisms and capable of processing DNA sequences up to 1 million base pairs in length [1]. This context window is orders of magnitude longer than that of earlier DNA language models such as DNABERT.
Key Innovations
Long-Context Modeling: Predicts exon-intron boundaries, transcription factor binding sites, and epigenetic modifications across megabase-scale genomic regions (see the probing sketch after this list).
Variant Effect Scoring: Quantifies the functional impact of single-nucleotide polymorphisms (SNPs) by comparing the likelihoods of reference and alternate-allele sequences, achieving AUC-ROC scores >0.85 in validation studies [1] (see the scoring sketch below).
Conditional Sequence Generation: Designs synthetic DNA constructs with user-defined promoters, enhancers, and coding regions, validated in yeast and human cell lines (see the generation sketch below).
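To make the long-context point concrete, the sketch below trains a simple linear probe on per-nucleotide embeddings to flag exon-intron boundary positions. It is a minimal illustration under stated assumptions: get_nucleotide_embeddings is a hypothetical stand-in for whatever routine exposes Evo 2's intermediate-layer representations, and the 0/1 boundary labels are assumed to come from an existing gene annotation such as GENCODE; this is not the published evaluation protocol.

```python
# Minimal probing sketch; the embedding helper below is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression


def get_nucleotide_embeddings(sequence: str) -> np.ndarray:
    """Hypothetical helper: return a (len(sequence), hidden_dim) matrix of
    per-nucleotide embeddings taken from an intermediate model layer."""
    raise NotImplementedError("replace with the model's embedding extraction")


def train_boundary_probe(sequences, boundary_labels):
    """Fit a logistic-regression probe that predicts, per position, whether a
    nucleotide sits on an exon-intron boundary (labels from an annotation)."""
    X = np.concatenate([get_nucleotide_embeddings(s) for s in sequences])
    y = np.concatenate(boundary_labels)
    probe = LogisticRegression(max_iter=1000, class_weight="balanced")
    probe.fit(X, y)
    return probe
```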
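The variant-scoring idea is a likelihood comparison between the reference sequence and the same window with the alternate allele substituted in. The sketch below assumes a hypothetical score_log_likelihood helper that returns the model's total log-likelihood for a DNA string; the actual Evo 2 interface may differ.

```python
# Likelihood-based SNP effect scoring: a minimal sketch, not Evo 2's exact API.


def score_log_likelihood(sequence: str) -> float:
    """Hypothetical helper: sum of per-nucleotide log-probabilities under the model."""
    raise NotImplementedError("replace with the model's sequence-scoring routine")


def snp_effect_score(ref_window: str, position: int, alt_base: str) -> float:
    """Delta log-likelihood from substituting alt_base at `position` in a
    reference window centered on the variant. Strongly negative values suggest
    the alternate allele is disruptive."""
    alt_window = ref_window[:position] + alt_base + ref_window[position + 1:]
    return score_log_likelihood(alt_window) - score_log_likelihood(ref_window)
```

Ranking SNPs by this delta and comparing the ranking against known pathogenicity labels is the kind of evaluation behind the AUC-ROC figures cited above.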
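Conditional generation can likewise be pictured as prompt completion: concatenate the user-defined regulatory elements, then let the model extend the sequence. The generate wrapper below is hypothetical, and the element sequences a user would pass in are placeholders, not validated parts.

```python
# Prompt-conditioned construct design: a minimal sketch assuming a hypothetical
# autoregressive generation wrapper.


def generate(prompt: str, n_tokens: int, temperature: float = 0.7) -> str:
    """Hypothetical helper: sample n_tokens nucleotides that continue `prompt`."""
    raise NotImplementedError("replace with the model's generation API")


def design_construct(enhancer: str, promoter: str, cds_start: str,
                     target_length: int) -> str:
    """Assemble user-specified elements into a prompt and let the model
    complete the construct up to target_length nucleotides."""
    prompt = enhancer + promoter + cds_start
    completion = generate(prompt, n_tokens=target_length - len(prompt))
    return prompt + completion
```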
Importance
Evo 2’s ability to analyze non-coding regions and structural variants fills a critical gap in target identification. For instance, it can prioritize regulatory elements driving oncogene expression in cancer or predict off-target effects of CRISPR edits. The model’s scalability also facilitates whole-genome analyses that were previously restricted to focused loci [1].