BacFormer is a transformer-based foundation model trained on more than 9 million bacterial genomes. It captures the evolutionary, structural, and functional landscape of microbial life. By representing entire genomes as tokenised sequences, BacFormer enables a new framework for zero-shot prediction of antibiotic resistance, pathogenicity, host adaptation, and more.
Key Innovations and Capabilities:
1. Sequence-to-Phenotype Prediction
Predict: Encode full bacterial genomes into compact, learned representations for downstream tasks such as host specificity prediction, resistance gene classification, and horizontal gene transfer detection.
Explain: Visualise important regions and motifs using attention weights and saliency maps. Capture meaningful evolutionary signals across phyla and gene clusters.
Adapt: Use pretrained embeddings for zero-shot or few-shot learning in new species, minimising the need for task-specific retraining.
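To make the "Adapt" idea concrete, here is a minimal sketch of few-shot classification on top of frozen embeddings, using a nearest-centroid rule. It is an illustration under stated assumptions, not BacFormer's actual method: the three-dimensional vectors, the labels, and the function names (fit_centroids, predict) are all made up for the example, and real genome embeddings would come from the pretrained model.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Average the few labelled embeddings available for each class."""
    by_label = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        by_label[label].append(vector)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(centroids, query):
    """Assign the query to the class whose centroid is most similar."""
    return max(centroids, key=lambda label: cosine(centroids[label], query))


# Toy "genome embeddings" (hypothetical values, for illustration only).
support = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
support_labels = ["resistant", "resistant", "susceptible", "susceptible"]

centroids = fit_centroids(support, support_labels)
prediction = predict(centroids, [0.85, 0.15, 0.05])
print(prediction)
```

The backbone stays frozen here: only the per-class centroids are computed from the handful of labelled examples, which is why no task-specific retraining is needed.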
2. Genome-Level Tokenisation and Modelling
A custom tokeniser, trained on over 200 million open reading frames (ORFs), compresses each genome into 4096 discrete tokens.
BERT-style masked language model trained across 9.1 million genomes from 113 phyla. This is the largest bacterial training corpus used to date.
Embeddings are phylogenetically aware. Related taxa cluster naturally without requiring labels or supervision.
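The masked language modelling objective mentioned above can be sketched as follows. This is a simplified illustration, not the BacFormer training code: the 15 percent masking rate is the value commonly associated with BERT and is assumed here, and MASK_ID, mask_tokens, and the stand-in token ids are hypothetical names and values chosen for the example.

```python
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens and remember what was hidden.

    Returns (inputs, targets). targets[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be
    computed where the model has to reconstruct a hidden token.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            targets.append(token)
        else:
            inputs.append(token)
            targets.append(None)
    return inputs, targets


# Stand-in for one genome encoded as 4096 token ids (ids 1..4096).
genome_tokens = list(range(1, 4097))
inputs, targets = mask_tokens(genome_tokens, seed=0)

num_masked = sum(t is not None for t in targets)
print(f"masked {num_masked} of {len(genome_tokens)} tokens")
```

Because the model only ever sees the surrounding tokens when predicting a hidden one, this kind of objective needs no labels, which is consistent with the claim that related taxa cluster without supervision.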
3. Multimodal Functional Learning
BacFormer supports classification, retrieval, clustering, and interpretability using a single frozen backbone.
Outperforms baseline k-mer and gene-based models on antimicrobial resistance detection, pathogenicity classification, and plasmid prediction.
Accepts diverse input types, including full genomes, genome fragments, and synthetic constructs.
4. Applications in Microbial Genomics
Antibiotic Resistance: Predicts AMR gene classes with over 93 percent accuracy. Maintains high performance on unseen resistance families.
Pathogenicity and Host Specificity: Identifies zoonotic versus human-restricted strains with more than 90 percent accuracy across multiple genera.
Genome Retrieval and Clustering: Embeddings group genomes and plasmids by evolutionary and functional similarity, enabling fast search and comparative analysis.
Evolutionary Insight: Detects signatures of horizontal gene transfer and convergent evolution using attention maps and learned representations.
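Embedding-based retrieval of the kind described under "Genome Retrieval and Clustering" usually comes down to ranking stored vectors by cosine similarity to a query. The sketch below shows that ranking step only. The index contents, genome names, and two-dimensional vectors are invented for illustration, and a production system would likely use an approximate nearest-neighbour library rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=5):
    """Return the names of the k entries most similar to the query."""
    scored = [(cosine(query, vector), name) for name, vector in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical index of precomputed embeddings (toy 2-D values).
index = {
    "genome_A": [1.0, 0.0],
    "genome_B": [0.9, 0.1],
    "genome_C": [0.0, 1.0],
    "plasmid_D": [0.5, 0.5],
}

neighbours = top_k([0.95, 0.02], index, k=3)
print(neighbours)
```

Since the embeddings are computed once and stored, each search only costs one similarity computation per indexed genome, which is what makes fast comparative analysis plausible at this scale.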
5. Performance and Validation
Training Scale:
Trained on 9.1 million bacterial genomes and roughly 200 million ORFs from proGenomes and GTDB.
Each genome is tokenised into 4096 learned tokens using a BPE-style tokeniser designed for DNA.
Embeddings encode taxonomic, ecological, and functional features without using any metadata during training.
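For readers unfamiliar with byte-pair encoding (BPE), the core idea behind a BPE-style tokeniser is to repeatedly merge the most frequent pair of adjacent symbols into a new, longer token. The sketch below shows that loop on a short DNA string. It is a generic textbook version under stated assumptions, not BacFormer's tokeniser: the function names, the toy sequence, and the number of merges are chosen for illustration, and the real vocabulary and any fixed-length compression step are not described in this post.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequence, num_merges):
    """Start from single nucleotides and apply `num_merges` greedy merges."""
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


tokens, merges = learn_bpe("ATGATGATGCCC", num_merges=2)
print(tokens)  # ['ATG', 'ATG', 'ATG', 'C', 'C', 'C']
```

After two merges the repeated start codon ATG has become a single token, which shows how frequent motifs end up with their own vocabulary entries and how the token count shrinks relative to the raw sequence.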
Benchmark Metrics:
AMR Gene Classification: Achieves over 93 percent accuracy, exceeding alignment- and gene-centric baselines.
Pathogen Detection: Delivers more than 90 percent accuracy on pathogen versus non-pathogen binary classification tasks.
Plasmid Identification: Scores over 0.95 F1 using embeddings alone to classify plasmid versus chromosomal origin.
Genome Retrieval: Top-5 retrieval by cosine similarity achieves over 95 percent precision.
Taxonomic Generalisation: Recovers genus-level NCBI taxonomy structure from embeddings, even though taxonomic labels were never used during training.
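Two of the metrics above, top-5 retrieval precision and F1, are easy to misread, so here is a minimal sketch of how they are conventionally computed. The exact evaluation protocol behind the reported figures is not given in this post, so the definitions below are the standard ones and are assumptions in that sense. The labels and retrieved lists are toy data, not BacFormer results.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    top = retrieved[:k]
    return sum(1 for item in top if item in relevant) / k


def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy retrieval result: 4 of the top 5 hits share the query's lineage.
p_at_5 = precision_at_k(["g1", "g2", "g3", "g4", "g5"], {"g1", "g2", "g3", "g5"})

# Toy plasmid-versus-chromosome predictions.
y_true = ["plasmid", "plasmid", "chromosome", "chromosome", "plasmid"]
y_pred = ["plasmid", "chromosome", "chromosome", "plasmid", "plasmid"]
f1 = f1_score(y_true, y_pred, positive="plasmid")

print(p_at_5, round(f1, 3))
```

Accuracy, precision, and F1 can diverge sharply on imbalanced data, so it helps to know which one a given headline number refers to before comparing models.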
6. Limitations and Future Work
Current focus: Optimised for bacterial genomes and gene-centric tasks, especially from short-read assemblies.
Challenges: Limited resolution on mobile genetic elements, prophages, and rare accessory genes. Performance may drop on low-coverage or highly fragmented assemblies.
Roadmap: Extend to long-read assemblies, metagenomic binning, and phenotype-linked training using MIC data or growth assays for direct genotype-to-phenotype mapping.
Why It Matters
Traditional bacterial genomics relies on static gene annotations and manually curated databases. These methods are slow to update and often fail to capture the diversity found in real-world microbial data. BacFormer replaces these brittle pipelines with a single, scalable model that learns directly from genome sequences. It predicts phenotypes, detects resistance, tracks host adaptation, and reveals evolutionary patterns without requiring predefined rules or expert input. This unlocks real-time analysis of microbial threats, accelerates pathogen surveillance, and offers a new foundation for understanding how bacteria evolve and spread across environments.