In this issue:
Welcome back to your weekly dose of AI news for Life Science!
This week, we have some exciting new developments lined up for you:
X-Atlas/Orion: A Massive New Dataset to Power AI-Driven Biology 🧬
PerTurboAgent: A Self-Planning AI for Smarter Genetic Experiments 🧫
BaseData: Smashing Biology's Data Wall to Unlock AI's Potential 🌳
Dive into these game-changing innovations and explore how they are transforming the biotech and healthcare landscapes!
X-Atlas/Orion: A Massive New Dataset to Power AI-Driven Biology 🧬
The development of "foundation models" for biology holds the potential to create virtual cells that could dramatically speed up scientific discovery. However, these powerful AI models need vast amounts of high-quality data from experiments that directly test the effects of genetic changes. Generating this data at a massive scale has been a major bottleneck due to logistical challenges and batch-to-batch variability. To solve this, researchers developed "Fix-Cryopreserve-ScRNAseq" (FICS) Perturb-seq, an industrialized platform that uses chemical fixation and cryopreservation to make data generation scalable and consistent. Using this new platform, they created and released X-Atlas/Orion, the largest publicly available Perturb-seq atlas, containing eight million cells from experiments targeting every protein-coding gene in the human genome.
🔨Applications
Advance the development of biological foundation models by providing a massive, high-quality, and deeply sequenced dataset designed for training.
Enable the study of dose-dependent genetic effects by using sgRNA abundance as a built-in proxy for the strength of gene knockdown.
Improve the consistency and logistical feasibility of future genome-scale screens by using a platform that reduces batch effects and decouples cell harvesting from library preparation through long-term cryopreservation.
📌 Key Insights
Scalable platform overcomes key bottlenecks: The FICS Perturb-seq platform integrates chemical fixation, cell sorting, cryopreservation, superloading of microfluidics, and automation to produce highly consistent data. This approach yielded significantly greater batch-to-batch correlation compared to previous large-scale datasets (median Spearman correlation of 0.993 in HCT116 cells vs. 0.967 in the Replogle et al. dataset).
sgRNA abundance stratifies perturbation strength: The study demonstrates that the level of single guide RNA (sgRNA) transcripts directly correlates with gene knockdown efficiency (R = 0.910 in HCT116 cells). This allows researchers to separate cells with strong perturbations from those with weak ones, enabling a more refined analysis of dose-dependent responses.
Largest-ever Perturb-seq atlas released: The work introduces X-Atlas/Orion, a dataset of eight million cells targeting all 18,903 human protein-coding genes. The data is exceptionally deep, with a median of over 16,000 UMIs per cell, which is 1.68 times deeper than the prior benchmark Replogle K562 dataset.
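For readers who want to see the sgRNA-abundance idea in action, here is a minimal sketch on synthetic data. It is not the authors' pipeline: the field names (sgrna_umis, target_expr), the linear knockdown model, and the three-bin split are all assumptions made for illustration. It sorts cells by sgRNA UMI count, splits them into equal-size bins, and compares mean knockdown per bin.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def stratify_by_sgrna(cells, n_bins=3):
    """Split cells into n_bins groups by ascending sgRNA UMI count."""
    ranked = sorted(cells, key=lambda c: c["sgrna_umis"])
    size = len(ranked) // n_bins
    bins = [ranked[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ranked[(n_bins - 1) * size:])  # last bin takes any remainder
    return bins


# Synthetic toy cells (assumption: knockdown strengthens with sgRNA abundance).
random.seed(0)
cells = []
for _ in range(300):
    sgrna = random.randint(1, 50)
    remaining = max(0.0, 1.0 - 0.015 * sgrna + random.gauss(0, 0.05))
    cells.append({"sgrna_umis": sgrna, "target_expr": remaining})

bins = stratify_by_sgrna(cells)
# Knockdown = 1 - remaining target expression, averaged within each bin.
knockdown_per_bin = [1.0 - mean(c["target_expr"] for c in b) for b in bins]
r = pearson(
    [c["sgrna_umis"] for c in cells],
    [1.0 - c["target_expr"] for c in cells],
)

for label, kd in zip(["low", "mid", "high"], knockdown_per_bin):
    print(f"{label} sgRNA bin: mean knockdown {kd:.2f}")
print(f"correlation between sgRNA UMIs and knockdown: {r:.3f}")
```

On real data, the bins would be used to compare downstream expression changes between weakly and strongly perturbed cells for the same target gene.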
PerTurboAgent: A Self-Planning AI for Smarter Genetic Experiments 🧫
Pooled genetic screening methods like Perturb-seq are powerful, but testing the vast number of possible gene perturbations is experimentally impossible. While iterative experimental designs can maximize knowledge from limited resources, planning these cycles is a complex, time-consuming task requiring multiple skills. Researchers have now developed PerTurboAgent, an AI agent that automates the design of these sequential experiments by using a large language model to analyze data, retrieve knowledge, and plan the next best steps. The agent significantly boosts the efficiency of identifying impactful genes, outperforming existing strategies in simulated large-scale experiments.
🔨Applications
Prioritize gene perturbations strategically to maximize the discovery of genes that regulate a specific cellular phenotype.
Enhance the efficiency of large-scale screening campaigns, enabling researchers to gain more knowledge from a fixed experimental budget.
Accelerate the discovery of causal gene regulatory networks and potential new drug targets by systematically identifying the most informative experiments to run.
📌 Key Insights
Superior Performance in Identifying Hits: PerTurboAgent significantly outperforms other methods, achieving an average hit ratio of 0.440 across 11 phenotypes, compared to 0.255 for the next-best non-agent method (Enrichment+Random) and 0.240 for the existing BioDiscoveryAgent.
Adaptive, Multi-Step Reasoning: The agent dynamically adjusts its strategy, increasingly relying on a machine learning model as more training data accumulates in later experimental rounds. Its internal logs show a clear, interpretable reasoning process, such as reflecting on complex GSEA results before making a decision, which mimics an expert workflow.
Explicit Reasoning is Crucial: An ablation study that removed the agent's ability to perform self-reflection ("Thinking") caused a notable drop in performance, demonstrating that this explicit reasoning capability is key to its success and not just for show.
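To make the idea of sequential experiment design more concrete, here is a toy simulation. It does not reproduce PerTurboAgent or its benchmarks. The gene modules, the hit probabilities, and the simple enrichment-style selection rule are hypothetical, and the hit ratio is assumed here to mean the fraction of all true hit genes found among the genes perturbed so far.

```python
import random


def hit_ratio(tested, true_hits):
    """Fraction of all true hit genes found among the genes tested so far."""
    return len(set(tested) & true_hits) / len(true_hits)


def random_choice(untested, observed, module_of, batch, rng):
    """Baseline: pick the next batch uniformly at random."""
    return rng.sample(untested, batch)


def enrichment_choice(untested, observed, module_of, batch, rng):
    """Prefer genes from modules whose tested members were often hits."""
    hits, total = {}, {}
    for gene, is_hit in observed.items():
        m = module_of[gene]
        total[m] = total.get(m, 0) + 1
        hits[m] = hits.get(m, 0) + int(is_hit)

    def score(gene):
        m = module_of[gene]
        # Laplace-smoothed module hit rate, with a random tie-break.
        return ((hits.get(m, 0) + 1) / (total.get(m, 0) + 2), rng.random())

    return sorted(untested, key=score, reverse=True)[:batch]


def run_screen(genes, true_hits, module_of, choose, rounds=5, batch=20, seed=0):
    """Run several rounds; return the cumulative hit ratio after each round."""
    rng = random.Random(seed)
    observed, history = {}, []
    for _ in range(rounds):
        untested = [g for g in genes if g not in observed]
        for gene in choose(untested, observed, module_of, batch, rng):
            observed[gene] = gene in true_hits  # simulated experimental readout
        history.append(hit_ratio(observed, true_hits))
    return history


# Toy genome: 500 genes in 25 modules; hits are concentrated in three modules.
data_rng = random.Random(42)
genes = [f"G{i}" for i in range(500)]
module_of = {g: i // 20 for i, g in enumerate(genes)}
true_hits = {
    g for g in genes
    if data_rng.random() < (0.8 if module_of[g] < 3 else 0.02)
}

random_history = run_screen(genes, true_hits, module_of, random_choice)
guided_history = run_screen(genes, true_hits, module_of, enrichment_choice)
print("random  :", [round(x, 2) for x in random_history])
print("enriched:", [round(x, 2) for x in guided_history])
```

The point of the sketch is the loop structure: each round uses everything observed so far to choose the next batch. An LLM-based agent replaces the fixed selection rule with planning, knowledge retrieval, and reflection.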
BaseData: Smashing Biology's Data Wall to Unlock AI's Potential 🌳
The progress of Artificial Intelligence in biology is being held back by a "data wall": a critical shortage of high-quality, diverse biological sequence data. Existing public databases, the primary source for training AI, were built for academic experiments, not machine learning, and suffer from extreme bias, redundancy, and slow growth. To solve this, researchers have introduced BaseData, a new biological database built from the ground up on a global, scalable, and ethically partnered biodiscovery pipeline specifically for training AI foundation models. The result is a massive leap forward: as of 2024, BaseData had already expanded the known protein universe by more than ten times after accounting for redundancy.
🔨Applications
Train next-generation biological AI models on a dataset purpose-built to be larger, more diverse, and less biased than any existing public resource.
Discover novel starting points for new therapeutics and industrial solutions from millions of new species and vast, previously unobserved regions of biological sequence space.
Accelerate commercial R&D by using a fully traceable dataset where every sequence is collected with pre-approved commercial use rights and embedded benefit-sharing agreements, removing legal and regulatory ambiguity.
📌 Key Insights
A 10x Leap in Known Protein Diversity: BaseData contains 9.8 billion novel genes, which, after clustering to remove redundancy, represents a more than 10-fold expansion in known protein diversity. Its non-redundant version (BaseDataTM-50) is 31.9 times larger than UniRef50, the most common AI training dataset, and includes over 1 million species not found in other major genomic databases.
Overcoming the AI Performance Plateau: The growth of widely used public datasets like UniRef50 has slowed to less than 10% per year, creating a performance plateau where bigger AI models no longer improve. BaseData is designed to break this wall, with a growth rate of up to 2 billion novel protein sequences per month, providing the fresh, diverse data needed to scale next-generation models.
Ethical and Legally-Compliant by Design: Unlike public databases that create a legal gray area, BaseData is built on partnerships in 26 countries with fully traceable benefit-sharing agreements embedded from the start. This novel model ensures data is not only biologically valuable but also ethically and legally ready for commercial use, aligning the incentives of biodiversity providers and commercial researchers.
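What "clustering to remove redundancy" means in practice can be sketched with a toy example. The code below is a generic greedy, representative-based clustering at a 50% similarity threshold, not BaseData's or UniRef's actual pipeline. The similarity score is a stand-in from Python's difflib rather than a true alignment-based sequence identity, and the four sequences are made up.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Stand-in similarity score in [0, 1]; not an alignment-based identity."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(sequences, threshold=0.5):
    """Keep one representative per cluster of similar sequences.

    Sequences are visited longest first. Each one joins the first
    representative it matches at or above the threshold; otherwise it
    becomes a new representative.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Made-up toy protein sequences: two near-duplicate pairs.
toy_sequences = [
    "MKTAYIAKQR",
    "MKTAYIAKQK",
    "GGSPLWWNDC",
    "GGSPLWWNDE",
]

reps = greedy_cluster(toy_sequences)
print(f"{len(toy_sequences)} sequences -> {len(reps)} non-redundant representatives")
```

Counting representatives rather than raw sequences is why a database can report both a very large total and a smaller, non-redundant size.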
Did you find this newsletter insightful? Share it with a colleague!
Subscribe Now to stay at the forefront of AI in Life Science.
Connect With Us
Have questions or suggestions? We'd love to hear from you!
📧 Email Us | 📲 Follow on LinkedIn | 🌐 Visit Our Website