On February 25, 2025, the Arc Institute and Vevo Therapeutics announced the release of the Arc Virtual Cell Atlas, an open-source resource containing single-cell transcriptomics data from over 300 million cells. The initiative aims to advance AI-driven biological research by providing large-scale datasets for modeling and understanding cellular behavior.
Key Components of the Arc Virtual Cell Atlas:
Tahoe-100M by Vevo Therapeutics: This dataset encompasses 100 million cells and maps 60,000 drug-cell interactions across 50 cancer cell lines subjected to 1,200 drug perturbations. Generated using Vevo's Mosaic technology, Tahoe-100M is 50 times larger than all previously public drug-perturbed single-cell data combined. The dataset was produced with support from Parse Biosciences' GigaLab for single-cell RNA sequencing and Ultima Genomics for sequencing.
scBaseCamp by Arc Institute: This AI-curated repository comprises gene expression data from over 200 million cells spanning 21 different species. Sourced from public repositories, the data has been standardized to ensure interoperability, facilitating its use in machine learning models.
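The headline figures for Tahoe-100M fit together arithmetically, and a few lines of Python make that explicit. This is a rough sketch using the rounded numbers quoted above. The variable names are illustrative, and the per-interaction average assumes cells are spread evenly across conditions, which the announcement does not state.

```python
# Rounded figures as quoted in the text above.
cell_lines = 50
drug_perturbations = 1_200
total_cells = 100_000_000

# Each (cell line, drug) pair counts as one drug-cell interaction.
interactions = cell_lines * drug_perturbations

# Assumption: cells are spread evenly across interactions (not stated in the source).
cells_per_interaction = total_cells / interactions

print(f"{interactions:,} drug-cell interactions")
print(f"~{cells_per_interaction:,.0f} cells per interaction on average")
```

The product of cell lines and drug perturbations reproduces the 60,000 interactions cited for the dataset. The per-interaction average is only an order-of-magnitude guide.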
Need and Application:
The Arc Virtual Cell Atlas addresses the need for large-scale, high-quality single-cell data to train AI models that predict how cells respond to perturbations. Because it combines observational data (scBaseCamp) with perturbational data (Tahoe-100M), researchers can analyze natural cell states alongside those altered by drugs, improving the understanding of disease mechanisms and accelerating drug discovery. The resource has the potential to reduce years of laboratory work to computational analyses completed in minutes.
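To make the idea of comparing natural and drug-altered cell states concrete, the sketch below computes a per-gene log2 fold change between untreated and treated cells. It is a toy illustration only: the gene names and counts are synthetic, the function `log2_fold_change` and its `pseudocount` parameter are hypothetical, and nothing here reflects the Atlas's actual file formats or tooling.

```python
import math
from statistics import mean

# Synthetic toy counts, NOT data from the Atlas: gene -> per-cell expression values.
control_cells = {"GENE_A": [10, 12, 11], "GENE_B": [5, 4, 6]}
treated_cells = {"GENE_A": [20, 22, 24], "GENE_B": [5, 5, 5]}

def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2 ratio of mean treated to mean control expression.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return {
        gene: math.log2(
            (mean(treated[gene]) + pseudocount)
            / (mean(control[gene]) + pseudocount)
        )
        for gene in control
    }

effects = log2_fold_change(control_cells, treated_cells)
for gene, lfc in effects.items():
    print(f"{gene}: log2 fold change = {lfc:+.2f}")
```

In this toy example, GENE_A comes out positive (higher in treated cells) and GENE_B comes out at zero (unchanged). A predictive model trained on real perturbation data would aim to estimate such shifts for drug and cell-line combinations that have not been measured.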
Researchers can access the Arc Virtual Cell Atlas through the Arc Institute's portal and use the datasets for applications such as developing predictive models, studying drug responses, and exploring disease mechanisms across diverse biological contexts.