🧬 scConcept: Why Your Cell Embeddings Don't Work

Deep Dive | Edition 20

May 20, 2026

Welcome back to the deep dive, where we break down the AI tools and data reshaping how new drugs are discovered. In each edition, we speak directly with the teams behind these tools to explain what they solve, how they work and where they are going next.

We know how hard science is. That’s why we built the Pioneer Programme.

We’re selecting academic and nonprofit research teams to get one year of free access to our drug discovery platform plus hands-on support from our science team. If your research bottleneck isn’t data but connecting the findings you already have, this is for you.

No cost. No data transfer. All IP stays with your institution. Applications close August, cohort starts September.

🔴 The Problem

If you zoom out, most single-cell foundation models share the same basic idea. Treat genes like words. Mask some of them. Ask the model to predict what’s missing. This strategy mirrors masked language modelling introduced in BERT, where models learn by predicting missing tokens in a sentence. It works well enough on paper, but something feels off once you start using the embeddings downstream.

As Mojtaba Bahrami put it when we spoke, “In language models, training and inference are basically the same task. You predict the next token. But in single-cell, no one actually cares about predicting masked genes. People care about the embedding.”

That mismatch matters. Masked gene prediction optimises the wrong thing. The model gets good at reconstructing counts, but the cell-level representation becomes an afterthought. Most approaches average learned gene embeddings and hope for the best. This may work depending on the downstream question but often it doesn’t. If the question can not be simply answered by looking at gene level information and needs understanding higher level biological processes going on in the cell, then we have a problem.

The cracks show up quickly. Simple methods like PCA or VAEs still outperform large models on tasks like cell type annotation. Benchmarking efforts such as the scIB framework have shown that classical integration methods can remain highly competitive across datasets. Spatial assays break embeddings entirely. New technologies shift the latent space so much that “foundation” starts to feel like marketing rather than reality.

💡 The Idea

Here comes scConcept. Instead of asking the model to reconstruct genes, it asks something more aligned with how the actual biology is: What is the identity of a cell given a partial set of its gene expression. In other words, can we identify the biological processes that make up the identity of a cell consistently through looking at different views of its transcriptome profile?

**Figure 1** | **Contrastive learning at the cell level.** Each cell is split into two non-overlapping gene subsets, which are passed through a shared transformer encoder. A dedicated CLS token represents the whole-cell embedding, and a contrastive loss pulls together embeddings from the same cell while pushing apart different cells. Rank encoding ensures the model learns relative gene expression rather than absolute counts.

Technically, this is done using contrastive learning. Each cell is split into two disjoint gene subsets. The model sees both views and is trained to pull them together in embedding space, while pushing apart views from different cells.

To make this work, the team borrowed a trick from BERT: a dedicated CLS token. In language, CLS represents a sentence. In scConcept, it represents the entire cell. The loss is applied directly to that token.

“If the model can solve this task,” Mojtaba explained, “it has to develop a high-level idea of cell identity. There’s no shortcut.”

That framing turns the embedding from a side effect into the main objective. The model is no longer rewarded for local gene accuracy, but for capturing global cellular identity.

📊 Where the data work really shows

The architecture alone is only half the story. The other half lives in how the data is presented. One of the team’s biggest insights came from a failed experiment. Early versions of scConcept used the same binning strategies as other models. The result was large, technology-driven shifts in the embedding space.

“That was the moment we realised something was fundamentally wrong,” Mojtaba said. “The model was learning the technology, not the biology.”

The fix was surprisingly simple. Instead of feeding absolute expression values, scConcept uses rank encoding. Genes are ordered by expression within each cell. Only relative relationships matter.

If gene A is higher than gene B, the model sees that. The actual counts are ignored.

This turns out to be remarkably robust. Different assays may disagree on absolute numbers, but gene rankings tend to stay stable. Rank encoding strips away much of the batch effect before the model even starts learning.

Then comes gene subsetting. During training, the model constantly sees partial views of cells, including realistic gene panels from spatial technologies. This forces the embedding to stay stable even when most genes are missing.

The result is a representation that does not panic when faced with a 300-gene panel instead of a full transcriptome.

**Figure 2 | Gene-panel agnostic embeddings.** scConcept maintains alignment between full-transcriptome data and reduced spatial gene panels. Unlike other models, performance degrades gradually as genes are removed, highlighting robustness to targeted assays such as Xenium and other spatial technologies.

📢 Why it is different

The evaluation results reflect that design choice. scConcept consistently outperforms other foundation models on cell type annotation, cross-technology transfer, and spatial imputation, often matching or beating domain-specific tools. More importantly, it fails more gracefully.

**Figure 3 | Improved cell-type annotation across datasets.** scConcept outperforms existing single-cell foundation models and classical methods on cell-type annotation benchmarks. Performance gains are consistent across accuracy and macro F1, demonstrating that optimising the embedding directly leads to stronger downstream classification.

When information is missing, performance drops for the right reasons, not because the embedding space collapses. Closely related cell types remain close. Spatial structure is preserved. Adaptation improves things further without retraining the entire model.

One subtle but important point the authors stress is that scConcept is not a batch correction method. It does not erase real biological differences. It just stops the model from confusing technology with biology.

That distinction matters if you actually want to interpret what the model is doing.

🔮 The Future

**Figure 4 | Integration across technologies and platforms.** scConcept co-embeds cells from scRNA-seq, snRNA-seq, Slide-seq, Xenium, CosMx, and MERFISH without explicit batch correction. Cell types cluster by biology rather than assay platform, demonstrating robustness to technical variation.

The long-term vision goes further than single cells. Right now, scConcept is trained on around 30 million cells, deliberately matching the scale of existing models. The next step is obvious: train on hundreds of millions. Initiatives like the Human Cell Atlas and repositories such as CELLxGENE now host hundreds of millions of publicly available cells, making this scale increasingly realistic.

However, the more interesting shift is conceptual. As Mojtaba put it, “Cells are the starting point. But biology doesn’t stop there. Tissues matter. Spatial context matters. Patients matter.”

Contrastive learning over partial views opens the door to representing tissue sections, neighbourhoods, even whole samples. Instead of asking whether two gene sets come from the same cell, future models might ask whether two regions come from the same tissue state.

That feels like a natural progression, especially as spatial assays become routine.

For now, scConcept is a reminder that better questions often beat bigger models. By focusing on what practitioners actually use, rather than what looks impressive on paper, it points toward a quieter but more useful future for single-cell AI, which is probably what the field needs right now.

👨‍🔬 Get in touch with Mojtaba.

📑 Read the paper.

💻 Check out the code.

Thanks for reading Kiin Bio Weekly!

💬 Get involved

We’re always looking to grow our community. If you’d like to get involved, contribute ideas or share something you’re building, fill out this form or reach out to me directly.

Share Kiin Bio Weekly

Subscribe now to stay at the forefront of AI in Life Science and keep up with this upcoming season of deep dives.

Connect With Us

Have questions on this or suggestions for our next deep dive? We’d love to hear from you!

📧 Email Us | 📲 Follow on LinkedIn | 🌐 Visit Our Website

Kiin Bio Weekly

Discussion about this post

Ready for more?