# 2: Life Science x AI

Kiin Bio

May 19, 2024

Welcome back to your weekly dose of AI news for Life Science!

Here’s what we have for you this week:

- Latest machine learning tools to design new antibiotics 👀
- New visual models for medical applications 📷
- New datasets & benchmarks 💿
- Life Science tools of the week 🛠️

If you like the article … follow us on LinkedIn

Don’t forget to take our short survey to help us understand how you currently use AI in your day-to-day!

Start Survey

Latest machine learning tool to design new antibiotics 👀

Pathogens pose a huge public health concern. The good thing: we have antibiotics. The bad thing: pathogens are adapting and developing defences against antibiotics. This is where AI comes into play! A group of researchers from Los Alamos National Laboratory developed a machine learning model to understand the key properties a drug candidate needs to effectively inhibit Pseudomonas aeruginosa, one of the nasty pathogens! The authors analysed 174 molecular properties in 1260 antimicrobial compounds and studied their correlations with antibacterial activity against the bacterium. The good news is that the authors have found some key clusters of predictors that are important for the permeability of a candidate drug!
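To make "studying correlations" a bit more concrete, here is a minimal sketch of how one might rank molecular properties by how strongly they track an activity score. This is not the authors' method (they trained machine learning models on real compound data); the property names, the numbers and the `rank_properties` helper are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant property carries no information about activity.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_properties(properties, activity):
    """Sort properties by absolute correlation with activity, strongest first."""
    scores = {name: pearson(values, activity) for name, values in properties.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: three properties measured on five compounds.
toy_properties = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "polar_surface_area": [90.0, 70.0, 80.0, 40.0, 30.0],
    "ring_count": [2.0, 2.0, 2.0, 2.0, 2.0],
}
toy_activity = [0.1, 0.2, 0.3, 0.4, 0.5]

ranking = rank_properties(toy_properties, toy_activity)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

In a real analysis you would have hundreds of properties and would also want to check whether groups of correlated properties form clusters, which is closer to what the article describes.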

New visual models for medical applications 📷

1/ PaliGemma

Google has done it again! Last week we talked about the MedGemini family of models for medical applications. Now, Google has released its PaliGemma open-source models. These models excel in general Visual Question Answering (VQA), i.e. they can detect objects in images, or even generate segmentation masks. The good thing about open source is that everyone can improve the models to solve specific needs. A researcher has fine-tuned the model (i.e. improved PaliGemma with custom data) to recognise fractures from X-ray scans. He also made his code available so you can fine-tune the model with your own data! Check out his guide here!

2/ MRSegmentator

When dealing with CT and MRI scans in a medical setting, one of the first things to do is to “segment” the image. Segmentation makes it possible to isolate specific structures and tissues from a scan, which helps medical practitioners do precise analysis and plan interventions, such as surgeries or radiation therapy, by providing detailed anatomical insights. The good news: AI is very good at dealing with image data.

This week, a new vision AI model has been published. MRSegmentator can segment 40 different anatomical structures in MRI and CT scans and has been trained on ~2,500 MRI and CT scans (1,200 manually annotated MRI scans from the UK Biobank, 221 in-house MRI scans and 1,228 CT scans). To start using the tool, check out their GitHub repo!
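If "segmentation" still sounds abstract, here is a tiny sketch of what a segmentation result looks like as data: a grid with the same shape as the image, where each pixel holds the label of the structure it belongs to. The snippet isolates one structure and compares a predicted mask with a reference mask using Dice overlap, a common score for this kind of comparison. It does not use MRSegmentator; the label numbers, the organ names and the toy grids are assumptions for illustration only.

```python
def isolate(label_map, label):
    """Return a binary mask (1 = pixel belongs to `label`) from a label map."""
    return [[1 if value == label else 0 for value in row] for row in label_map]


def dice(mask_a, mask_b):
    """Dice overlap between two binary masks of the same shape (1.0 = identical)."""
    intersection = sum(
        a & b for row_a, row_b in zip(mask_a, mask_b) for a, b in zip(row_a, row_b)
    )
    total = sum(map(sum, mask_a)) + sum(map(sum, mask_b))
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total


# Hypothetical labels: 0 = background, 1 = "liver", 2 = "spleen".
predicted = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
]
reference = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [2, 2, 0, 0],
]

dice_liver = dice(isolate(predicted, 1), isolate(reference, 1))
dice_spleen = dice(isolate(predicted, 2), isolate(reference, 2))
print(f"liver: {dice_liver:.3f}, spleen: {dice_spleen:.3f}")
```

Real scans are 3D volumes rather than small 2D grids, but the idea is the same: one label per voxel, and one mask per structure.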

New datasets & benchmarks 💿

1/ MISATO - Drug Discovery dataset

In the world of AI, having high-quality, usable data is everything. For example, the development of AlphaFold was possible only because large, high-quality public databases for proteins existed (in this case the Protein Data Bank, with 200,000+ deposited protein structures). However, comparable large, high-quality public databases for drug discovery do not exist. Why can’t protein databases like the Protein Data Bank fill the gap? Because the end-to-end drug discovery process needs more information than the protein structure alone, such as three-dimensional (3D) biomolecule–ligand data.

These authors created MISATO, an ML-ready dataset based on experimental protein–ligand structures, which includes:

- Quantum Mechanics: 19443 ligands, curated and refined
- Molecular Dynamics: 16972 simulated protein-ligand structures, 10 ns each
- Moreover, they have calculated many protein-ligand properties (such as docking), validated them in the lab and made the measures available

The code can be found here

2/ DANCE - Deep learning library and Benchmark for single-cell analysis

Apart from having the right data for training AI models, we need comprehensive tests and data (i.e. benchmarks) that allow us to compare models with each other. In single-cell analysis, such benchmarks have been somewhat lacking. Researchers at Michigan State University are fixing this by developing DANCE, a single large benchmark Python library covering 8 popular tasks and 32 state-of-the-art methods on 21 benchmark datasets. If you are working on single-cell analysis, you can check out DANCE here.

Looking forward to seeing new AI models for single cell in the coming months 🚀
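For readers wondering what a benchmark does under the hood, the core idea is simple: run every method on every dataset and score them all with the same metric, so the numbers are comparable. Below is a minimal sketch of that loop. It is not the DANCE API; the two toy "methods", the toy datasets and the `run_benchmark` helper are hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def run_benchmark(methods, datasets, metric):
    """Score every method on every dataset with one shared metric."""
    results = {}
    for method_name, method in methods.items():
        for dataset_name, (inputs, expected) in datasets.items():
            results[(method_name, dataset_name)] = metric(method(inputs), expected)
    return results


# Two hypothetical "methods" that label a measurement as high or low.
def threshold_classifier(inputs):
    return ["high" if value > 0.5 else "low" for value in inputs]


def always_low(inputs):
    return ["low" for _ in inputs]


toy_methods = {"threshold": threshold_classifier, "always_low": always_low}

# Hypothetical datasets: (inputs, expected labels).
toy_datasets = {
    "toy_a": ([0.1, 0.9, 0.7, 0.2], ["low", "high", "high", "low"]),
    "toy_b": ([0.4, 0.6, 0.3, 0.8], ["low", "low", "low", "high"]),
}

results = run_benchmark(toy_methods, toy_datasets, accuracy)
for (method_name, dataset_name), score in sorted(results.items()):
    print(f"{method_name} on {dataset_name}: {score:.2f}")
```

Notice that the trivial `always_low` baseline ties with the threshold method on `toy_b`. Catching this kind of surprise is exactly why shared benchmarks across many datasets are valuable.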

Life Science tools of the week 🛠️

1/ tFold - Antibody design

Antibodies (Abs) are essential in the adaptive immune system as they specifically recognize and neutralize antigens (Ags). The specificity of Abs toward Ags positions them as highly promising therapeutics. To this end, predicting the structures of Ab-Ag complexes is a priority to identify new drugs and many models have been developed in the past year. Researchers from the Tencent AI Lab have recently developed tFold, the latest state-of-the-art (SOTA) model for predicting 3D atomic-resolution structures of antibodies (Abs) and antibody-antigen (Ab-Ag) complexes.

As is often the case in this world, the authors repurposed and re-trained a general protein language model with ~10,000 Abs and ~5,000 Ab-Ag complexes curated from SAbDab. Key message: always focus on the right data for your use case. Once you have that, finding models to repurpose is easy!

The model is available for non-commercial purposes here

2/ EVO - Genomic LLM

While EVO is not a new model, we wanted to write about it given the importance we think it will have for genomics. EVO is an LLM for genomic data from the Arc Institute, trained on 80,000 genomes of bacteria and archaea. The idea is to pick up on the logic of genomic grammar, such as which amino acids tend to go together, what biological functions are performed by different genes, and which of those functions an organism absolutely needs to survive. While other genomic LLMs exist, EVO stands out for 1) its DNA pretraining dataset, the largest publicly available, especially for non-human sequences, and 2) the richness of downstream tasks it covers, such as:

- Predicting mutational effects on protein function
- Predicting mutational effects on ncRNA function
- Predicting gene expression from regulatory DNA
- Generative design of CRISPR-Cas molecular complexes
- Generative design of transposable biological systems
- Predicting gene essentiality with long genomic context
- Generating DNA sequences at genome scale

The open-source code can be found here

3/ RNAErnie - RNA language model

There are more and more large language models (LLMs) applied to life science. LLMs learn the relationships between “units” in order to predict how different units interact. Researchers from the Big Data Lab (under Baidu) in China have released a paper this week on their LLM applied to 50,000 RNA sequences. Why is it important? Because this tool beats other state-of-the-art (SOTA) tools at the following tasks:

- Classifying RNA sequences
- Predicting RNA-RNA interactions
- Predicting RNA secondary structure

The open-source code can be found here

BITE-SIZED COOKIES FOR THE WEEK 🍪

We wrote our first spotlight on different therapeutic modalities, including the advantages, limitations and major players!

Clinical trials are going digital! The European Medicines Agency requires all clinical trials still ongoing after 30 Jan 2025 to be transitioned from EudraCT to CTIS. Here is a guide to what that means!

The genetic cause of the rare disease spinocerebellar ataxia type 4 (SCA4) was finally found after a 25-year search! The first genetic link to SCA4 was established in 1996 … but the exact mutation causing it was unknown. Thanks to long-read whole-genome sequencing we now know the mutation … meaning we will be able to diagnose this disease even before birth, using prenatal genetic testing such as NIPT!

Amazing news for AI-based drug development! Insilico Medicine, a pioneer of AI in drug discovery, moved its cancer drug candidate, co-developed with Fosun Pharma, into a Phase I clinical trial.
