synth_pdb.plm — Protein Language Model Embeddings

ESM-2 per-residue embeddings via HuggingFace Transformers.

Install the optional dependency first:

pip install synth-pdb[plm]

Quick Start

from synth_pdb.plm import ESM2Embedder

embedder = ESM2Embedder()   # lazy — model loads on first embed() call

# Per-residue embeddings
emb = embedder.embed("MQIFVKTLTGKTITLEVEPS")
print(emb.shape)   # (20, 320) — 20 residues × 320-dim float32

# From a biotite AtomArray
emb = embedder.embed_structure(atom_array)   # same shape as embed()

# Sequence-level cosine similarity
sim = embedder.sequence_similarity("ACDEF", "ACDEF")   # → 1.0
sim = embedder.sequence_similarity("ACDEF", "VWLYG")   # → ~0.7–0.9
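
For intuition, a sequence-level similarity of this kind can be sketched as the cosine between mean-pooled per-residue embeddings. The sketch below uses random placeholder arrays rather than real ESM-2 output, and the helper names `mean_pool` and `cosine_similarity` are illustrative assumptions, not part of `synth_pdb.plm`. The library's actual pooling may differ.

```python
import numpy as np


def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average an (L, D) per-residue matrix into a single (D,) vector."""
    return per_residue.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine of the angle between two vectors, guarded against zero norms."""
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps
    return float(np.dot(a, b) / denom)


# Placeholder arrays standing in for embed() output (not real ESM-2 values).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5, 320)).astype(np.float32)   # 5 residues
emb_b = rng.normal(size=(7, 320)).astype(np.float32)   # 7 residues

self_sim = cosine_similarity(mean_pool(emb_a), mean_pool(emb_a))
cross_sim = cosine_similarity(mean_pool(emb_a), mean_pool(emb_b))
```

Because pooling removes the length axis, sequences of different lengths can be compared directly.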

!!! note "Lazy loading"
    `ESM2Embedder()` does nothing until you call `embed()`. This means `from synth_pdb.plm import ESM2Embedder` is always safe, even without `torch` or `transformers` installed.
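
The lazy-loading behaviour described in the note can be illustrated with a generic pattern: keep the model slot empty at construction and fill it on first use. This is a minimal sketch, not the actual `ESM2Embedder` implementation. `LazyModel` and `fake_loader` are hypothetical names, and the loader is a stand-in that returns zeros.

```python
class LazyModel:
    """Defers an expensive load until the first call that needs it."""

    def __init__(self, loader):
        self._loader = loader   # callable that builds the model
        self._model = None      # nothing is loaded at construction time

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def _ensure_loaded(self):
        if self._model is None:
            # Heavy imports (e.g. torch, transformers) would live inside the
            # loader, so importing this module never requires them.
            self._model = self._loader()
        return self._model

    def embed(self, sequence: str):
        model = self._ensure_loaded()
        return model(sequence)


load_calls = []


def fake_loader():
    """Stand-in for a real model loader; records each time it runs."""
    load_calls.append(1)
    return lambda seq: [0.0] * len(seq)


lazy = LazyModel(fake_loader)
before = lazy.loaded           # False: constructing did not load anything
first = lazy.embed("ACDEF")    # triggers the single load
second = lazy.embed("VWLYG")   # reuses the cached model
```

The trade-off of this pattern is that the first call pays the full load cost, so latency-sensitive code may want to issue a warm-up call at startup.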


Using a Larger Model

All ESM-2 variants share the same API:

# Default (8M params, 320-dim, ~30 MB)
embedder = ESM2Embedder()

# Better accuracy (35M params, 480-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t12_35M_UR50D")

# Near-production (150M params, 640-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t30_150M_UR50D")
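
Switching checkpoints changes the embedding width, so any downstream layer has to be sized to match. The sketch below records the widths listed above in a small lookup table. `ESM2_DIMS` and `probe_input_dim` are hypothetical helpers, and the 8M checkpoint name is an assumption about which model the default corresponds to. The `embedding_dim` member listed in the API reference presumably reports the same value without a table.

```python
# Embedding widths of the ESM-2 checkpoints mentioned above.
ESM2_DIMS = {
    "facebook/esm2_t6_8M_UR50D": 320,     # assumed to be the default 8M model
    "facebook/esm2_t12_35M_UR50D": 480,
    "facebook/esm2_t30_150M_UR50D": 640,
}


def probe_input_dim(model_name: str) -> int:
    """Return the embedding width a downstream layer must accept."""
    try:
        return ESM2_DIMS[model_name]
    except KeyError:
        raise ValueError(f"Unknown ESM-2 checkpoint: {model_name!r}") from None


dim = probe_input_dim("facebook/esm2_t12_35M_UR50D")   # 480
```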

API Reference

::: synth_pdb.plm.ESM2Embedder
    options:
      show_source: false
      members:
        - embed
        - embed_structure
        - mean_embed
        - sequence_similarity
        - embedding_dim


Practical Examples

Feed into GNN as node features

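from synth_pdb.plm import ESM2Embedder
import numpy as np

# `structure` is a biotite AtomArray; `geometry_features` is your own
# (L, D) array with one row per residue, in the same residue order.
plm = ESM2Embedder()
plm_features = plm.embed_structure(structure)   # (L, 320)

# Concatenate with your existing per-residue geometry features
node_features = np.concatenate([geometry_features, plm_features], axis=-1)   # (L, D + 320)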
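
Concatenation along the last axis only works when both arrays have the same number of residues, and mixed dtypes are silently upcast. A defensive wrapper might look like the sketch below. `concat_node_features` is a hypothetical helper, and the inputs are placeholder arrays with an arbitrary choice of 6 geometry features.

```python
import numpy as np


def concat_node_features(geometry: np.ndarray, plm: np.ndarray) -> np.ndarray:
    """Join two per-residue feature matrices along the feature axis."""
    if geometry.ndim != 2 or plm.ndim != 2:
        raise ValueError("both inputs must be 2-D (residues, features)")
    if geometry.shape[0] != plm.shape[0]:
        raise ValueError(
            f"residue count mismatch: {geometry.shape[0]} vs {plm.shape[0]}"
        )
    # Cast to a common dtype so the result is not silently upcast to float64.
    return np.concatenate(
        [geometry.astype(np.float32), plm.astype(np.float32)], axis=-1
    )


# Placeholder inputs: 20 residues, 6 geometry features, 320 PLM features.
geometry_features = np.zeros((20, 6), dtype=np.float64)
plm_features = np.ones((20, 320), dtype=np.float32)
node_features = concat_node_features(geometry_features, plm_features)
```

A residue-count mismatch usually means the structure contains residues that were dropped or added on one side (for example, non-standard residues), so failing loudly is preferable to truncating.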

Secondary structure linear probe

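import torch
import torch.nn as nn

from synth_pdb.plm import ESM2Embedder

plm = ESM2Embedder()
emb = torch.tensor(plm.embed("MQIFVKTLTGKTITLEVEPS"))   # (20, 320)

# The probe is randomly initialised here; train it on labelled
# secondary-structure data before interpreting its outputs.
probe = nn.Linear(320, 3)   # 3 classes: Helix / Strand / Coil
logits = probe(emb)          # (20, 3)
probs = logits.softmax(-1)   # (20, 3), each row sums to 1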

Pairwise similarity matrix over a sequence library

import numpy as np

from synth_pdb.plm import ESM2Embedder

sequences = ["ACDEF", "ACDEF", "VWLYG", "RRKKK"]
plm = ESM2Embedder()
mean_embs = np.stack([plm.mean_embed(s) for s in sequences])   # (N, 320)

# Normalise rows, then dot-product → cosine similarity matrix
norms = np.linalg.norm(mean_embs, axis=1, keepdims=True)
normed = mean_embs / (norms + 1e-8)
sim_matrix = normed @ normed.T   # (N, N)
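
Once the matrix is built, a common next step is to look up each sequence's closest neighbour. The sketch below masks the diagonal so that a sequence never matches itself. `nearest_neighbours` is a hypothetical helper, and the matrix is hand-written placeholder data, not model output.

```python
import numpy as np


def nearest_neighbours(sim_matrix: np.ndarray) -> np.ndarray:
    """For each row, return the index of the most similar *other* row."""
    masked = sim_matrix.astype(np.float64, copy=True)   # leave the input untouched
    np.fill_diagonal(masked, -np.inf)                   # exclude self-matches
    return masked.argmax(axis=1)


# Hand-written placeholder similarity matrix.
sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])
neighbours = nearest_neighbours(sim)   # array([1, 0, 1])
```

Note that exact duplicates in the library (such as the two `"ACDEF"` entries above) will pick each other as nearest neighbours, so deduplicate first if that is not wanted.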

Background

See Protein Language Models for the full scientific background, model architecture diagram, and explanation of what the embedding dimensions encode.