batch_generator Module

The batch_generator module provides high-performance vectorized protein structure generation optimized for deep learning and large-scale simulation.

Overview

Unlike the serial generator, batch_generator uses NumPy's vectorized operations to build hundreds or thousands of structures in a single pass. The output is "ML-Ready": contiguous tensors that can be passed directly to frameworks like MLX, PyTorch, or JAX.

Main Classes

::: synth_pdb.batch_generator.BatchedGenerator
    options:
      show_root_heading: true
      show_source: true
      members:
        - __init__
        - generate_batch

::: synth_pdb.batch_generator.BatchedPeptide
    options:
      show_root_heading: true
      show_source: true
      members:
        - __init__
        - to_pdb
        - save_pdb
        - get_6d_orientations
        - analyze_ensemble

Usage Examples

Batched Structure Generation

Generate 100 alpha-helical structures in a single vectorized pass.

from synth_pdb.batch_generator import BatchedGenerator

# Create generator
gen = BatchedGenerator("ALA-GLY-SER-LEU-VAL", n_batch=100)

# Generate batch
batch = gen.generate_batch(conformation="alpha")

# Output: Batch coordinate tensor (100, N_atoms, 3)
coords = batch.coords

Ensemble Analysis

Perform NMR-style analysis on the generated batch to find the medoid structure and average RMSD.

analysis = batch.analyze_ensemble(superimpose=True)
print(f"Medoid index: {analysis['medoid_index']}")
print(f"Average RMSD: {analysis['avg_rmsd']:.2f} Å")
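
The medoid and average RMSD reported above can be illustrated with plain NumPy. The sketch below is not the library's implementation; `pairwise_rmsd` and `summarize_ensemble` are hypothetical helper names, and it assumes that superposition means an optimal rigid-body (Kabsch) fit applied to every pair of structures. It only reuses the dictionary keys shown in the example.

```python
import numpy as np

def pairwise_rmsd(coords):
    """Optimal-superposition RMSD between every pair in a (B, N, 3) batch."""
    x = coords - coords.mean(axis=1, keepdims=True)   # centre each structure
    n_atoms = x.shape[1]
    # Covariance matrix for every (i, j) pair of structures: (B, B, 3, 3).
    cov = np.einsum("ian,jam->ijnm", x, x)
    sing = np.linalg.svd(cov, compute_uv=False)       # (B, B, 3), descending order
    # Negate the smallest singular value where the best fit would be a reflection.
    sing[..., -1] *= np.sign(np.linalg.det(cov))
    sq_norm = (x ** 2).sum(axis=(1, 2))               # (B,)
    msd = (sq_norm[:, None] + sq_norm[None, :] - 2.0 * sing.sum(axis=-1)) / n_atoms
    return np.sqrt(np.clip(msd, 0.0, None))           # clip guards against round-off

def summarize_ensemble(coords):
    """Hypothetical helper: medoid index and mean pairwise RMSD of a batch."""
    rmsd = pairwise_rmsd(coords)
    n = rmsd.shape[0]
    medoid = int(np.argmin(rmsd.sum(axis=1)))         # smallest total distance to the rest
    upper = rmsd[np.triu_indices(n, k=1)]             # each unordered pair counted once
    return {"medoid_index": medoid, "avg_rmsd": float(upper.mean())}

if __name__ == "__main__":
    demo = np.random.default_rng(0).normal(size=(8, 20, 3))
    print(summarize_ensemble(demo))
```

Because the RMSD is derived from singular values alone, no rotation matrices are built, which keeps the whole pairwise computation in a handful of array expressions.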

Exporting Orientograms

Extract 6D inter-residue orientations (distances plus the $\omega$, $\theta$, and $\phi$ angles) for all residue pairs.

orientations = batch.get_6d_orientations()
# orientations['dist'] is a (100, L, L) distance tensor
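
To make the shape of these tensors concrete, here is a minimal NumPy sketch of two of the features: the pairwise distance map and the $\omega$ dihedral. It assumes the trRosetta-style convention (C-beta to C-beta distances, and $\omega$ defined over CA(i), CB(i), CB(j), CA(j)); the library may define them differently. `dihedral` and `distance_and_omega` are hypothetical names, and the C-beta coordinates are assumed to be supplied by the caller.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians for broadcastable (..., 3) point arrays."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    norm = np.linalg.norm(b1, axis=-1, keepdims=True)
    b1 = b1 / np.where(norm > 0.0, norm, 1.0)          # avoid 0/0 when p1 == p2
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = b0 - (b0 * b1).sum(-1, keepdims=True) * b1
    w = b2 - (b2 * b1).sum(-1, keepdims=True) * b1
    x = (v * w).sum(-1)
    y = (np.cross(b1, v) * w).sum(-1)
    return np.arctan2(y, x)

def distance_and_omega(ca, cb):
    """ca, cb: (B, L, 3) C-alpha and C-beta coordinates (assumed inputs)."""
    cb_i, cb_j = cb[:, :, None, :], cb[:, None, :, :]  # broadcast to all (i, j) pairs
    ca_i, ca_j = ca[:, :, None, :], ca[:, None, :, :]
    dist = np.linalg.norm(cb_i - cb_j, axis=-1)        # (B, L, L)
    omega = dihedral(ca_i, cb_i, cb_j, ca_j)           # (B, L, L)
    idx = np.arange(ca.shape[1])
    omega[:, idx, idx] = 0.0                           # angle is undefined for i == j
    return {"dist": dist, "omega": omega}
```

Both maps are symmetric in the residue indices; $\theta$ and $\phi$ are not, which is why a full orientogram needs all four channels.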

Educational Notes

Batched Generation (GPU-First)

Traditional generators process structures one by one. batch_generator instead relies on vectorized math:

1. Broadcasting: With NumPy's broadcasting, a single array expression calculates positions for all members of the batch simultaneously.
2. Hardware Acceleration: On modern architectures (like Apple Silicon M4), this can leverage AMX/Accelerate units, often providing 10-100x speedups over Python loops.
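
The difference between a per-structure loop and a broadcast expression can be seen in a small standalone example. This is an illustrative sketch, not code from the module: both functions are hypothetical and rotate one shared template about the z-axis by a different angle for each batch member.

```python
import numpy as np

def rotate_batch_loop(template, angles):
    """Serial version: build one rotation matrix per structure in a Python loop."""
    out = np.empty((len(angles),) + template.shape)
    for k, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        out[k] = template @ rot.T
    return out

def rotate_batch_vectorized(template, angles):
    """Batched version: all rotation matrices and products in array expressions."""
    c, s = np.cos(angles), np.sin(angles)
    zeros, ones = np.zeros_like(c), np.ones_like(c)
    rot = np.stack([
        np.stack([c, -s, zeros], axis=-1),
        np.stack([s, c, zeros], axis=-1),
        np.stack([zeros, zeros, ones], axis=-1),
    ], axis=-2)                                        # (B, 3, 3)
    # out[b, n, i] = sum_j rot[b, i, j] * template[n, j]
    return np.einsum("bij,nj->bni", rot, template)

if __name__ == "__main__":
    template = np.random.default_rng(0).normal(size=(50, 3))   # (N_atoms, 3)
    angles = np.linspace(0.0, np.pi, 100)                      # one angle per member
    same = np.allclose(rotate_batch_loop(template, angles),
                       rotate_batch_vectorized(template, angles))
    print("loop and vectorized results agree:", same)
```

The two functions return the same (B, N_atoms, 3) array; the vectorized one simply moves the loop over batch members out of Python and into NumPy.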

The "Memory Wall" in AI Training

When generating millions of samples, the bottleneck is often the "Memory Wall":

- Latency: Copying large tensors from CPU to GPU memory can take longer than the math itself.
- Contiguity: Deep learning frameworks work most efficiently with contiguous memory. BatchedGenerator returns its output as one large C-contiguous array, avoiding the overhead of "gather" operations on Python lists.
- Unified Memory: On unified memory architectures, the coordinate tensor can be "zero-copy": generated by NumPy and immediately visible to the GPU without movement.
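
The contiguity point can be checked directly with NumPy's array flags. The snippet below is a generic illustration using random data in place of generator output; the variable names are arbitrary.

```python
import numpy as np

n_batch, n_atoms = 4, 10
rng = np.random.default_rng(0)

# A Python list of per-structure arrays: each one lives in its own allocation.
structures = [rng.random((n_atoms, 3)) for _ in range(n_batch)]

# Stacking copies them once into a single (B, N, 3) C-contiguous block.
batch = np.stack(structures)
print(batch.flags["C_CONTIGUOUS"])          # True

# Reshaping a contiguous array is free: the result is a view on the same memory.
flat_view = batch.reshape(n_batch, -1)
print(np.shares_memory(batch, flat_view))   # True

# Slicing out one coordinate gives a strided, non-contiguous view ...
x_only = batch[:, :, 0]
print(x_only.flags["C_CONTIGUOUS"])         # False

# ... which must be copied before code that needs a dense buffer can use it.
packed = np.ascontiguousarray(x_only)
print(np.shares_memory(batch, packed))      # False
```

Framework bridges that wrap an existing buffer instead of copying it generally depend on this kind of dense layout, which is why producing the batch as one array from the start avoids an extra pass over the data.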

Peptidyl Chain Walk

The module implements a vectorized NeRF (Natural Extension Reference Frame) walk. It places atoms for all structures in the batch iteratively:

1. Place N for all members.
2. Place CA for all members using $N(i), CA(i-1), C(i-1)$.
3. Place C for all members.
4. Place O for all members.
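
A batched NeRF step can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's code: `place_atom` and `backbone_walk` are hypothetical names, the bond lengths and angles are common textbook ideal values, the first residue is seeded in the xy-plane, and carbonyl O placement is omitted for brevity.

```python
import numpy as np

# Assumed ideal backbone geometry (Angstrom, degrees), for illustration only.
BOND = {"N_CA": 1.458, "CA_C": 1.525, "C_N": 1.329}
ANGLE = {"N_CA_C": 111.0, "CA_C_N": 116.2, "C_N_CA": 121.7}

def place_atom(a, b, c, bond, angle, torsion):
    """Place atom d for every batch member at once.

    a, b, c: (B, 3) reference atoms. bond is |c-d|, angle is b-c-d and
    torsion is a-b-c-d, both in radians. torsion may be a scalar or (B,).
    """
    bc = c - b
    bc = bc / np.linalg.norm(bc, axis=-1, keepdims=True)
    n = np.cross(b - a, bc)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)
    m = np.cross(n, bc)
    torsion = np.asarray(torsion, dtype=float).reshape(-1, 1)   # (B, 1) or (1, 1)
    x = -bond * np.cos(angle)
    y = bond * np.sin(angle) * np.cos(torsion)
    z = bond * np.sin(angle) * np.sin(torsion)
    return c + x * bc + y * m + z * n

def backbone_walk(phi, psi, omega=np.pi):
    """phi, psi: (B, L) torsions in radians. Returns (B, L, 3, 3) for N, CA, C."""
    n_batch, length = phi.shape
    ang = {k: np.deg2rad(v) for k, v in ANGLE.items()}
    xyz = np.zeros((n_batch, length, 3, 3))
    # Seed residue 0: N at the origin, CA on the x-axis, C in the xy-plane.
    xyz[:, 0, 1] = [BOND["N_CA"], 0.0, 0.0]
    xyz[:, 0, 2] = [BOND["N_CA"] - BOND["CA_C"] * np.cos(ang["N_CA_C"]),
                    BOND["CA_C"] * np.sin(ang["N_CA_C"]), 0.0]
    for i in range(1, length):   # loop over residues, vectorized over the batch
        n_prev, ca_prev, c_prev = xyz[:, i - 1, 0], xyz[:, i - 1, 1], xyz[:, i - 1, 2]
        n_i = place_atom(n_prev, ca_prev, c_prev, BOND["C_N"], ang["CA_C_N"], psi[:, i - 1])
        ca_i = place_atom(ca_prev, c_prev, n_i, BOND["N_CA"], ang["C_N_CA"], omega)
        c_i = place_atom(c_prev, n_i, ca_i, BOND["CA_C"], ang["N_CA_C"], phi[:, i])
        xyz[:, i, 0], xyz[:, i, 1], xyz[:, i, 2] = n_i, ca_i, c_i
    return xyz

if __name__ == "__main__":
    # 100 five-residue chains with alpha-helical torsions (phi = -57, psi = -47 degrees).
    phi = np.full((100, 5), np.deg2rad(-57.0))
    psi = np.full((100, 5), np.deg2rad(-47.0))
    print(backbone_walk(phi, psi).shape)
```

The Python loop runs once per residue rather than once per structure, so its cost grows with chain length while the batch dimension is handled entirely by array operations.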

See Also