batch_generator Module
The batch_generator module provides high-performance vectorized protein structure generation optimized for deep learning and large-scale simulation.
Overview
Unlike the serial generator, batch_generator uses NumPy's vectorized operations to build hundreds or thousands of structures in a single pass. The output is "ML-ready": contiguous coordinate tensors that can be passed directly to frameworks like MLX, PyTorch, or JAX.
Main Classes
::: synth_pdb.batch_generator.BatchedGenerator
    options:
      show_root_heading: true
      show_source: true
      members:
        - __init__
        - generate_batch

::: synth_pdb.batch_generator.BatchedPeptide
    options:
      show_root_heading: true
      show_source: true
      members:
        - __init__
        - to_pdb
        - save_pdb
        - get_6d_orientations
        - analyze_ensemble
Usage Examples
Batched Structure Generation
Generate 100 alpha-helical structures in a single vectorized pass.
from synth_pdb.batch_generator import BatchedGenerator
# Create generator
gen = BatchedGenerator("ALA-GLY-SER-LEU-VAL", n_batch=100)
# Generate batch
batch = gen.generate_batch(conformation="alpha")
# Output: Batch coordinate tensor (100, N_atoms, 3)
coords = batch.coords
Ensemble Analysis
Perform NMR-style analysis on the generated batch to find the medoid structure and average RMSD.
analysis = batch.analyze_ensemble(superimpose=True)
print(f"Medoid index: {analysis['medoid_index']}")
print(f"Average RMSD: {analysis['avg_rmsd']:.2f} Å")
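As a rough illustration of what this kind of analysis involves, the sketch below picks a medoid from a pairwise RMSD matrix with plain NumPy. It is not the code behind analyze_ensemble: the helper name ensemble_medoid and the toy data are hypothetical, and structures are only centered on their centroids rather than optimally superimposed, so the values will generally differ from a superimpose=True run.

```python
import numpy as np

def ensemble_medoid(coords):
    """Hypothetical helper: medoid index and mean pairwise RMSD.

    coords has shape (B, N_atoms, 3). Structures are centered on their
    centroids only; no rotational superposition is done (a simplification).
    """
    coords = np.asarray(coords, dtype=float)
    n_structs = coords.shape[0]
    centered = coords - coords.mean(axis=1, keepdims=True)
    # (B, 1, N, 3) - (1, B, N, 3) broadcasts to every pair of structures.
    diff = centered[:, None, :, :] - centered[None, :, :, :]
    rmsd = np.sqrt((diff ** 2).sum(axis=-1).mean(axis=-1))  # (B, B)
    # The medoid minimizes the mean RMSD to all other members.
    mean_to_others = rmsd.sum(axis=1) / (n_structs - 1)
    medoid_index = int(np.argmin(mean_to_others))
    # Average over unique pairs (upper triangle, diagonal excluded).
    avg_rmsd = float(rmsd[np.triu_indices(n_structs, k=1)].mean())
    return medoid_index, avg_rmsd

# Toy ensemble: one base structure displaced along a single direction,
# so the middle member (index 2) is the medoid by construction.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 3))
direction = rng.normal(size=(20, 3))
scales = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
toy = base[None] + scales[:, None, None] * direction[None]
medoid_index, avg_rmsd = ensemble_medoid(toy)
```

The intermediate (B, B, N_atoms, 3) array grows quadratically with batch size, so a real implementation would likely process pairs in chunks for large ensembles.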
Exporting Orientograms
Extract 6D inter-residue orientations for all residue pairs: distances plus the $\omega$ and $\theta$ dihedrals and the $\phi$ planar angle.
orientations = batch.get_6d_orientations()
# orientations['dist'] is a (100, L, L) distance tensor
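To make the shape of the distance tensor concrete, here is a minimal NumPy sketch that builds a (B, L, L) matrix from one reference atom per residue. The function name pairwise_distance_tensor and the random stand-in coordinates are hypothetical, and the choice of reference atom (Cβ is assumed here) may not match what get_6d_orientations uses.

```python
import numpy as np

def pairwise_distance_tensor(points):
    """Hypothetical helper: (B, L, 3) reference atoms -> (B, L, L) distances."""
    points = np.asarray(points, dtype=float)
    # (B, L, 1, 3) - (B, 1, L, 3) broadcasts to all residue pairs per structure.
    diff = points[:, :, None, :] - points[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Random stand-in for Cβ coordinates of 100 five-residue peptides.
rng = np.random.default_rng(1)
cb = rng.normal(size=(100, 5, 3))
dist = pairwise_distance_tensor(cb)
```

Each slice dist[b] is symmetric with a zero diagonal, which is a quick sanity check for any distance tensor.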
Educational Notes
Batched Generation (GPU-First)
Traditional generators process structures one by one. batch_generator instead relies on vectorized math, which helps in two ways:
1. Broadcasting: With NumPy's broadcasting, a single array expression calculates positions for all members of the batch at once.
2. Hardware Acceleration: On modern hardware (such as Apple Silicon M4), NumPy's compiled kernels and the BLAS library it links against, for example Apple's Accelerate framework, can use the chip's SIMD and matrix units, often providing 10-100x speedups over equivalent Python loops.
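The broadcasting point can be illustrated with a small, self-contained comparison: rotating every structure in a batch about the z-axis by its own angle, first with a Python loop and then with one batched expression. The array names and sizes are made up for the example and are not taken from the module.

```python
import numpy as np

rng = np.random.default_rng(2)
n_batch, n_atoms = 100, 25
coords = rng.normal(size=(n_batch, n_atoms, 3))
angles = rng.uniform(0.0, 2.0 * np.pi, size=n_batch)

# One z-axis rotation matrix per batch member, shape (B, 3, 3).
cos_a, sin_a = np.cos(angles), np.sin(angles)
rot = np.zeros((n_batch, 3, 3))
rot[:, 0, 0] = cos_a
rot[:, 0, 1] = -sin_a
rot[:, 1, 0] = sin_a
rot[:, 1, 1] = cos_a
rot[:, 2, 2] = 1.0

# Serial version: one matrix product per structure.
looped = np.empty_like(coords)
for i in range(n_batch):
    looped[i] = coords[i] @ rot[i].T

# Batched version: matmul broadcasts over the leading batch axis.
vectorized = coords @ rot.transpose(0, 2, 1)
```

Both versions produce the same array; the batched form simply moves the loop over structures out of Python and into NumPy's compiled code.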
The "Memory Wall" in AI Training
When generating millions of samples, the bottleneck is often the "Memory Wall":
- Latency: Copying large tensors from CPU to GPU memory can be slower than the math itself.
- Contiguity: Deep learning frameworks generally expect tensors in contiguous memory. BatchedGenerator ensures the output is a single C-contiguous array, avoiding the overhead of "gather" operations on Python lists.
- Unified Memory: On unified memory architectures, the coordinate tensor can be "zero-copy": generated by NumPy and immediately visible to the GPU without being moved.
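The contiguity point can be checked directly in NumPy. The sketch below uses made-up arrays rather than BatchedGenerator output: it contrasts a list of per-structure arrays with a single stacked array, shows how a strided slice loses contiguity, and confirms that reshaping a contiguous array returns a view rather than a copy.

```python
import numpy as np

rng = np.random.default_rng(3)

# A Python list of separate arrays: each lives in its own allocation.
per_structure = [rng.normal(size=(50, 3)).astype(np.float32) for _ in range(8)]

# Stacking copies them once into a single (8, 50, 3) C-contiguous block.
batch = np.stack(per_structure)
is_contiguous = batch.flags["C_CONTIGUOUS"]

# Taking every second atom yields a strided view that is not contiguous.
strided = batch[:, ::2, :]
strided_is_contiguous = strided.flags["C_CONTIGUOUS"]

# ascontiguousarray repacks the view (this step does copy).
repacked = np.ascontiguousarray(strided)

# Reshaping a contiguous array is a view onto the same memory, not a copy.
flat = batch.reshape(-1, 3)
shares = np.shares_memory(batch, flat)
```

Whether a given framework can then wrap the array without copying depends on that framework and the device; this example only covers the NumPy side.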
Peptidyl Chain Walk
The module implements a vectorized NeRF (Natural Extension Reference Frame) walk. It places atoms for ALL structures in the batch iteratively:
1. Place N for all members.
2. Place CA for all members using $N(i), CA(i-1), C(i-1)$.
3. Place C for all members.
4. Place O for all members.
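A single step of such a walk can be sketched as follows. This is a generic batched NeRF placement, not the module's own code: the function name place_next_atom, its signature, and the bond length and angle values are illustrative assumptions. Given three previously placed atoms a, b, c per structure, it places a fourth atom d at the requested bond length, bond angle (b-c-d), and torsion (a-b-c-d).

```python
import numpy as np

def place_next_atom(a, b, c, bond_length, bond_angle, torsion):
    """Hypothetical batched NeRF step.

    a, b, c: (B, 3) coordinates of the three preceding atoms.
    bond_length, bond_angle, torsion: scalars or (B,) arrays (angles in radians).
    Returns the (B, 3) coordinates of the new atom d.
    """
    a, b, c = (np.asarray(x, dtype=float) for x in (a, b, c))
    n_batch = c.shape[0]
    r = np.broadcast_to(np.asarray(bond_length, dtype=float), (n_batch,))
    theta = np.broadcast_to(np.asarray(bond_angle, dtype=float), (n_batch,))
    phi = np.broadcast_to(np.asarray(torsion, dtype=float), (n_batch,))

    # Local orthonormal frame at c: bc along the last bond, n normal to the
    # a-b-c plane, m completing the right-handed frame.
    bc = c - b
    bc = bc / np.linalg.norm(bc, axis=-1, keepdims=True)
    n = np.cross(b - a, bc)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)
    m = np.cross(n, bc)

    # Displacement of d in the local frame, then mapped to global coordinates.
    dx = -r * np.cos(theta)
    dy = r * np.sin(theta) * np.cos(phi)
    dz = r * np.sin(theta) * np.sin(phi)
    return c + dx[:, None] * bc + dy[:, None] * m + dz[:, None] * n

# Four structures share the first three atoms but get different torsions.
n_batch = 4
a = np.tile([0.0, 1.0, 0.0], (n_batch, 1))
b = np.zeros((n_batch, 3))
c = np.tile([1.5, 0.0, 0.0], (n_batch, 1))
torsions = np.deg2rad([-60.0, 0.0, 90.0, 180.0])
d = place_next_atom(a, b, c, bond_length=1.33,
                    bond_angle=np.deg2rad(115.0), torsion=torsions)
```

A full chain walk would call a step like this once per backbone atom, so the Python loop runs over atoms while each call handles the entire batch.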
See Also
- generator Module - Serial structure generation
- dataset Module - Bulk dataset orchestration
- Scientific Background: NeRF Geometry