msa Module

The msa module implements a physics-based, sequence-level evolutionary simulator that generates Multiple Sequence Alignments (MSAs) with co-evolutionary constraints.

Overview

Based on Direct Coupling Analysis (DCA) theory, this module models sequence probability with a Potts energy model. A Metropolis-Hastings Markov Chain Monte Carlo (MCMC) sampler simulates evolutionary drift under that energy, so that the sequences it produces remain compatible with the native 3D fold, which is supplied as a contact map.
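
As a rough illustration of the idea above, the sketch below writes out a generic Potts energy restricted to contacting pairs, the local energy change for a single-site mutation, and the Metropolis acceptance rule. It is a minimal sketch, not the module's implementation: the function names, the random fields `h` and couplings `J`, and the toy contact map are all assumptions made for the example.

```python
import numpy as np

Q = 20  # number of amino-acid states

def potts_energy(seq, h, J, contacts):
    """E(s) = -sum_i h[i, s_i] - sum_{i<j in contact} J[i, j, s_i, s_j]."""
    n = len(seq)
    energy = -sum(h[i, seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if contacts[i, j]:
                energy -= J[i, j, seq[i], seq[j]]
    return float(energy)

def delta_energy(seq, pos, new_state, h, J, contacts):
    """Energy change for mutating one site; only its contacts are visited.

    Assumes J[i, j, a, b] == J[j, i, b, a] and no self-contacts.
    """
    old_state = seq[pos]
    delta = -(h[pos, new_state] - h[pos, old_state])
    for j in np.flatnonzero(contacts[pos]):
        delta -= J[pos, j, new_state, seq[j]] - J[pos, j, old_state, seq[j]]
    return float(delta)

def metropolis_step(seq, h, J, contacts, temperature, rng):
    """Propose a uniform single-site mutation; accept with min(1, exp(-dE/T))."""
    pos = int(rng.integers(len(seq)))
    new_state = int(rng.integers(Q))
    dE = delta_energy(seq, pos, new_state, h, J, contacts)
    if dE <= 0 or rng.random() < np.exp(-dE / temperature):
        seq[pos] = new_state
        return True
    return False

# Toy parameters (hypothetical): random fields and symmetrized couplings.
rng = np.random.default_rng(0)
n = 6
h = rng.normal(size=(n, Q))
J = rng.normal(size=(n, n, Q, Q))
J = 0.5 * (J + J.transpose(1, 0, 3, 2))  # enforce J[i,j,a,b] == J[j,i,b,a]
contacts = np.zeros((n, n), dtype=bool)
contacts[0, 5] = contacts[5, 0] = True
contacts[1, 4] = contacts[4, 1] = True
seq = rng.integers(Q, size=n)
```

Computing the energy change locally rather than re-evaluating the full energy is what keeps each MCMC step cheap; the cost scales with the number of contacts at the mutated site, not with sequence length.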

Main Classes

::: synth_pdb.msa.CoevolutionModel
    options:
      show_root_heading: true
      show_source: true
      members:
        - init
        - calculate_energy
        - calculate_delta_energy

::: synth_pdb.msa.MetropolisHastingsSampler
    options:
      show_root_heading: true
      show_source: true
      members:
        - init
        - start
        - step

Main Functions

::: synth_pdb.msa.generate_msa
    options:
      show_root_heading: true
      show_source: true

Usage Examples

Generating a Synthetic MSA

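import numpy as np
from synth_pdb.msa import generate_msa

# base_sequence: str of one-letter amino-acid codes
# contact_map: np.ndarray (N x N boolean), where N = len(base_sequence)
base_sequence = "ACDEFGHIKL"

# Illustrative placeholder contact map; in practice, derive it from a structure.
n = len(base_sequence)
my_contact_map = np.zeros((n, n), dtype=bool)
for i, j in [(0, 9), (1, 8), (2, 7)]:
    my_contact_map[i, j] = my_contact_map[j, i] = True

msa = generate_msa(
    base_sequence=base_sequence,
    contact_map=my_contact_map,
    num_sequences=50,
    temperature=1.0,
)

# Print each generated sequence
for seq in msa:
    print(seq)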
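
The example above needs an N x N boolean contact map. One common way to obtain one is to threshold pairwise C-alpha distances; the sketch below shows that approach with NumPy. The helper name `contact_map_from_coords`, the 8 Å cutoff, and the minimum sequence separation of 3 are assumptions chosen for illustration, not part of the msa module.

```python
import numpy as np

def contact_map_from_coords(coords, cutoff=8.0, min_separation=3):
    """Return a symmetric boolean contact map from (N, 3) coordinates.

    Two residues are in contact when they are closer than `cutoff`
    and at least `min_separation` positions apart in sequence.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting: shape (N, N)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    idx = np.arange(len(coords))
    far_in_sequence = np.abs(idx[:, None] - idx[None, :]) >= min_separation
    return (dist < cutoff) & far_in_sequence

# Toy hairpin: two antiparallel strands of three residues, 3.8 apart.
hairpin = np.array([
    [0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0],
    [7.6, 3.8, 0.0], [3.8, 3.8, 0.0], [0.0, 3.8, 0.0],
])
hairpin_contacts = contact_map_from_coords(hairpin)
```

Excluding near neighbors in sequence keeps trivially close residues (i, i+1, i+2) from dominating the map, so only contacts that reflect the fold remain.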

Educational Notes

Hydrophobic Core Collapse

Solvent Accessible Surface Area (SASA) links 3D structure back to 1D sequence constraints. If a residue is buried deep inside the protein core (low SASA), charged and polar substitutions at that position are strongly selected against. Placing a hydrophilic amino acid in the water-free hydrophobic core strips it of its hydrogen-bonding partners in the solvent, which incurs a large desolvation penalty and destabilizes the fold. The msa module enforces this by penalizing hydrophilic mutations at buried positions.
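
A burial penalty of this kind could look like the sketch below, which uses the Kyte-Doolittle hydropathy scale to score how hydrophilic a residue is. This is only an illustration under stated assumptions: the function name `burial_penalty`, the relative-SASA threshold of 0.2, and the linear penalty are hypothetical and may differ from what the module does internally.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5,
    "M": 1.9, "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8,
    "W": -0.9, "Y": -1.3, "P": -1.6, "H": -3.2, "E": -3.5,
    "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

def burial_penalty(residue, relative_sasa, buried_threshold=0.2, weight=1.0):
    """Energy penalty for a hydrophilic residue at a buried position.

    Exposed positions (relative SASA at or above the threshold) are
    never penalized; buried positions are penalized in proportion to
    how hydrophilic the residue is.
    """
    if relative_sasa >= buried_threshold:
        return 0.0
    return weight * max(0.0, -KYTE_DOOLITTLE[residue])

buried_leu = burial_penalty("L", relative_sasa=0.05)   # hydrophobic, buried
buried_lys = burial_penalty("K", relative_sasa=0.05)   # charged, buried
exposed_lys = burial_penalty("K", relative_sasa=0.60)  # charged, exposed
```

Added to the site term of the energy, a penalty like this makes the sampler reject most lysine or aspartate proposals in the core while leaving surface positions free to vary.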

Electrostatic Compatibility

Proteins use salt bridges, contacts between oppositely charged side chains, to lock their tertiary folds into stable, lower-energy states. Conversely, placing two like charges in close proximity causes strong Coulombic repulsion. The Potts model in this module rewards opposite-charge pairs and heavily penalizes like-charge pairs at contacting residues.
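
The charge rule described above can be sketched as a pairwise energy summed over contacts. The names `pair_charge_energy` and `electrostatic_energy`, the energy values, and the choice to treat histidine as neutral are assumptions for this example rather than the module's actual coupling parameters.

```python
import numpy as np

# Formal side-chain charges at neutral pH (histidine treated as neutral here).
CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1}

def pair_charge_energy(a, b, attract=-1.0, repel=2.0):
    """Negative energy for opposite charges, positive for like charges."""
    product = CHARGE.get(a, 0) * CHARGE.get(b, 0)
    if product < 0:
        return attract
    if product > 0:
        return repel
    return 0.0

def electrostatic_energy(sequence, contact_map, **kwargs):
    """Sum the pairwise charge energy over all contacting pairs i < j."""
    n = len(sequence)
    return sum(
        pair_charge_energy(sequence[i], sequence[j], **kwargs)
        for i in range(n)
        for j in range(i + 1, n)
        if contact_map[i, j]
    )

# Toy contact map: only residues 0 and 3 touch.
toy_contacts = np.zeros((4, 4), dtype=bool)
toy_contacts[0, 3] = toy_contacts[3, 0] = True

salt_bridge = electrostatic_energy("KAAE", toy_contacts)  # K+ with E-
repulsion = electrostatic_energy("KAAR", toy_contacts)    # K+ with R+
neutral = electrostatic_energy("AAAA", toy_contacts)
```

Making the repulsion larger in magnitude than the attraction mirrors the asymmetry in the text: like-charge contacts are penalized more heavily than salt bridges are rewarded.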

The "Magic Step" Coupled Mutation

In traditional MCMC, only one site is mutated at a time. In evolution, however, getting from a [Large:Small] pair to a [Small:Large] pair one site at a time is effectively blocked if the intermediate [Large:Large] state causes a massive steric clash, because the sampler almost always rejects that intermediate. The "Magic Step" proposes mutations at two contacting residues simultaneously, which lets the simulation cross these steric barriers and reproduce the pairwise covariation that Direct Coupling Analysis detects.
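
A coupled two-site move can be sketched as follows. The proposal picks a contacting pair uniformly and redraws both states uniformly, so it is symmetric and the plain Metropolis acceptance rule applies. Everything here is a toy under stated assumptions: `coupled_step`, `packing_energy`, the two-state Small/Large alphabet, and the penalty of 10 are hypothetical and are not taken from the module.

```python
import numpy as np

def packing_energy(seq):
    # Toy energy (hypothetical): 0 = Small, 1 = Large. Any pair that is not
    # one Small plus one Large (clash or cavity) costs 10 energy units.
    return 10.0 * (seq[0] + seq[1] - 1) ** 2

def coupled_step(seq, contacts, energy_fn, temperature, n_states, rng):
    """Propose new states at both residues of a random contacting pair."""
    pairs = np.argwhere(np.triu(contacts, k=1))  # each contact once, i < j
    if len(pairs) == 0:
        return False
    i, j = pairs[rng.integers(len(pairs))]
    proposal = seq.copy()
    proposal[i] = rng.integers(n_states)
    proposal[j] = rng.integers(n_states)
    dE = energy_fn(proposal) - energy_fn(seq)
    # Symmetric proposal, so accept with probability min(1, exp(-dE / T)).
    if dE <= 0 or rng.random() < np.exp(-dE / temperature):
        seq[:] = proposal
        return True
    return False

rng = np.random.default_rng(0)
contacts = np.array([[False, True], [True, False]])
seq = np.array([1, 0])  # start at [Large:Small]
visited = set()
for _ in range(200):
    coupled_step(seq, contacts, packing_energy, temperature=0.5,
                 n_states=2, rng=rng)
    visited.add(tuple(int(s) for s in seq))
```

At this low temperature a single-site sampler would have to accept an intermediate costing 10 energy units, which happens with probability exp(-20), whereas the coupled move swaps directly between the two complementary states at zero energy cost.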

References

  • Direct Coupling Analysis (DCA): Morcos, F., et al. (2011). "Direct-coupling analysis of residue coevolution captures native contacts across many protein families." Proceedings of the National Academy of Sciences (PNAS). DOI: 10.1073/pnas.1111471108
  • Potts Models in Evolution: Weigt, M., et al. (2009). "Identification of direct residue contacts in protein-protein interaction by message passing." PNAS. DOI: 10.1073/pnas.0805923106

See Also