UNAAGI: Atom-Level Diffusion for Generating Non-Canonical Amino Acid Substitutions
Abstract
Proposing beneficial amino acid substitutions, whether for mutational effect prediction or protein engineering, remains a central challenge in structural biology. Recent inverse folding models, trained to reconstruct sequences from structure, have had considerable impact in identifying functional mutations. However, current approaches are constrained to designing sequences composed exclusively of natural amino acids (NAAs). The larger set of non-canonical amino acids (NCAAs), which offer greater chemical diversity and are frequently used in in vivo protein engineering, remains largely inaccessible to current variant effect prediction methods. To address this gap, we introduce UNAAGI, a diffusion-based generative model that reconstructs residue identities from atomic-level structure using an E(3)-equivariant framework. By modeling side chains in full atomic detail rather than as discrete tokens, UNAAGI enables the exploration of both canonical and non-canonical amino acid substitutions within a unified generative paradigm. We evaluate our method on experimentally benchmarked mutation effect datasets and demonstrate that it achieves substantially improved performance on NCAA substitutions compared to the current state-of-the-art. Furthermore, our results suggest a shared methodological foundation between protein engineering and structure-based drug design, opening the door for a unified training framework across these domains.
Summary
The paper introduces UNAAGI, a novel diffusion-based generative model for proposing amino acid substitutions in proteins, including non-canonical amino acids (NCAAs). The core problem addressed is the limitation of current inverse folding models, which are restricted to using only natural amino acids (NAAs) and are unable to leverage the wider chemical diversity offered by NCAAs. UNAAGI overcomes this limitation by modeling side chains in full atomic detail, rather than as discrete tokens, using an E(3)-equivariant framework. This enables the exploration of both canonical and non-canonical amino acid substitutions within a unified generative paradigm. The authors evaluate UNAAGI on experimentally benchmarked mutation effect datasets, demonstrating improved performance on NCAA substitutions compared to the state-of-the-art. Specifically, they correlate the generative likelihoods of UNAAGI with experimentally measured mutational effects. The key contribution is the development of a generative model capable of suggesting and evaluating NCAA substitutions, which is significant because NCAAs offer enhanced functionalities for protein engineering and synthetic biology. Furthermore, the paper highlights the potential for a shared methodological foundation between protein engineering and structure-based drug design (SBDD) due to the similar non-covalent interaction principles governing both fields.
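The evaluation described above, correlating model scores with measured mutational effects, can be illustrated with a minimal sketch. It assumes that the fraction of samples at a site matching a candidate residue serves as a proxy for the generative likelihood, and that rank (Spearman) correlation is the metric; the summary specifies neither, so both are assumptions. The function names, residue codes, and numeric values below are toy placeholders, not data from the paper.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def substitution_scores(samples, candidates):
    """Score each candidate by the fraction of samples that matched it.

    Assumption: sampling frequency stands in for the model's likelihood.
    """
    counts = Counter(samples)
    return [counts[c] / len(samples) for c in candidates]


# Toy data (hypothetical): ten residues sampled at one site, and made-up
# experimental effect scores for four candidate substitutions.
toy_samples = ["PHE"] * 6 + ["TYR"] * 3 + ["NLE"]
toy_candidates = ["PHE", "TYR", "NLE", "ABU"]
toy_experimental = [1.0, 0.7, 0.2, -0.5]

toy_scores = substitution_scores(toy_samples, toy_candidates)
toy_rho = spearman(toy_scores, toy_experimental)
```

Because the toy frequencies and toy effect scores are ordered identically, the rank correlation here is perfect; real benchmark data would be far noisier.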
Key Insights
- UNAAGI employs an E(3)-equivariant diffusion framework for atom-wise side-chain generation, providing a new approach to residue identity inference that goes beyond discrete token modeling.
- The model demonstrates the ability to generalize to NCAAs that are chemically proximal to NAAs, suggesting a degree of transfer learning from the evolutionary selection pressure encoded in natural proteins.
- UNAAGI shows meaningful correlation with experimental mutational effects for NCAA substitutions, representing a substantial improvement over existing methods like NCFlow, which showed poor or negative correlations.
- On a subset of ProteinGym, UNAAGI maintains reasonable performance despite using a more expressive output distribution (atom-wise generation) compared to token-based models, showing promise for wider applications.
- Ablation studies reveal that removing the independent NCAA data during training leads to a significant drop in performance, highlighting the importance of even limited NCAA data for model generalization.
- Qualitative analysis of sampled NCAAs reveals that UNAAGI generates amino acids with diverse chemistries but tends to produce NCAAs that are structurally similar to or interpolate between natural amino acids.
- UNAAGI achieves a much higher wild-type coverage rate (0.9368) than PepINVENT (0.2365), indicating a greater capacity to reconstruct the original amino acid residue during sampling.
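The wild-type coverage rate in the last point can be sketched as follows. The summary does not give the exact definition, so this assumes coverage is the fraction of positions at which the wild-type residue appears at least once among the residues sampled for that position. The function name and the toy residue lists are hypothetical.

```python
def wild_type_coverage(wild_type, sampled):
    """Fraction of positions whose wild-type residue appears in the samples.

    wild_type: list of residue names, one per position.
    sampled:   list of lists; sampled[i] holds residues sampled at position i.
    Assumption: one match among a position's samples counts as covered.
    """
    if len(wild_type) != len(sampled):
        raise ValueError("wild_type and sampled must have the same length")
    if not wild_type:
        raise ValueError("at least one position is required")
    hits = sum(1 for wt, samples in zip(wild_type, sampled) if wt in samples)
    return hits / len(wild_type)


# Toy example (hypothetical data): the wild type is recovered at three
# of the four positions.
toy_wild_type = ["ALA", "PHE", "LYS", "SER"]
toy_sampled = [
    ["ALA", "GLY"],
    ["TYR", "PHE"],
    ["ARG", "ORN"],  # wild-type LYS never sampled here
    ["SER"],
]
toy_coverage = wild_type_coverage(toy_wild_type, toy_sampled)
```

Under this definition, coverage depends on how many samples are drawn per position, so rates are only comparable across methods when the sampling budget is matched.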
Practical Implications
- UNAAGI can be applied to protein engineering for designing proteins with enhanced or novel functionalities by incorporating NCAAs, which offer greater chemical diversity than NAAs.
- Researchers and engineers in synthetic biology and therapeutics can benefit from UNAAGI's ability to predict the effects of NCAA substitutions, enabling the design of proteins with tailored properties.
- The model suggests a potential convergence between protein engineering and structure-based drug design, opening avenues for developing unified training frameworks that leverage techniques from both fields.
- The approach of modeling side chains in full atomic detail can be extended to other areas of protein design and prediction, such as predicting protein-ligand binding affinities or designing novel protein structures.
- Future research can focus on scaling UNAAGI, incorporating more NCAA data, and applying guidance techniques to enhance the generation of chemically diverse NCAAs with desirable properties.