Features
DDGemb combines the ESM2 protein language model (Lin et al., 2023), used for protein and variant representation, with a deep-learning architecture based on a Transformer encoder (Vaswani et al., 2017) to predict the ΔΔG of single- and multi-point variations.
The input variation is first encoded as the difference between the ESM2 embeddings of the wild-type and variant sequences. The resulting difference matrix is then passed as input to a deep architecture that includes convolutional and Transformer encoder layers.
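The encoding step can be illustrated with a minimal sketch. It assumes that each sequence has already been embedded as a matrix with one row per residue, and that the difference is taken as variant minus wild-type; the sign convention, the function name `embedding_difference` and the toy values are assumptions made for illustration, not DDGemb's actual code or real ESM2 output.

```python
def embedding_difference(wild_type, variant):
    """Element-wise difference (variant - wild_type) of two L x d matrices.

    Both arguments are lists of rows, one row per residue. The sign
    convention is an assumption made for this sketch.
    """
    if len(wild_type) != len(variant):
        raise ValueError("both sequences must have the same length")
    diff = []
    for wt_row, var_row in zip(wild_type, variant):
        if len(wt_row) != len(var_row):
            raise ValueError("embedding dimensions differ")
        diff.append([v - w for w, v in zip(wt_row, var_row)])
    return diff


# Toy 3-residue, 2-dimensional embeddings (placeholders, not real ESM2 output).
wt = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
var = [[1.0, 2.0], [1.5, 0.0], [3.0, 1.5]]
delta = embedding_difference(wt, var)
```

Because substitutions do not change the sequence length, the two matrices have the same shape, so the difference is well defined for both single- and multi-point variations.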
Input
From the Home Page, a standard DDGemb job including up to 100 variations on a single protein sequence can be submitted. This requires:
- A protein sequence in FASTA format (shorter than 2,500 residues);
- A list of variations in the XPOSY format, one per line (e.g. S11C). Multi-point variations are reported on the same line, comma-separated (e.g. S11C,F18L).
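As an illustration of the XPOSY format, the sketch below splits one line of the variation list into its substitutions. It assumes that X and Y are uppercase one-letter codes of the 20 standard amino acids and that POS is a positive integer; the name `parse_variation_line` is hypothetical and is not part of DDGemb.

```python
import re

# Assumption: only the 20 standard amino acids, in uppercase, are accepted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VARIATION_RE = re.compile(rf"^([{AMINO_ACIDS}])([1-9]\d*)([{AMINO_ACIDS}])$")


def parse_variation_line(line):
    """Parse 'S11C' or 'S11C,F18L' into (wild_type, position, variant) tuples."""
    substitutions = []
    for token in line.strip().split(","):
        match = VARIATION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"not in XPOSY format: {token!r}")
        wild_type, position, variant = match.groups()
        substitutions.append((wild_type, int(position), variant))
    return substitutions


single = parse_variation_line("S11C")
multi = parse_variation_line("S11C,F18L")
```

A single-point variation yields a list with one tuple, and a multi-point variation yields one tuple per substitution.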
From the Batch job Page, it is possible to analyze up to 2,000 variations (either single- or multi-point) occurring on at most 500 protein sequences. This requires:
- A FASTA file containing the set of protein sequences;
- A variation file containing the list of variations, one per line. Each line must contain the ID of the protein (as reported in the FASTA file) followed by a variation in the XPOSY format (e.g. P02417 I115A). Multi-point variations are reported on the same line, comma-separated (e.g. P02417 F5V,K12M).
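A batch variation file of this kind could be read as in the following sketch, which groups variations by protein ID and enforces the limits stated above. It assumes that the ID and the variation are separated by whitespace and that there are no spaces after the commas; the function name `read_batch_variations` is hypothetical.

```python
from collections import defaultdict

MAX_VARIATIONS = 2000  # limit stated for batch jobs
MAX_PROTEINS = 500     # limit stated for batch jobs


def read_batch_variations(text):
    """Group the variations of a batch file by protein ID.

    Each non-empty line is expected to be '<protein ID> <variation>'.
    Returns {protein_id: [[substitution, ...], ...]}.
    """
    by_protein = defaultdict(list)
    total = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected '<protein ID> <variation>'")
        protein_id, variation = fields
        by_protein[protein_id].append(variation.split(","))
        total += 1
    if total > MAX_VARIATIONS:
        raise ValueError(f"too many variations: {total} > {MAX_VARIATIONS}")
    if len(by_protein) > MAX_PROTEINS:
        raise ValueError(f"too many proteins: {len(by_protein)} > {MAX_PROTEINS}")
    return dict(by_protein)


example = "P02417 I115A\nP02417 F5V,K12M\n"
grouped = read_batch_variations(example)
```

Each line counts as one variation toward the limit, whether it is single- or multi-point.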
All variations are checked against the protein sequence: positions must fall within the protein length, and the wild-type residues must match those found in the sequence. Once validated, the job is submitted.
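The two checks described here can be sketched as follows, assuming that positions are 1-based (the first residue is position 1). The function name `check_variation`, the error messages and the example sequence are illustrative and do not reflect the server's actual implementation.

```python
def check_variation(sequence, wild_type, position):
    """Return None if the variation is consistent with the sequence,
    otherwise a short description of the problem.

    Assumption: positions are 1-based.
    """
    if not 1 <= position <= len(sequence):
        return f"position {position} is outside the protein length ({len(sequence)})"
    found = sequence[position - 1]
    if found != wild_type:
        return f"expected {wild_type} at position {position}, found {found}"
    return None


# Made-up 11-residue sequence with a serine at position 11.
sequence = "MKTAYIAKQRS"
ok = check_variation(sequence, "S", 11)           # consistent with the sequence
out_of_range = check_variation(sequence, "S", 12) # beyond the protein length
mismatch = check_variation(sequence, "A", 11)     # wrong wild-type residue
```

For a multi-point variation, the same check would be applied to each substitution in turn.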
Output
The results page of a standard DDGemb job has four main sections:
- Job Information:
General information about the job is shown, including the Job ID, the dates of submission and completion, the protein ID, the protein length, and the number of single-point and multi-point variants submitted.
Three buttons are also available to copy the URL of the results page and to download the results in JSON and TSV formats.
- Predicted single-point variant ΔΔG:
If any single-point variants were submitted, a table displays the predicted value for each of them. The table has four columns: the wild-type residue, the variant position, the variant residue and the predicted ΔΔG.
- Single-point variant sequence mapping:
If any single-point variants were submitted, they are also visualized with the neXtProt feature viewer. Each variant is shown at its position along the protein sequence and is colored according to its predicted class (Destabilizing, Weakly destabilizing, Neutral, Weakly stabilizing, Stabilizing).
- Predicted multi-point variant ΔΔG:
If any multi-point variants were submitted, a table displays the predicted value for each of them. In this case, the table has two columns: the variations and the predicted ΔΔG.
For Batch jobs, the results page instead shows two buttons to download the results in JSON and TSV formats.