In generating enzyme sequence similarity networks, choosing an appropriate alignment score is crucial for clustering sequences accurately to represent isofunctional families.
"Networks are best interpreted with an alignment score upper limit that gathers the sequences into clusters that represent families with only a single function."
"If a sufficient number of annotations is available, the optimum alignment score is determined empirically by mapping known functions onto the network."
"The length histogram allows an assessment of length heterogeneity in your dataset."
Key insights
Choosing an Appropriate Alignment Score
The alignment score used should be an empirical decision based on mapping known functions onto the network.
An alignment score that is too large fractures isofunctional families into multiple clusters, while one that is too small merges multiple families into a single cluster.
Generate the initial sequence similarity network (SSN) with a lower alignment score than you expect to need, so that isofunctional families are not split at the outset.
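The effect of the threshold on clustering can be illustrated with a minimal sketch. It assumes an edge is kept only when its alignment score is at or above the threshold, and it treats clusters as connected components. The node names, scores, and the function clusters_at_threshold are hypothetical and exist only for illustration; they are not part of any SSN tool.

```python
from collections import defaultdict

def clusters_at_threshold(nodes, edges, min_score):
    """Group nodes into connected components, keeping only edges
    whose alignment score is >= min_score."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= min_score:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))

# Hypothetical toy data: two families (A*, B*) joined by one weak edge.
nodes = ["A1", "A2", "A3", "B1", "B2"]
edges = [
    ("A1", "A2", 80), ("A2", "A3", 75), ("A1", "A3", 60),
    ("B1", "B2", 90),
    ("A3", "B1", 30),  # weak cross-family edge
]

for threshold in (20, 50, 85):
    print(threshold, clusters_at_threshold(nodes, edges, threshold))
```

In this toy data, a threshold of 20 merges both families into one cluster, 50 separates them cleanly, and 85 breaks family A into singletons. This is the behavior the guidance above describes for scores that are too small or too large.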
Dataset Analysis for Alignment Score Selection
The number-of-edges histogram, the length histogram, and the quartile plots guide the selection of the alignment score for generating SSNs.
The length histogram shows how heterogeneous the sequence lengths in the dataset are, which matters when choosing the alignment score.
Severely truncated fragments can distort the quartile plots and, with them, the alignment score selection.
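As a rough sketch of the length assessment, the snippet below bins sequence lengths into a histogram and flags likely fragments. The bin width, the 0.5 cutoff relative to the median length, and the example lengths are all illustrative assumptions, not values recommended by the source.

```python
from collections import Counter
from statistics import median

def length_histogram(lengths, bin_width=50):
    """Count sequences per length bin; keys are each bin's lower edge."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))

def flag_fragments(lengths, min_fraction=0.5):
    """Return lengths shorter than min_fraction of the median length.
    The 0.5 default is an illustrative assumption only."""
    cutoff = min_fraction * median(lengths)
    return [n for n in lengths if n < cutoff]

# Hypothetical sequence lengths (residues).
lengths = [310, 295, 320, 305, 120, 298, 315, 90, 300]

print(length_histogram(lengths))
print(flag_fragments(lengths))
```

Here most sequences fall in the 250 to 349 range, and the two short entries (120 and 90) stand out as candidates to inspect before trusting the quartile plots.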
Examples of Alignment Score Selection
Example 1: Single-Domain Proteins
For single-domain proteins, the initial alignment score should correspond to ~35% sequence identity.
Focus on the portions of the quartile plots where the alignment score calculations are based on the full length of the sequence.
Starting from that initial network, use the filter function in Cytoscape to remove edges with alignment scores below a chosen threshold larger than the initial value, which produces more stringent SSNs.
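One way to read an initial alignment score off percent-identity data is sketched below. It assumes you have (alignment score, percent identity) pairs for pairwise comparisons and picks the smallest score whose median identity reaches the ~35% target. The data and function names are hypothetical; in practice this value is read from the quartile plot rather than computed this way.

```python
from collections import defaultdict
from statistics import median

def median_identity_by_score(pairs):
    """Map each alignment score to the median percent identity of its pairs."""
    by_score = defaultdict(list)
    for score, pct_id in pairs:
        by_score[score].append(pct_id)
    return {s: median(v) for s, v in sorted(by_score.items())}

def initial_score(pairs, target_identity=35.0):
    """Smallest alignment score whose median percent identity reaches
    the target, or None if no score does."""
    for score, med in median_identity_by_score(pairs).items():
        if med >= target_identity:
            return score
    return None

# Hypothetical (alignment score, percent identity) pairs.
pairs = [(20, 24.0), (20, 27.0), (30, 29.0), (30, 33.0),
         (40, 34.0), (40, 37.0), (50, 41.0), (50, 44.0)]

score = initial_score(pairs)
# Keep only pairs at or above the chosen score, mimicking an edge filter.
kept = [p for p in pairs if p[0] >= score]
print(score, len(kept))
```

With this toy data the median identity first reaches 35% at an alignment score of 40, and four of the eight pairs survive the filter.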
Example 2: Multi-Domain Proteins
Multi-domain proteins may require a higher initial alignment score, such as 100, to accurately represent isofunctional clusters.
Bimodal length histograms and quartile plots indicate that alignment score selection is more complex for multi-domain proteins.
Aligning to single-domain lengths and monitoring the percent identity can help select the appropriate alignment score for generating networks.
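A possible way to monitor percent identity at single-domain lengths is sketched below. It is one interpretation of the guidance above, not a documented procedure: it keeps only alignments whose length falls within a tolerance of an assumed single-domain length and reports the median percent identity per alignment score. The domain length of 250, the 20% tolerance, and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def identity_at_domain_length(alignments, domain_length, tolerance=0.2):
    """Median percent identity per alignment score, using only alignments
    whose length is within +/- tolerance of the assumed domain length."""
    shortest = domain_length * (1 - tolerance)
    longest = domain_length * (1 + tolerance)
    by_score = defaultdict(list)
    for score, aln_len, pct_id in alignments:
        if shortest <= aln_len <= longest:
            by_score[score].append(pct_id)
    return {s: median(v) for s, v in sorted(by_score.items())}

# Hypothetical (alignment score, alignment length, percent identity) rows.
alignments = [
    (60, 240, 30.0), (60, 260, 32.0), (60, 520, 28.0),
    (100, 250, 38.0), (100, 245, 40.0), (100, 510, 35.0),
]

result = identity_at_domain_length(alignments, domain_length=250)
print(result)
```

The two-domain alignments (lengths above 500) are excluded, so the reported identities reflect only comparisons over roughly one domain.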
Make it stick
💡 Choosing the right alignment score is key for accurately clustering enzyme sequences.
💡 Utilize quartile plots and length histograms to guide alignment score selection.
💡 For single-domain proteins, start with an alignment score corresponding to ~35% sequence identity.
💡 Multi-domain proteins may require a higher initial alignment score, such as 100, for accurate clustering.