EFI - Enzyme Similarity Tool

The Nugget

  • When generating an enzyme sequence similarity network (SSN), the alignment score threshold determines whether the resulting clusters correspond to isofunctional families, so it must be chosen with care.

Key quotes

  • "Networks are best interpreted with an alignment score upper limit that gathers the sequences into clusters that represent families with only a single function."
  • "If a sufficient number of annotations is available, the optimum alignment score is determined empirically by mapping known functions onto the network."
  • "The length histogram allows an assessment of length heterogeneity in your dataset."

Key insights

Choosing an Appropriate Alignment Score

  • Choose the alignment score empirically, by mapping known functions onto the network.
  • A threshold that is too large fractures isofunctional families into several clusters; one that is too small merges multiple families into a single cluster.
  • Generate the initial SSN with a lower (more permissive) alignment score so that isofunctional families are not split; the threshold can be raised afterwards.
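
The effect of the threshold can be illustrated with a small sketch. It assumes an SSN can be treated as a list of (node, node, alignment score) edges, that an edge is kept when its score is at or above the threshold, and that clusters are connected components. The node names, scores, and the function name clusters_at_threshold are hypothetical and are not taken from EFI-EST.

```python
from collections import defaultdict

def clusters_at_threshold(nodes, edges, min_score):
    """Return connected components using only edges with score >= min_score."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= min_score:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Toy data (hypothetical): family A and family B joined by one weak edge.
nodes = ["A1", "A2", "A3", "B1", "B2"]
edges = [
    ("A1", "A2", 80),
    ("A2", "A3", 75),
    ("A1", "A3", 60),
    ("B1", "B2", 90),
    ("A3", "B1", 40),  # weak cross-family edge
]

for threshold in (30, 50, 85):
    print(threshold, clusters_at_threshold(nodes, edges, threshold))
```

With this toy data, a threshold of 30 merges both families into one cluster, 50 separates them, and 85 fractures family A into singletons, which mirrors the trade-off described above.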

Dataset Analysis for Alignment Score Selection

  • The number of edges histogram, length histogram, and quartile plots are used to guide the selection of the alignment score for generating SSNs.
  • Assess the length heterogeneity of the dataset with the length histogram before settling on an alignment score.
  • Severely truncated fragments can distort the quartile plots and therefore the alignment score selection.
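
As a rough illustration of the length check, the sketch below bins sequence lengths into a histogram and flags likely fragments. The bin width, the cutoff of half the median length, and the function names are assumptions made for this example; EFI-EST produces its own plots and may define fragments differently.

```python
from collections import Counter
from statistics import median

def length_histogram(lengths, bin_width=50):
    """Count sequences per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))

def flag_fragments(lengths, min_fraction=0.5):
    """Return lengths shorter than min_fraction * median (hypothetical cutoff)."""
    cutoff = min_fraction * median(lengths)
    return [n for n in lengths if n < cutoff]

# Toy sequence lengths (hypothetical): mostly ~300 residues plus two fragments.
lengths = [310, 325, 298, 305, 120, 330, 315, 95, 302]

print(length_histogram(lengths))  # bins with counts
print(flag_fragments(lengths))    # lengths that look like truncated fragments
```

Here the two short sequences fall well below half the median length and would be candidates for exclusion before the quartile plots are interpreted.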

Examples of Alignment Score Selection

Example 1: Single Domain Proteins

  • For single-domain proteins, start with an alignment score that corresponds to ~35% sequence identity.
  • Focus on the portions of the quartile plots where the alignment score is calculated over the full length of the sequence.
  • Use the filter function in Cytoscape to remove edges with alignment scores smaller than a chosen, more stringent value, which generates SSNs at larger alignment scores from the initial network.
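
Reading the ~35% identity point off a quartile plot can be sketched as follows. The example assumes the plot data is available as a mapping from alignment score to a list of percent identity values, and it uses the median as the summary statistic. The data and the function name score_for_identity are hypothetical.

```python
from statistics import median

def score_for_identity(identity_by_score, target_identity=35.0):
    """Smallest alignment score whose median percent identity reaches the target."""
    for score in sorted(identity_by_score):
        if median(identity_by_score[score]) >= target_identity:
            return score
    return None  # target never reached in this dataset

# Toy quartile-plot data (hypothetical): alignment score -> percent identities.
identity_by_score = {
    20: [22.0, 25.0, 27.0],
    40: [28.0, 31.0, 33.0],
    60: [34.0, 36.0, 39.0],
    80: [41.0, 44.0, 48.0],
}

print(score_for_identity(identity_by_score))
```

For this toy data the medians are 25, 31, 36, and 44, so the first score to reach 35% identity is 60, which would be the starting alignment score.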

Example 2: Multi-Domain Proteins

  • Multi-domain proteins may require an initial alignment score of 100 to accurately represent isofunctional clusters.
  • Bimodal length histograms and quartile plots signal that alignment score selection is more complex for multi-domain proteins.
  • Focusing on alignments whose length matches a single domain, and monitoring percent identity there, helps select an appropriate alignment score for generating networks.
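
One way to picture the single-domain focus is to restrict the percent identity summary to alignments whose length falls inside a single-domain window. This is only a sketch: the (alignment score, alignment length, percent identity) record layout, the 200 to 300 residue window, and the function name are assumptions for illustration, not EFI-EST output formats.

```python
from collections import defaultdict
from statistics import median

def median_identity_in_window(alignments, min_len, max_len):
    """Median percent identity per alignment score, using only alignments
    whose length falls inside [min_len, max_len]."""
    by_score = defaultdict(list)
    for score, aln_len, identity in alignments:
        if min_len <= aln_len <= max_len:
            by_score[score].append(identity)
    return {s: median(v) for s, v in sorted(by_score.items())}

# Toy records (hypothetical): (alignment score, alignment length, % identity).
alignments = [
    (60, 240, 30.0),
    (60, 250, 32.0),
    (60, 610, 24.0),   # full multi-domain alignment, excluded by the window
    (100, 245, 38.0),
    (100, 255, 40.0),
    (100, 600, 29.0),  # full multi-domain alignment, excluded by the window
]

print(median_identity_in_window(alignments, 200, 300))
```

In this toy data the single-domain alignments reach a median identity above 35% only at an alignment score of 100, consistent with the higher starting score suggested for multi-domain proteins.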

Make it stick

  • 💡 Choosing the right alignment score is key for accurately clustering enzyme sequences.
  • 💡 Utilize quartile plots and length histograms to guide alignment score selection.
  • 💡 For single-domain proteins, start with an alignment score corresponding to ~35% sequence identity.
  • 💡 Multi-domain proteins may require a higher initial alignment score, such as 100, for accurate clustering.