Computational Complexity of Dissimilarity Measures: A Comprehensive Guide
Introduction
When comparing sequences, whether they represent career trajectories, family formation patterns, or health states, you need efficient algorithms that can scale to relatively large datasets. Computational complexity tells you how the runtime of an algorithm grows as your data grows, helping you choose the right dissimilarity measure for your problem size.
This tutorial explains what computational complexity is, why it matters (especially for big data), and provides a detailed breakdown of the complexity of different dissimilarity measures implemented in Sequenzo. We'll start with the fundamentals and then dive into specific algorithms, showing you exactly how to analyze their performance.
What is Computational Complexity?
Computational complexity (also called time complexity) describes how the runtime of an algorithm increases as the size of the input increases. It's expressed using Big O notation, which focuses on the dominant term and ignores constants and lower-order terms.
Key Concepts
Big O Notation
Big O notation describes the worst-case or average-case behavior of an algorithm:
- O(1): Constant time—runtime doesn't depend on input size
- O(log n): Logarithmic time—runtime grows slowly (e.g., binary search)
- O(n): Linear time—runtime grows proportionally with input size
- O(n log n): Linearithmic time—common in efficient sorting algorithms
- O(n²): Quadratic time—runtime grows with the square of input size
- O(n³): Cubic time—runtime grows with the cube of input size
- O(2ⁿ): Exponential time—runtime doubles with each additional input element
Space Complexity
Space complexity describes how much memory an algorithm needs. It's also expressed in Big O notation and follows the same principles as time complexity.
What Complexity Includes
When analyzing dissimilarity measures, complexity includes:
- Per-pair computation: The cost of comparing two sequences
- Pairwise matrix construction: The cost of computing all pairwise distances
- Preprocessing: Any setup costs (e.g., building substitution cost matrices)
- Memory usage: Storage requirements for intermediate results
Why Complexity Matters: The Big Data Perspective
Understanding complexity becomes critical when working with large datasets:
The Scaling Problem
Consider computing a distance matrix for sequences:
- Small dataset (100 sequences): ~5,000 pairwise comparisons
- Medium dataset (1,000 sequences): ~500,000 pairwise comparisons
- Large dataset (10,000 sequences): ~50,000,000 pairwise comparisons
If each comparison takes 1 millisecond:
- Small: 5 seconds
- Medium: 8.3 minutes
- Large: 13.9 hours
But if the algorithm is O(n²) per pair instead of O(n), and sequences average 50 elements:
- Small: 5 seconds × 50 = 4.2 minutes
- Medium: 8.3 minutes × 50 = 6.9 hours
- Large: 13.9 hours × 50 = 29 days
Real-World Impact
- Resource constraints: Memory and CPU limits become bottlenecks
- User experience: Long wait times make interactive analysis impossible
- Cost: Cloud computing costs scale with runtime
- Feasibility: Some algorithms become impractical for large datasets
Choosing the Right Algorithm
Understanding complexity helps you:
- Select appropriate measures for your dataset size
- Estimate runtime before starting computation
- Optimize workflows by choosing faster alternatives when possible
- Plan infrastructure (e.g., parallelization, distributed computing)
Classification of Dissimilarity Measures
Dissimilarity measures in Sequenzo can be classified into several categories based on their algorithmic approach:
1. Edit Distance Measures
These measures compute the minimum cost to transform one sequence into another through insertions, deletions, and substitutions.
Examples: OM (Optimal Matching), OMloc, OMslen, OMspell, OMstran, TWED
2. Prefix/Suffix Matching Measures
These measures compare sequences position-by-position from the start (prefix) or end (suffix).
Examples: LCP (Longest Common Prefix), RLCP (Reverse LCP), LCPspell, RLCPspell
3. Subsequence Matching Measures
These measures find the longest common subsequence (not necessarily contiguous).
Examples: LCS (Longest Common Subsequence)
4. Distribution-Based Measures
These measures compare state distributions or feature vectors rather than sequence structure.
Examples: CHI2 (Chi-square), EUCLID (Euclidean)
5. Position-Based Measures
These measures compare sequences element-by-element at corresponding positions.
Examples: HAM (Hamming distance), DHD (Dynamic Hamming Distance)
6. Spell-Based Measures
These measures operate on spell representations (consecutive state occurrences).
Examples: OMspell, LCPspell, NMS (Number of Matching Spells), NMSMST, SVRspell
Complexity Summary Table
The following table summarizes the computational complexity of different dissimilarity measures. Here, n is the number of sequences, m is the average sequence length, and k is the number of features (for distribution-based measures).
| Measure | Per-Pair Complexity | Full Matrix Complexity | Space Complexity | Notes |
|---|---|---|---|---|
| LCP / RLCP | O(min(m₁, m₂)) | O(n² × m) | O(n²) | Linear scan, very fast |
| HAM | O(m) | O(n² × m) | O(n²) | Simple element-wise comparison |
| CHI2 / EUCLID | O(k) | O(n² × k) | O(n² + n×k) | k = number of features/columns |
| LCS | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | Dynamic programming, quadratic per pair |
| OM | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | Dynamic programming with substitution costs |
| OMloc | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | OM with local spell transformation |
| OMslen | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | OM with sequence length normalization |
| OMspell | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | OM on spell representation |
| OMstran | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | OM with spell transformation |
| TWED | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | Time Warp Edit Distance |
| DHD | O(m₁ × m₂) | O(n² × m²) | O(n² + m²) | Dynamic Hamming Distance |
| LCPspell / RLCPspell | O(min(s₁, s₂)) | O(n² × s) | O(n²) | s = number of spells |
| LCPmst / RLCPmst | O(min(s₁, s₂)) | O(n² × s) | O(n²) | MST variant |
| LCPprod / RLCPprod | O(min(s₁, s₂)) | O(n² × s) | O(n²) | Product variant |
| NMS | O(s₁ × s₂) | O(n² × s²) | O(n² + s²) | Spell matching |
| NMSMST | O(s₁ × s₂) | O(n² × s²) | O(n² + s²) | MST variant |
| SVRspell | O(s₁ × s₂) | O(n² × s²) | O(n² + s²) | Spell-based variant |
Key observations:
- Fastest: LCP/RLCP and HAM are linear per pair
- Moderate: CHI2/EUCLID depend on feature count (often smaller than sequence length)
- Slower: Edit distance measures (OM, LCS, TWED) are quadratic per pair
- Spell-based: Complexity depends on number of spells, which is typically ≤ sequence length
Detailed Algorithm Analysis
Let's examine each category of algorithms in detail, understanding why they have their specific complexities.
Category 1: Prefix/Suffix Matching (LCP, RLCP)
Algorithm: Compare sequences position-by-position from the start (LCP) or end (RLCP) until finding the first mismatch.
Complexity Analysis:
# Pseudocode for LCP
def compute_lcp(seq1, seq2):
length = 0
min_len = min(len(seq1), len(seq2))
# Linear scan: O(min(len(seq1), len(seq2)))
while length < min_len and seq1[length] == seq2[length]:
length += 1
return lengthWhy O(min(m₁, m₂))?
- We scan at most
min(len(seq1), len(seq2))positions - Each comparison is O(1)
- We stop at the first mismatch
Real Implementation (from LCPdistance.cpp):
double compute_distance(int is, int js) {
int m = len;
int n = len;
int minimum = m;
if(n < m) minimum = n;
int length = 0;
auto ptr_seq = sequences.unchecked<2>();
if(sign > 0){ // LCP: forward
while(ptr_seq(is, length) == ptr_seq(js, length) && length < minimum){
length++;
}
} else{ // RLCP: backward
length = 1;
while(ptr_seq(is, (m - length)) == ptr_seq(js, (n - length)) && length <= minimum){
length++;
}
length--;
}
return normalize_distance(n+m-2.0*length, n+m, m, n, norm);
}Performance Characteristics:
- Best case: O(1) if first elements differ
- Average case: O(min(m₁, m₂)/2) for random sequences
- Worst case: O(min(m₁, m₂)) if sequences are identical or differ only at the end
When to Use: Very fast, good for quick comparisons or when prefix/suffix similarity is most important.
Category 2: Element-Wise Comparison (HAM)
Algorithm: Compare sequences element-by-element at corresponding positions, counting mismatches.
Complexity Analysis:
# Pseudocode for Hamming distance
def compute_hamming(seq1, seq2):
if len(seq1) != len(seq2):
raise ValueError("Sequences must have same length")
distance = 0
# Linear scan: O(m)
for i in range(len(seq1)):
if seq1[i] != seq2[i]:
distance += 1
return distanceWhy O(m)?
- We examine each of the m positions exactly once
- Each comparison and increment is O(1)
- Total: m × O(1) = O(m)
Note: HAM requires sequences of equal length. For sequences of different lengths, DHD (Dynamic Hamming Distance) is used, which has O(m₁ × m₂) complexity due to dynamic programming alignment.
Performance Characteristics:
- Best/Worst/Average: O(m) - always linear
- Very predictable runtime
When to Use: Fastest option when sequences have equal length and you want simple position-wise comparison.
Category 3: Distribution-Based Measures (CHI2, EUCLID)
Algorithm: Convert sequences to feature vectors (state distributions), then compute distance between vectors.
Complexity Analysis:
# Pseudocode for CHI2/EUCLID
def compute_chi2(seq1_features, seq2_features, pdotj):
"""
seq1_features: vector of length k (feature counts/proportions)
seq2_features: vector of length k
pdotj: marginal weights of length k
"""
distance = 0.0
# Linear scan over features: O(k)
for i in range(k):
diff = seq1_features[i] - seq2_features[i]
if pdotj[i] > 0:
distance += (diff * diff) / pdotj[i]
return sqrt(distance)Why O(k)?
- k is the number of features (columns in the feature matrix)
- We iterate through k features once
- Each operation (subtraction, multiplication, division) is O(1)
- Total: k × O(1) = O(k)
Real Implementation (from CHI2distance.cpp):
double compute_distance(int i, int j) const {
auto am = allmat_.unchecked<2>();
auto pj = pdotj_.unchecked<1>();
double sum = 0.0;
// O(k) loop over features
for (int c = 0; c < n_cols_; c++) {
double d = am(i, c) - am(j, c);
double m = pj(c);
if (m > 0.0)
sum += (d * d) / m;
}
return std::sqrt(sum);
}Preprocessing Cost:
- Building feature matrix: O(n × m × k) - one-time cost
- Often k << m (fewer features than sequence length), making this efficient
Performance Characteristics:
- Per-pair: O(k) where k is typically much smaller than m
- Preprocessing: O(n × m × k) but done once
- Very efficient when k is small (e.g., k = number of states)
When to Use: Excellent for large datasets when you care about state distributions rather than exact sequence structure.
Category 4: Dynamic Programming Edit Distances (OM, LCS, TWED)
Algorithm: Use dynamic programming to find the minimum cost alignment between two sequences.
Complexity Analysis:
Optimal Matching (OM)
# Pseudocode for OM (simplified)
def compute_om(seq1, seq2, sm, indel):
m, n = len(seq1), len(seq2)
# DP table: dp[i][j] = min cost to align seq1[0:i] with seq2[0:j]
dp = [[0] * (n+1) for _ in range(m+1)]
# Initialize: cost of aligning empty prefixes
for i in range(1, m+1):
dp[i][0] = dp[i-1][0] + indel # deletions
for j in range(1, n+1):
dp[0][j] = dp[0][j-1] + indel # insertions
# Fill DP table: O(m × n)
for i in range(1, m+1):
for j in range(1, n+1):
# Three operations:
# 1. Substitute: dp[i-1][j-1] + sm[seq1[i-1]][seq2[j-1]]
# 2. Delete: dp[i-1][j] + indel
# 3. Insert: dp[i][j-1] + indel
dp[i][j] = min(
dp[i-1][j-1] + sm[seq1[i-1]][seq2[j-1]], # substitution
dp[i-1][j] + indel, # deletion
dp[i][j-1] + indel # insertion
)
return dp[m][n]Why O(m₁ × m₂)?
- We fill a DP table of size (m+1) × (n+1)
- Each cell requires O(1) computation (three comparisons)
- Total: (m+1) × (n+1) × O(1) = O(m × n)
Real Implementation (from OMdistance.cpp):
The actual implementation includes optimizations:
- Common prefix/suffix skipping: Reduces effective m and n
- Two-row DP: Uses O(min(m,n)) space instead of O(m×n) by keeping only previous and current rows
- State-dependent indel costs: Adds O(1) overhead per cell
double compute_distance(int is, int js, double* prev, double* curr) {
// ... prefix/suffix skipping ...
// Fill DP table: O(m × n) where m, n are effective lengths
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
// Three-way minimum computation
double sub_cost = prev[j-1] + get_substitution_cost(...);
double del_cost = prev[j] + get_indel(...);
double ins_cost = curr[j-1] + get_indel(...);
curr[j] = std::min({sub_cost, del_cost, ins_cost});
}
// Swap prev and curr
}
return normalize_distance(...);
}Longest Common Subsequence (LCS)
LCS uses the same DP structure but simpler costs:
# Pseudocode for LCS
def compute_lcs(seq1, seq2):
m, n = len(seq1), len(seq2)
dp = [[0] * (n+1) for _ in range(m+1)]
# O(m × n) DP table fill
for i in range(1, m+1):
for j in range(1, n+1):
if seq1[i-1] == seq2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1 # match
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1]) # skip one
# Distance = m + n - 2 * LCS_length
return m + n - 2 * dp[m][n]Complexity: O(m₁ × m₂) - same as OM, but slightly faster due to simpler operations (no substitution cost lookup).
Real Implementation (from LCSdistance.cpp):
int compute_LCS_length(int is, int js) {
const int m = ptr_len(is);
const int n = ptr_len(js);
if (m == 0 || n == 0) return 0;
// Two-row DP: O(min(m,n)) space
std::vector<int> prev(n + 1, 0);
std::vector<int> curr(n + 1, 0);
// O(m × n) time
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (si == sj) {
curr[j] = 1 + prev[j - 1];
} else {
curr[j] = std::max(prev[j], curr[j - 1]);
}
}
prev = curr; // Move to next row
}
return prev[n];
}Time Warp Edit Distance (TWED)
TWED extends OM with time warping, adding a stiffness parameter:
Complexity: O(m₁ × m₂) - same DP structure as OM, but with additional time warping calculations per cell.
Real Implementation (from TWEDdistance.cpp):
The implementation follows the same DP pattern but includes:
- Time warping constraints
- Stiffness parameter (ν)
- Gap penalty (λ)
Performance Characteristics (for all DP-based measures):
- Best case: O(min(m₁, m₂)) with aggressive prefix/suffix skipping
- Average case: O(m₁ × m₂)
- Worst case: O(m₁ × m₂) when sequences are completely different
- Space: Can be optimized to O(min(m₁, m₂)) using two-row DP
When to Use:
- OM: Most flexible, handles insertions, deletions, substitutions
- LCS: Faster than OM when you only care about subsequence matching
- TWED: When temporal alignment matters (e.g., time series)
Category 5: Spell-Based Measures
Algorithm: Convert sequences to spell representation (consecutive state occurrences), then compare spells.
Complexity Analysis:
LCPspell / RLCPspell
# Pseudocode for LCPspell
def compute_lcp_spell(seq1_spells, seq2_spells):
"""
seq1_spells: list of (state, duration) tuples
seq2_spells: list of (state, duration) tuples
"""
length = 0
min_spells = min(len(seq1_spells), len(seq2_spells))
# Linear scan: O(min(s₁, s₂)) where s = number of spells
while length < min_spells and seq1_spells[length][0] == seq2_spells[length][0]:
length += 1
return lengthWhy O(min(s₁, s₂))?
- s₁, s₂ are the number of spells (typically ≤ sequence length)
- Same linear scan as LCP, but on spells instead of positions
- Often faster than position-based LCP because s ≤ m
OMspell
# Pseudocode for OMspell
def compute_om_spell(seq1_spells, seq2_spells, sm, indel):
s1, s2 = len(seq1_spells), len(seq2_spells)
# DP table: O(s₁ × s₂)
dp = [[0] * (s2+1) for _ in range(s1+1)]
# Same DP structure as OM, but on spells
for i in range(1, s1+1):
for j in range(1, s2+1):
# Spell substitution, deletion, insertion
dp[i][j] = min(
dp[i-1][j-1] + spell_substitution_cost(...),
dp[i-1][j] + spell_deletion_cost(...),
dp[i][j-1] + spell_insertion_cost(...)
)
return dp[s1][s2]Why O(s₁ × s₂)?
- Same DP structure as OM, but operating on spells
- Typically faster than OM because s ≤ m (fewer spells than positions)
Performance Characteristics:
- Spell conversion: O(m) one-time cost per sequence
- LCPspell: O(min(s₁, s₂)) per pair - very fast
- OMspell: O(s₁ × s₂) per pair - faster than OM when s << m
When to Use:
- When spell structure matters more than exact positions
- Often faster than position-based measures
- Good for sequences with long runs of the same state
Full Distance Matrix Complexity
When computing a full distance matrix for n sequences, you need to compute n(n-1)/2 pairwise distances (since the matrix is symmetric).
General Formula
For an algorithm with per-pair complexity O(f(m)), the full matrix complexity is:
O(n² × f(m))
Examples
LCP: O(n² × m)
- n² pairs × O(m) per pair
OM: O(n² × m²)
- n² pairs × O(m²) per pair
CHI2: O(n² × k + n × m × k)
- n² pairs × O(k) per pair + O(n × m × k) preprocessing
Parallelization
Many algorithms can be parallelized:
- Independent pairs: Each pairwise comparison is independent
- Parallel speedup: Up to P× faster with P processors
- Complexity remains: O(n² × f(m) / P) with P processors
Practical Considerations
Memory Usage
Space complexity matters as much as time complexity:
| Measure | Space for Full Matrix | Additional Space |
|---|---|---|
| LCP/RLCP | O(n²) | O(1) |
| HAM | O(n²) | O(1) |
| CHI2/EUCLID | O(n²) | O(n × k) for feature matrix |
| OM/LCS/TWED | O(n²) | O(m²) for DP table (can be O(m)) |
| Spell-based | O(n²) | O(s²) for DP table |
Memory optimization:
- Use
full_matrix=Falseto get adistobject (stores only lower triangle) - Reduces memory by ~50%
Optimization Techniques
Prefix/Suffix Skipping: OM variants skip common prefixes/suffixes
- Reduces effective m, n
- Can provide 2-10× speedup for similar sequences
Early Termination: LCP stops at first mismatch
- Best case: O(1) instead of O(m)
Two-Row DP: OM/LCS use O(m) space instead of O(m²)
- Enables processing longer sequences
Vectorization: C++ implementations use SIMD (xsimd) for parallel operations
- 2-4× speedup on modern CPUs
Choosing the Right Measure
For small datasets (n < 100):
- Any measure works
- Choose based on interpretability
For medium datasets (100 < n < 1,000):
- Prefer LCP/RLCP, HAM, or CHI2/EUCLID
- OM variants are acceptable but slower
For large datasets (n > 1,000):
- Strongly prefer: LCP/RLCP, CHI2/EUCLID
- Avoid: Full OM matrix computation (consider reference sequences instead)
- Consider: Parallelization, distributed computing
For very long sequences (m > 100):
- Prefer spell-based measures (s << m)
- Consider CHI2/EUCLID (k << m)
Code Examples
Example 1: Comparing Complexity in Practice
from sequenzo import SequenceData, get_distance_matrix
import time
# Load your sequences
seqdata = SequenceData.from_dataframe(df, id_col='id', state_cols=['t1', 't2', ...])
# Fast: LCP (O(n² × m))
start = time.time()
dist_lcp = get_distance_matrix(seqdata, method="LCP", norm="auto")
time_lcp = time.time() - start
print(f"LCP took {time_lcp:.2f} seconds")
# Moderate: CHI2 (O(n² × k))
start = time.time()
dist_chi2 = get_distance_matrix(seqdata, method="CHI2", norm="auto")
time_chi2 = time.time() - start
print(f"CHI2 took {time_chi2:.2f} seconds")
# Slower: OM (O(n² × m²))
start = time.time()
dist_om = get_distance_matrix(seqdata, method="OM", sm="TRATE", norm="auto")
time_om = time.time() - start
print(f"OM took {time_om:.2f} seconds")
print(f"\nSpeedup: LCP is {time_om/time_lcp:.1f}× faster than OM")Example 2: Estimating Runtime
def estimate_runtime(n_seqs, avg_seq_len, method):
"""
Rough runtime estimates (in seconds) for different methods.
These are approximations - actual runtime depends on hardware,
sequence similarity, and implementation optimizations.
"""
n_pairs = n_seqs * (n_seqs - 1) / 2
# Per-pair time estimates (microseconds) on modern hardware
estimates = {
"LCP": 0.1 * avg_seq_len, # O(m)
"HAM": 0.1 * avg_seq_len, # O(m)
"CHI2": 0.05 * 10, # O(k), assume k≈10
"LCS": 0.5 * avg_seq_len ** 2, # O(m²)
"OM": 1.0 * avg_seq_len ** 2, # O(m²), more complex
}
per_pair_us = estimates.get(method, 1.0 * avg_seq_len ** 2)
total_seconds = (n_pairs * per_pair_us) / 1_000_000
return total_seconds
# Estimate for 1000 sequences of length 50
n = 1000
m = 50
print(f"Estimated runtime for {n} sequences (avg length {m}):")
for method in ["LCP", "CHI2", "LCS", "OM"]:
est = estimate_runtime(n, m, method)
print(f" {method:4s}: {est:6.1f} seconds ({est/60:5.1f} minutes)")Example 3: Using Reference Sequences for Large Datasets
Instead of computing the full matrix, compute distances to a reference:
# Instead of full matrix: O(n² × m²)
# dist_matrix = get_distance_matrix(seqdata, method="OM", ...) # Slow!
# Use reference sequence: O(n × m²)
dist_to_ref = get_distance_matrix(
seqdata,
method="OM",
refseq=0, # Compare all sequences to first sequence
sm="TRATE",
norm="auto"
)
# Much faster for large n!Summary
Understanding computational complexity helps you:
- Choose appropriate algorithms for your dataset size
- Estimate runtime before starting computations
- Optimize workflows by selecting faster alternatives
- Plan infrastructure (parallelization, cloud resources)
Key Takeaways:
- Fastest: LCP/RLCP, HAM (O(m) per pair)
- Moderate: CHI2/EUCLID (O(k) per pair, often k << m)
- Slower: OM, LCS, TWED (O(m²) per pair)
- Spell-based: Often faster than position-based (s ≤ m)
For big data: Prefer LCP/RLCP or CHI2/EUCLID, or use reference sequences instead of full matrices.
Further Reading
- Algorithm Design: Cormen, Leiserson, Rivest, & Stein - "Introduction to Algorithms"
- Sequence Analysis: Gabadinho et al. (2011) - "Analyzing and Visualizing State Sequences in R with TraMineR"
- Complexity Theory: Sipser - "Introduction to the Theory of Computation"
For implementation details, see the source code in:
/sequenzo/dissimilarity_measures/src/(C++ implementations)/sequenzo/dissimilarity_measures/get_distance_matrix.py(Python interface)
Author: Yuqi Liang