get_distance_matrix()

Overview

The get_distance_matrix() function is the heart of sequence comparison in Sequenzo. It takes categorical sequences (careers, family trajectories, health states, etc.) and produces a distance matrix, a table of numbers that quantify how different each sequence is from every other sequence.

Input: A SequenceData object (your sequences).
Output: A distance matrix. By default, it is an n×n DataFrame, where n is the number of sequences.

This matrix is the starting point for clustering (typologies), visualization, and regression on sequence data.
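
To make the idea of a distance matrix concrete, here is a minimal pure-Python sketch. It does not call Sequenzo: the toy sequences and the helper `hamming` are illustrative assumptions, and a simple mismatch count stands in for whichever measure you eventually choose.

```python
# Toy illustration only: a mismatch count stands in for a real dissimilarity measure.
sequences = {
    "id1": ["EDU", "EDU", "JOB", "JOB"],
    "id2": ["EDU", "JOB", "JOB", "JOB"],
    "id3": ["JOB", "JOB", "JOB", "JOB"],
}

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

ids = list(sequences)
# n x n table: entry [i][j] is the distance between sequence i and sequence j.
matrix = [[hamming(sequences[i], sequences[j]) for j in ids] for i in ids]

for row_id, row in zip(ids, matrix):
    print(row_id, row)
```

The result is symmetric with zeros on the diagonal, which is the shape of table that clustering and visualization tools expect.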

💡 New to sequence analysis? Read the dissimilarity measures guide first. It explains when to use each measure in plain language.


Architecture: Method Families at a Glance

Supported methods fall into several families. Each family answers a slightly different question about "how different" two sequences are:

| Family | Methods | Focus |
| --- | --- | --- |
| Edit-based (OM family) | OM, OMspell, OMspellRS, OMloc, OMslen, OMtspell, OMstran, TWED | Insert/delete/substitute operations; sequencing and timing |
| Positionwise | HAM, DHD | Position-by-position comparison; equal length required |
| Subsequence matching | LCS, NMS, NMSMST, SVRspell | Longest common subsequence; ordered matching |
| Prefix-based | LCP, RLCP, LCPspell, RLCPspell, LCPmst, RLCPmst, LCPprod, RLCPprod | Common prefix (or suffix); early vs. late path similarity |
| Distribution-based | CHI2, EUCLID | Time spent in each state; ignores order |

Method Cheatsheet: When to Use What

| Method | When to use | Notes |
| --- | --- | --- |
| OM | General-purpose; you want a balance of sequencing and timing | Needs sm and optionally indel. Safe default: sm="CONSTANT", indel=1. |
| OMspell | Durations matter; TraMineR-style spell OM | Operates on spells (runs); use expcost (≥ 0; 0 ignores duration). |
| OMspellRS | Spell OM with reference-scaled durations | Like OMspell, but duration terms are divided by duration_ref (τ) before expcost is applied; default τ = observation window T. |
| OMloc | Local context matters (neighboring states affect cost) | Uses context and expcost. |
| OMslen | Spell length affects substitution cost | Uses link and h. |
| OMstran | You care about transitions (state changes) rather than raw states | Compares sequences of transitions. |
| TWED | Time-warped edit distance; elasticity in time | Requires nu (stiffness). |
| HAM | Strict positionwise; equal-length sequences | If sm not given, uses constant cost 1. |
| DHD | Like HAM but costs vary by position (early vs. late) | sm="TRATE" by default; builds time-varying costs. |
| LCP / RLCP | Emphasize early-path (LCP) or late-path (RLCP) similarity | No sm or indel needed. |
| LCPspell / RLCPspell | Same as LCP/RLCP but spell-aware | Use expcost (≥ 0) and optional duration_ref (default T). |
| LCS | Longest common subsequence; order matters, timing relaxed | No substitution costs. |
| NMS / NMSMST / SVRspell | Count matching subsequences; SVRspell adds spell weighting | More exhaustive than LCS. |
| CHI2 / EUCLID | Compare "time budgets" in each state; ignore order | Distribution-based; norm can only be "auto" or "none". |
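
As a rough illustration of how sm and indel interact in the OM row above, the following sketch implements textbook optimal matching with a constant substitution cost. It is not Sequenzo's C++ implementation; the function name `om_distance` and the toy sequences are assumptions made for the example.

```python
def om_distance(a, b, sub_cost=2.0, indel=1.0):
    """Textbook optimal matching (edit distance) with a constant substitution cost."""
    n, m = len(a), len(b)
    # dp[i][j] = minimal cost of turning a[:i] into b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            dp[i][j] = min(
                dp[i - 1][j] + indel,    # delete a[i-1]
                dp[i][j - 1] + indel,    # insert b[j-1]
                dp[i - 1][j - 1] + sub,  # match or substitute
            )
    return dp[n][m]

x = list("AABBC")
y = list("ABBCC")
cheap_indel = om_distance(x, y)               # sub_cost=2, indel=1
costly_indel = om_distance(x, y, indel=5.0)   # indels too expensive to use
print(cheap_indel, costly_indel)
```

With cheap indels the two sequences are aligned by shifting (one deletion plus one insertion). With expensive indels the algorithm falls back on position-by-position substitutions, which is why a higher indel emphasizes timing.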

Function Signature

```python
get_distance_matrix(
    seqdata,              # required: SequenceData object
    method,               # required: one of the methods above
    refseq=None,          # optional: int (index) or [idx_list_A, idx_list_B]
    norm="none",          # optional: "auto", "none", "maxlength", "gmean", "maxdist", "YujianBo", "ElzingaStuder"
    indel="auto",         # for OM family: number | vector | "auto"
    sm=None,              # substitution costs: str or matrix (see below)
    full_matrix=True,     # True: n×n; False: condensed 1D for clustering
    tpow=1.0,             # OMspell, OMspellRS, NMSMST, SVRspell: spell-length exponent
    expcost=0.5,          # OMloc, OMspell, OMspellRS, LCPspell, RLCPspell: duration weight (≥ 0)
    duration_ref=None,    # OMspellRS, LCPspell, RLCPspell: reference scale τ (default: observation window T)
    weighted=True,        # use sequence weights when building sm
    check_max_size=True,  # safety check for large datasets
    matrix_display="full",  # "full" | "upper" | "lower" (display only)
    opts=None,            # pass parameters as a dict
    **kwargs              # method-specific (context, nu, link, h, euclid_backend, normalization_reference_index, ...)
)
```

Tip: You rarely need all parameters. Pick a method, set sm/indel if required, and use norm="auto" — the function will choose sensible defaults.


Parameters in Detail

Common to All Methods

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| seqdata | ✓ | SequenceData | Your state-sequence object. |
| method | ✓ | str | One of: OM, OMspell, OMspellRS, OMloc, OMslen, OMtspell, OMstran, TWED, HAM, DHD, LCS, LCP, RLCP, LCPspell, RLCPspell, LCPmst, RLCPmst, LCPprod, RLCPprod, NMS, NMSMST, SVRspell, CHI2, EUCLID. |
| refseq | | int or list | int: index of a reference sequence; returns the distances from all sequences to this one. list [A, B]: two index lists; returns a len(A)×len(B) DataFrame. |
| norm | | str | "auto", "none", "maxlength", "gmean", "maxdist", "YujianBo", "ElzingaStuder". "auto" picks a sensible default per method. CHI2/EUCLID only accept "auto" or "none". "ElzingaStuder" applies post-hoc theoretical normalization (Elzinga & Studer, 2019); see normalizing sequences. Not compatible with refseq=[A, B]. |
| full_matrix | | bool | If True (and refseq=None): return full n×n DataFrame. If False: return condensed 1D array (for clustering). Ignored when refseq is provided. |
| weighted | | bool | When building sm from data (e.g., "TRATE"), respect sequence weights. |
| check_max_size | | bool | Safety check against too many unique sequences. |
| matrix_display | | str | When the result is a full matrix: "full" (default), "upper", or "lower". Affects display only; underlying distances unchanged. |
| opts | | dict | Pass parameters as a bundle. |
| **kwargs | | | with_missing is ignored (missing values are always included). normalization_reference_index (int): reference object for norm="ElzingaStuder" (default: refseq if single index, else 0). euclid_backend ("auto", "categorical", "dense"): EUCLID implementation. tokdep_coeff: switches OMspell → OMtspell. OMstran: transindel, otto, previous, add_column. |

Edit-based Measures: OM, OMspell, OMspellRS, OMloc, OMslen, OMstran, TWED

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| sm | ✓ for OM, OMspell, OMspellRS, OMloc, OMslen, OMstran, TWED | str or matrix | Substitution costs. str: "TRATE", "CONSTANT", "INDELS", "INDELSLOG", "FUTURE", "FEATURES". matrix: custom square matrix (states×states). For DHD: 3D array (time-varying). |
| indel | | number, vector, or "auto" | Insertion/deletion cost(s). Default "auto": the function derives indel from sm automatically (e.g., half of the maximum substitution cost when using "TRATE"), so you do not need to specify both sm and indel manually. A vector passed explicitly must have one entry per state (including missing). TWED with matrix sm: indel="auto" uses 2*max(sm)+nu+h. |
| tpow | | float | OMspell, OMspellRS, NMSMST, SVRspell: spell-length exponent (default 1.0). |
| expcost | | float | Duration weight λ (default 0.5, ≥ 0). For OMspell, OMspellRS, LCPspell, RLCPspell: expcost=0 removes duration-related terms. OMloc also uses expcost (with additional constraints on context). Higher λ = stronger duration sensitivity. |
| duration_ref | | float | Reference scale τ for OMspellRS, LCPspell, RLCPspell. Default: observation window T (number of time positions in seqdata). Must be positive and fixed before computation. Scales duration differences as proportions of the study window. |
| context | | float | OMloc only: local context (default 1 - 2*expcost). |
| link, h | | str, float | OMslen only: link in ["mean","gmean"], h ≥ 0. |
| nu, h | | float | TWED only: nu (stiffness) required, h (gap penalty) default 0.5. |
| tokdep_coeff | | array | OMtspell: token-dependent coefficients (switches from OMspell when provided). |

OMspell vs. OMspellRS: OMspell uses TraMineR-style spell costs. OMspellRS divides duration terms by τ (duration_ref) before applying λ (expcost), e.g. indel/del (d-1)/τ, same-state sub |d_i-d_j|/τ, different-state sub σ(i,j) + (d_i+d_j-2)/τ. Prefer OMspellRS when you want duration penalties expressed relative to a fixed observation window.
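
The sketch below shows what "operating on spells" means and evaluates the τ-scaled duration terms quoted above for one pair of spells. It is only an illustration: `to_spells` and `scaled_duration_terms` are hypothetical helpers, the terms are shown before the weight λ (expcost) is applied, and how Sequenzo combines them internally is not reproduced here.

```python
from itertools import groupby

def to_spells(seq):
    """Run-length encode a state sequence into (state, duration) spells."""
    return [(state, len(list(run))) for state, run in groupby(seq)]

def scaled_duration_terms(d_i, d_j, tau, sigma_ij):
    """Duration terms quoted in the text, before lambda (expcost) is applied."""
    return {
        "indel_i": (d_i - 1) / tau,                          # insert/delete a spell of length d_i
        "sub_same_state": abs(d_i - d_j) / tau,              # same state, different durations
        "sub_diff_state": sigma_ij + (d_i + d_j - 2) / tau,  # different states
    }

seq = ["EDU"] * 4 + ["JOB"] * 6
spells = to_spells(seq)   # [("EDU", 4), ("JOB", 6)]
tau = len(seq)            # default reference scale: the observation window T
terms = scaled_duration_terms(d_i=4, d_j=6, tau=tau, sigma_ij=2.0)

for name, value in terms.items():
    print(name, round(value, 3))
```

Because every duration term is divided by τ, a two-period difference counts for less in a 40-period study than in a 10-period one.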

OMspell / OMspellRS practical tips:

| Parameter | Typical range | Advice |
| --- | --- | --- |
| expcost | 0, 0.1, 0.5, 1 | 0 = ignore durations; 0.1–0.5 = moderate; 1 = strong duration sensitivity. |
| tpow | 0.5–2 | 1.0 = linear; <1 = downweight long spells; >1 = amplify long spells. |
| indel | 1–5 | Higher = emphasize timing; lower = allow more shifting. |
| sm | "TRATE" or "CONSTANT" | "TRATE" = data-driven; "CONSTANT" (cval=2) = baseline. |
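
To show what "data-driven" means for sm="TRATE", the following sketch derives substitution costs from observed transition rates using the TraMineR-style formula cost(a, b) = cval - p(a→b) - p(b→a), and then takes half of the maximum cost as indel. Whether Sequenzo computes "TRATE" in exactly this way (for example with weights or missing states) is an assumption; `trate_costs` is a hypothetical helper.

```python
from collections import Counter

def trate_costs(sequences, cval=2.0):
    """Substitution costs from transition rates: cval - p(a->b) - p(b->a)."""
    states = sorted({s for seq in sequences for s in seq})
    transitions = Counter()  # (from_state, to_state) -> count
    origins = Counter()      # from_state -> number of observed transitions out of it
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            transitions[(a, b)] += 1
            origins[a] += 1

    def rate(a, b):
        return transitions[(a, b)] / origins[a] if origins[a] else 0.0

    sm = {
        a: {b: 0.0 if a == b else cval - rate(a, b) - rate(b, a) for b in states}
        for a in states
    }
    return states, sm

toy = [list("AABB"), list("ABBB"), list("AAAB")]
states, sm = trate_costs(toy)
# One reading of indel="auto": half of the largest substitution cost.
indel = max(max(row.values()) for row in sm.values()) / 2
print(sm["A"]["B"], indel)
```

States that frequently follow one another end up cheaper to substitute, while states with no observed transitions keep the full cost cval.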

Positionwise Measures: HAM, DHD

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| sm | ✗ for HAM, ✓ for DHD | str or matrix | HAM: If not specified, uses constant cost 1. DHD: "TRATE" or 3D array (time-varying). Note: "CONSTANT" not applicable for DHD. |

Note: HAM and DHD require equal-length sequences.

Prefix-based Measures: LCP, RLCP, LCPspell, RLCPspell, LCPmst, RLCPmst, LCPprod, RLCPprod

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| norm | | str | "auto" → "gmean" for LCP/RLCP; "maxdist" for LCPspell/RLCPspell, LCPmst/RLCPmst; "none" for LCPprod/RLCPprod. |
| expcost | | float | LCPspell/RLCPspell only: duration weight (≥ 0; 0 ignores duration). |
| duration_ref | | float | LCPspell/RLCPspell only: τ (default T). Same interpretation as for OMspellRS. |
| durations | | array | LCPmst, RLCPmst, LCPprod, RLCPprod: position-wise durations (default 1.0). |

Note: No sm or indel needed for prefix-based measures.
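
The following sketch illustrates why LCP emphasizes early-path similarity and RLCP late-path similarity. It uses the standard raw definition (total length minus twice the length of the shared prefix) without any normalization; the helper names are assumptions for the example and this is not Sequenzo's implementation.

```python
def common_prefix_length(a, b):
    """Number of leading positions on which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lcp_distance(a, b):
    """Raw LCP distance: total length minus twice the shared prefix."""
    return len(a) + len(b) - 2 * common_prefix_length(a, b)

def rlcp_distance(a, b):
    """Reversed LCP: the same idea measured from the end (common suffix)."""
    return lcp_distance(a[::-1], b[::-1])

early_same = (list("AABBC"), list("AABCC"))  # agree at the start
late_same = (list("ABBCC"), list("BBBCC"))   # agree at the end

print(lcp_distance(*early_same), rlcp_distance(*early_same))
print(lcp_distance(*late_same), rlcp_distance(*late_same))
```

The first pair is close under LCP and far under RLCP; the second pair shows the opposite pattern.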

Distribution-based: CHI2, EUCLID

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| norm | | str | Only "auto" or "none". |
| step, breaks, overlap | | int, array, bool | Optional time-window controls. |
| euclid_backend | | str | EUCLID only: "auto" (default), "categorical" (fast C++ path when data are complete equal-length sequences with step=1, no custom breaks, no overlap, no missing values), or "dense" (portable CHI2-style backend). |
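
A small sketch of the "time budget" idea behind distribution-based measures: each sequence is reduced to the share of time spent in each state, and the shares are compared with a Euclidean distance. This uses a single window over the whole sequence; Sequenzo's step/breaks/overlap windowing and its internal scaling are not reproduced, and the helper names are assumptions.

```python
from collections import Counter
from math import sqrt

def state_shares(seq, states):
    """Proportion of positions spent in each state."""
    counts = Counter(seq)
    return [counts[s] / len(seq) for s in states]

def euclid_shares(a, b, states):
    """Euclidean distance between two state-share vectors."""
    pa, pb = state_shares(a, states), state_shares(b, states)
    return sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))

states = ["A", "B", "C"]
x = list("AABBCC")
y = list("CCBBAA")  # same time budget as x, reversed order
z = list("AAAABB")  # different time budget

d_xy = euclid_shares(x, y, states)
d_xz = euclid_shares(x, z, states)
print(d_xy, round(d_xz, 4))
```

Sequences x and y are indistinguishable here even though their order is reversed, which is exactly what "ignores order" means.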

Default Normalization per Method (norm="auto")

| Method | Default norm |
| --- | --- |
| OM, HAM, DHD | "maxlength" |
| LCS, LCP, RLCP | "gmean" |
| LCPspell, RLCPspell, LCPmst, RLCPmst | "maxdist" |
| LCPprod, RLCPprod | "none" (raw distance; normalized values may be unstable) |
| OMloc, OMslen, OMspell, OMspellRS, OMtspell, OMstran, TWED, NMS, NMSMST, SVRspell | "YujianBo" |
| CHI2, EUCLID | Uses internal normalization (sqrt of n_breaks) |
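
For orientation, the sketch below writes out one common set of definitions for four of these normalizations, following the TraMineR convention. Treat it as an assumption rather than a description of Sequenzo's backend: the function `normalize`, its arguments, and the example value of maxdist are all illustrative.

```python
from math import sqrt

def normalize(raw, maxdist, len_a, len_b, norm):
    """Normalize a raw distance (formulas follow the TraMineR convention, assumed)."""
    if raw == 0 or norm == "none":
        return raw
    if norm == "maxlength":
        return raw / max(len_a, len_b)
    if norm == "gmean":
        if len_a * len_b == 0:
            return 1.0  # one sequence is empty, the other is not
        return 1 - (maxdist - raw) / (2 * sqrt(len_a * len_b))
    if norm == "maxdist":
        return raw / maxdist
    if norm == "YujianBo":
        return 2 * raw / (raw + maxdist)
    raise ValueError(f"unknown norm: {norm}")

# Assumed example: raw distance 2.0 between two sequences of length 5,
# with a maximum possible distance of 10.0.
results = {
    norm: normalize(2.0, 10.0, 5, 5, norm)
    for norm in ["none", "maxlength", "gmean", "maxdist", "YujianBo"]
}
for norm, value in results.items():
    print(norm, round(value, 3))
```

The same raw distance can land at quite different values depending on the normalization, so distances computed with different norm settings should not be compared directly.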

What the Function Does (Internal Steps)

  1. Validates inputs: checks seqdata, method, and method-specific arguments.
  2. Builds substitution and indel costs: from sm (e.g., "TRATE", "CONSTANT") or your custom matrix. If indel="auto", derives indel from sm.
  3. Normalizes: resolves norm="auto" to the method's default and applies the chosen normalization during the C++ computation. norm="ElzingaStuder" is applied afterward on the full matrix (or condensed vector).
  4. Deduplicates: compresses to unique sequences for faster C++ computation, then expands to the requested output shape.
  5. Computes distances: uses the compiled C++ backend (Python fallback for some CHI2/EUCLID cases).
  6. Handles edge cases: empty sequences trigger a warning (an error for OMloc). If refseq is provided together with full_matrix=False, the function returns the full table and prints an informational message.
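
The deduplication step can be sketched in a few lines of plain Python: compute pairwise distances only among distinct sequences, then map the result back to the original rows. This is an illustration of the idea, not Sequenzo's code; `distance_matrix_with_dedup` and `mismatches` are hypothetical helpers.

```python
def distance_matrix_with_dedup(sequences, dist):
    """Compute pairwise distances on unique sequences only, then expand to n x n."""
    unique = []    # distinct sequences, in order of first appearance
    index_of = {}  # sequence (as tuple) -> position in `unique`
    mapping = []   # for each input sequence, its position in `unique`
    for seq in sequences:
        key = tuple(seq)
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(key)
        mapping.append(index_of[key])

    u = len(unique)
    small = [[0.0] * u for _ in range(u)]
    for i in range(u):
        for j in range(i + 1, u):
            small[i][j] = small[j][i] = dist(unique[i], unique[j])

    # Expand: every original row looks up the distance of its unique representative.
    full = [[small[a][b] for b in mapping] for a in mapping]
    return full, u

def mismatches(a, b):
    """Toy distance: number of differing positions."""
    return float(sum(x != y for x, y in zip(a, b)))

data = ["AAB", "ABB", "AAB", "AAB"]
full, n_unique = distance_matrix_with_dedup(data, mismatches)
print(n_unique, full)
```

With four input sequences but only two distinct ones, a single distance is actually computed; the saving grows quickly when many individuals share the same trajectory.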

Examples

1) OM with transition-rate costs (general default)

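```python
# Assumes get_distance_matrix is importable from the top-level sequenzo package
# and that sequence_data is an existing SequenceData object.
from sequenzo import get_distance_matrix

om = get_distance_matrix(
    seqdata=sequence_data,
    method="OM",
    sm="TRATE",
    indel="auto",
    norm="auto",
    full_matrix=True
)
```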

2) OM with constant costs (safe baseline)

```python
om = get_distance_matrix(
    seqdata=sequence_data,
    method="OM",
    sm="CONSTANT",   # substitution cost = 2
    indel=1,
    norm="auto"
)
```

3) OMspell (durations matter)

```python
omspell = get_distance_matrix(
    seqdata=sequence_data,
    method="OMspell",
    sm="TRATE",
    indel="auto",
    tpow=1.0,
    expcost=0.5,
    norm="auto"
)
```

3b) OMspellRS (reference-scaled spell durations)

```python
omspell_rs = get_distance_matrix(
    seqdata=sequence_data,
    method="OMspellRS",
    sm="TRATE",
    indel="auto",
    expcost=0.5,
    duration_ref=None,   # default: observation window T (e.g. 20 years)
    norm="auto"
)
```

4) HAM (equal-length sequences)

```python
ham = get_distance_matrix(
    seqdata=sequence_data_equal_length,
    method="HAM",
    norm="auto"   # sm auto-generated with constant cost 1 if not specified
)
```

5) DHD (time-varying costs)

```python
dhd = get_distance_matrix(
    seqdata=sequence_data_equal_length,
    method="DHD",
    sm="TRATE",
    norm="auto"
)
```

6) LCP and RLCP

```python
lcp = get_distance_matrix(seqdata=sequence_data, method="LCP", norm="auto")
rlcp = get_distance_matrix(seqdata=sequence_data, method="RLCP", norm="auto")
```

7) Distances between two groups

```python
idxs_A = list(range(0, 100))
idxs_B = [10, 50, 250, 400]
ab = get_distance_matrix(
    seqdata=sequence_data,
    method="OM",
    sm="TRATE",
    refseq=[idxs_A, idxs_B]   # returns |A|×|B| DataFrame
)
```

8) Condensed matrix (for clustering, saves memory)

```python
reduced = get_distance_matrix(
    seqdata=sequence_data,
    method="OM",
    sm="TRATE",
    full_matrix=False   # returns 1D condensed array (scipy squareform format)
)
```
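
If the condensed layout is unfamiliar, the pure-Python sketch below shows how a pair (i, j) maps to a position in the 1D vector and how the square matrix can be rebuilt. It follows the ordering used by scipy's squareform format; in practice `scipy.spatial.distance.squareform` performs the same conversion on arrays. The helper names and the toy values are assumptions for the example.

```python
def condensed_index(i, j, n):
    """Position of pair (i, j), i != j, in a condensed distance vector of n objects."""
    if i == j:
        raise ValueError("the diagonal is not stored in condensed form")
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)

def to_square(condensed, n):
    """Rebuild the symmetric n x n matrix from its condensed upper triangle."""
    square = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            square[i][j] = square[j][i] = condensed[condensed_index(i, j, n)]
    return square

# n = 4 objects -> 4 * 3 / 2 = 6 stored distances, ordered
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
condensed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
square = to_square(condensed, 4)
for row in square:
    print(row)
```

Storing only the upper triangle roughly halves memory use, which matters once the number of sequences reaches the tens of thousands.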

9) Display only upper triangle

```python
om = get_distance_matrix(
    seqdata=sequence_data,
    method="OM",
    sm="TRATE",
    matrix_display="upper"   # cleaner display; distances unchanged
)
```

Tips and Common Pitfalls

  • HAM/DHD: All sequences must have the same length; otherwise you get an explicit error.
  • indel vector: If you pass a vector, its length must match the number of states (including missing).
  • OMspellRS / LCPspell / RLCPspell: Set duration_ref to the study-design observation window T, not the maximum spell duration in the data. If τ is smaller than an observed spell length, normalized distances may exceed their usual bound.
  • LCPspell/RLCPspell: Prefer norm="maxdist" or norm="none"; norm="gmean" can yield distances outside [0, 1].
  • LCPmst/RLCPmst: Prefer norm="maxdist" (auto default) or norm="none"; norm="gmean" can yield distances outside [0, 1] when sequence lengths differ.
  • LCPprod/RLCPprod: Auto default is norm="none"; other normalizations are clamped and may hide instability.
  • CHI2/EUCLID: norm can only be "auto" or "none".
  • ElzingaStuder: Requires a full pairwise matrix (refseq=None); choose normalization_reference_index deliberately (empty sequence, medoid, or index 0). See normalizing sequences.
  • with_missing: This parameter no longer exists; missing values are always included by default.

Return Value

| Condition | Shape | Type |
| --- | --- | --- |
| refseq=None, full_matrix=True | n×n | pandas DataFrame |
| refseq=None, full_matrix=False | Condensed 1D (length u×(u-1)/2, u = unique sequences) | numpy array (scipy squareform format) |
| refseq=[A, B] | len(A)×len(B) | pandas DataFrame |
| refseq=int | n distances | pandas Series |

Row and column labels come from seqdata.ids.


Authors

Code: Xinyi Li, Yuqi Liang

Documentation: Yuqi Liang

Edited by: Yuqi Liang, Yukun Ming

Acknowledgements: We gratefully acknowledge Professor Gilbert Ritschard for his helpful comments and review suggestions.

References

Studer, M., & Ritschard, G. (2016). What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society Series A: Statistics in Society, 179(2), 481-511.

Studer, M., & Ritschard, G. (2014). A comparative review of sequence dissimilarity measures. LIVES Working Papers, 33, 1-47.

Elzinga, C. H. (2007). Sequence analysis: Metric representations of categorical time series. Manuscript, Dept of Social Science Research Methods, Vrije Universiteit, Amsterdam.

Elzinga, C. H., & Liefbroer, A. C. (2007). De-standardization of Family-Life Trajectories of Young Adults: A Cross-National Comparison Using Sequence Analysis: Dé-standardisation des trajectoires de vie familiale des jeunes adultes: comparaison entre pays par analyse séquentielle. European Journal of Population/Revue européenne de démographie, 23(3), 225-250.

Elzinga, C. H., & Studer, M. (2015). Spell sequences, state proximities, and distance metrics. Sociological Methods & Research, 44(1), 3-47.

Biemann, T. (2011). A transition-oriented approach to optimal matching. Sociological Methodology, 41(1), 195-221.

Halpin, B. (2014). Three narratives of sequence analysis. In Advances in sequence analysis: Theory, method, applications (pp. 75-103). Cham: Springer International Publishing.

Hamming, R. W. (1950). Error detecting and error correcting codes. The Bell System Technical Journal, 29(2), 147-160.

Hollister, M. (2009). Is optimal matching suboptimal? Sociological Methods & Research, 38(2), 235-264.

Lesnard, L. (2010). Setting cost in optimal matching to uncover contemporaneous socio-temporal patterns. Sociological Methods & Research, 38(3), 389-419.

Liang, Y., & Meyerhoff-Liang, J. (2026). Measuring Divergence and Convergence in Sequence Analysis: A Spell-based Extension of Longest Common Prefixes. Retrieved from osf.io/preprints/socarxiv/3pyhr_v1.