distance_tree()
distance_tree() builds a regression tree that partitions the sample into subgroups with lower within-node sequence discrepancy. Each split is chosen to maximize pseudo-R² and is kept only when a pseudo-F permutation test supports it.
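To make the split criterion concrete, the sketch below computes an unweighted pseudo-R² and pseudo-F for one candidate grouping, plus a permutation p-value, following the sums-of-squares definitions in Studer et al. (2011). It is an illustration under stated assumptions, not Sequenzo's implementation: the helper names (sum_of_squares, pseudo_r2_f, permutation_pvalue) are hypothetical, weights are ignored, dissimilarities are used unsquared (v = 1), and the p-value uses the common (1 + hits) / (R + 1) convention.

```python
import numpy as np

def sum_of_squares(d):
    """Discrepancy sum of squares: (1/n) * sum over pairs i<j of d_ij."""
    n = d.shape[0]
    # A full symmetric matrix counts each pair twice, hence the factor 2.
    return d.sum() / (2.0 * n)

def pseudo_r2_f(d, groups):
    """Return (pseudo-R2, pseudo-F) for a grouping of the rows of d."""
    d = np.asarray(d, dtype=float)
    groups = np.asarray(groups)
    n = d.shape[0]
    labels = np.unique(groups)
    m = len(labels)
    ss_total = sum_of_squares(d)
    # Within-group sum of squares: one term per group, on its own submatrix.
    ss_within = sum(
        sum_of_squares(d[np.ix_(groups == g, groups == g)]) for g in labels
    )
    ss_between = ss_total - ss_within
    r2 = ss_between / ss_total
    f = (ss_between / (m - 1)) / (ss_within / (n - m))
    return r2, f

def permutation_pvalue(d, groups, R=1000, seed=0):
    """Share of label permutations whose pseudo-F reaches the observed one."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    _, f_obs = pseudo_r2_f(d, groups)
    hits = 0
    for _ in range(R):
        _, f_perm = pseudo_r2_f(d, rng.permutation(groups))
        hits += int(f_perm >= f_obs)
    return (1 + hits) / (R + 1)

# Toy data: two well-separated clusters on a line, absolute differences as distances.
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
toy_dist = np.abs(points[:, None] - points[None, :])
r2, f_stat = pseudo_r2_f(toy_dist, [0, 0, 0, 1, 1, 1])
p_value = permutation_pvalue(toy_dist, [0, 0, 0, 1, 1, 1], R=200)
print(r2, f_stat, p_value)
```

A tree builder in this style would evaluate such a statistic for every candidate binary split of a node, keep the split with the highest pseudo-R², and accept it only if the permutation p-value is small enough.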
Function Usage
distance_tree(
    distance_matrix,
    predictors,
    weights=None,
    min_size=0.05,
    max_depth=5,
    R=1000,
    pval=0.01,
    weight_permutation=None,
    squared=False,
    first_split=None
)
TraMineR Parameter Mapping
- distance_matrix -> TraMineR diss
- predictors -> TraMineR formula predictors
- weights -> TraMineR weights
- min_size -> TraMineR min.size
- max_depth -> TraMineR maxdepth
- R -> TraMineR R
- pval -> TraMineR pval
- weight_permutation -> TraMineR weight.perm
- squared -> TraMineR squared
- first_split -> TraMineR first.split
Entry Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
| distance_matrix | ✓ | np.ndarray / pd.DataFrame | Square symmetric distance matrix with shape (n, n). |
| predictors | ✓ | pd.DataFrame | Covariates with one row per sequence and one column per predictor. |
| weights | ✗ | np.ndarray | Optional sequence weights with shape (n,). If omitted, equal weights are used. |
| min_size | ✗ | float / int | Minimum node size. Values below 1 are treated as a fraction of total weight. Default: 0.05. |
| max_depth | ✗ | int | Maximum tree depth. Default: 5. |
| R | ✗ | int | Number of permutations for split significance. Default: 1000. |
| pval | ✗ | float | Maximum p-value required to keep a split. Default: 0.01. |
| weight_permutation | ✗ | str / None | Permutation mode: "replicate", "diss", "group", or "none". Default: None (resolved to "none" without weights, otherwise "replicate"). |
| squared | ✗ | bool | If True, use exponent v = 2 on dissimilarities before tree fitting. Default: False (v = 1). |
| first_split | ✗ | str | Optional predictor name forced at the root split. |
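The defaulting rules in the table for min_size and weight_permutation can be sketched as follows. The helper resolve_settings is hypothetical and only mirrors the documented behavior (fractional min_size scaled by total weight, None resolved to "none" or "replicate"); the library's internal handling may differ in detail.

```python
import numpy as np

def resolve_settings(n, weights=None, min_size=0.05, weight_permutation=None):
    """Illustrative resolution of documented defaults (not Sequenzo code)."""
    # Equal weights when none are supplied.
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    total_weight = w.sum()
    # Values below 1 are read as a fraction of total weight.
    min_weight = min_size * total_weight if min_size < 1 else float(min_size)
    # None resolves to "none" without weights, otherwise to "replicate".
    if weight_permutation is None:
        mode = "none" if weights is None else "replicate"
    else:
        mode = weight_permutation
    return w, min_weight, mode

_, min_weight, mode = resolve_settings(n=200)
print(min_weight, mode)
```

With survey weights, the same fraction translates into a weighted threshold, so a node with many rows but little total weight can still fall below min_size.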
What It Returns
A dictionary with the fitted tree and supporting metadata.
| Key | Type | Description |
|---|---|---|
| root | DissTreeNode | Root node of the fitted tree. |
| fitted | pd.DataFrame | Leaf membership for each sequence in column (fitted). |
| info | dict | Method name, sample size, tree parameters, global adjustment statistics, and permutation settings. |
| data | pd.DataFrame | Copy of the predictor data frame used for fitting. |
| weights | np.ndarray | Weights used during fitting. |
The info["adjustment"] entry stores a global single_factor_association() result computed on the final leaf labels. Use it as a compact summary of how well the tree partitions total discrepancy.
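As a small illustration of working with the returned dictionary, the sketch below counts sequences per leaf from the fitted entry. It relies only on the key and column names listed in the table above; the helper leaf_sizes and the stand-in dictionary are hypothetical, and the stand-in merely mimics the documented layout so the snippet runs without fitting a tree.

```python
import pandas as pd

def leaf_sizes(result):
    """Count sequences per leaf using the documented 'fitted' data frame."""
    return result["fitted"]["(fitted)"].value_counts().sort_index()

# Stand-in for a distance_tree() result, for illustration only.
mock_result = {
    "fitted": pd.DataFrame({"(fitted)": [2, 2, 3, 3, 3]}),
}

sizes = leaf_sizes(mock_result)
print(sizes)
```

With a real result, passing the dictionary returned by distance_tree() in place of mock_result should give the leaf sizes directly, assuming the column is named as documented.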
Examples
Step 1: Compute a distance matrix
import pandas as pd
from sequenzo import SequenceData, load_dataset
from sequenzo.dissimilarity_measures import get_distance_matrix
df = load_dataset("mvad")
time_list = [c for c in df.columns if str(c).isdigit()]
seqdata = SequenceData(df, time=time_list, states=sorted(df[time_list].stack().unique()))
dist = get_distance_matrix(seqdata=seqdata, method="LCS", norm="auto")
Step 2: Prepare predictors
predictors = df[["male", "fmpr", "emp97"]].copy()
Step 3: Fit the tree
from sequenzo.discrepancy_analysis import distance_tree
tree = distance_tree(
    distance_matrix=dist,
    predictors=predictors,
    R=1000,
    pval=0.05,
    max_depth=4,
)
Step 4: Inspect leaves and rules
from sequenzo.discrepancy_analysis import (
    get_leaf_membership,
    get_classification_rules,
    print_tree,
    plot_tree,
)
print_tree(tree)
leaf_ids = get_leaf_membership(tree)
rules = get_classification_rules(tree)
plot_tree(tree, filename="distance_tree.png")
R Counterpart
- Closest R function: disstree
- Mapping note: Sequenzo uses the same pseudo-R² split criterion, medoid labeling, and permutation-gated splitting strategy as the TraMineR distance-tree workflow.
Notes
- predictors must contain exactly one row per sequence represented in distance_matrix.
- If R <= 1, split retention is effectively permissive because no permutation threshold is applied.
- min_size is interpreted on total weight, not only raw row count.
- The tree is binary because pseudo-R² does not penalize the number of groups.
- Use export_tree_to_dot() when you need a Graphviz representation.
- Use assign_to_leaves() when you want to classify new rows with the fitted tree rules.
- By default, Sequenzo uses nonsquared dissimilarities (v = 1). Set squared=True mainly when the dissimilarity is Euclidean, or as a sensitivity check.
- Use weight_permutation="diss" for survey or calibration weights. When weights are supplied, the default "replicate" matches TraMineR and is appropriate only for integer frequency weights.
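The restriction of "replicate" to integer frequency weights can be seen in a short sketch. It assumes that replicate-style permutation expands each case into as many copies as its weight before shuffling, which is how TraMineR describes the option. The helper replicate_indices is hypothetical and not part of Sequenzo.

```python
import numpy as np

def replicate_indices(weights):
    """Expand row indices by integer frequency weights (illustrative only)."""
    w = np.asarray(weights, dtype=float)
    # Fractional weights cannot be turned into whole copies of a case.
    if not np.allclose(w, np.round(w)):
        raise ValueError("replication requires integer frequency weights")
    return np.repeat(np.arange(len(w)), np.round(w).astype(int))

expanded = replicate_indices([2, 1, 3])
print(expanded)
```

Survey or calibration weights are typically fractional, so they cannot be expanded this way, which is why the note above points to "diss" for them.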
Authors
Code: Yuqi Liang
Documentation: Yuqi Liang
References
Studer, M., Ritschard, G., Gabadinho, A., & Müller, N. S. (2011). Discrepancy analysis of state sequences. Sociological Methods & Research, 40(3), 471–510.
Batagelj, V. (1988). Generalized Ward and related clustering problems. In H. H. Bock (Ed.), Classification and Related Methods of Data Analysis (pp. 67–74). North-Holland.