linked_polyadic_sequence_analysis()
linked_polyadic_sequence_analysis() performs linked polyadic sequence analysis to examine patterns and associations within groups of linked sequences. This is useful when you have multiple sequences that belong together as a unit (a "polyad"), such as sequences from different family members (father, mother, child), couples, or other linked entities measured over the same time periods.
The function compares observed patterns within each polyad to what would be expected by chance, using randomization procedures. It computes two key statistics:
- U statistic: Measures how much lower the observed distances within polyads are compared to random expectations. Higher U values indicate stronger within-polyad similarity.
- V statistic: The proportion of randomizations in which the observed distance is smaller than the randomized distance. Values close to 1 indicate that the observed pattern is unlikely to occur by chance.
This analysis helps answer questions like: "Do family members follow similar life course trajectories?" or "Are couples synchronized in their transitions?"
Function Usage
A minimal example with only the required parameters:
result = linked_polyadic_sequence_analysis(seqlist, return_df=True)A complete example with all available parameters:
result = linked_polyadic_sequence_analysis(
seqlist, # required: list of SequenceData objects
a=1, # optional: randomization type
method="OM", # optional: distance measure method
distance_parameters=None, # optional: additional distance parameters
weights=None, # optional: sampling weights for sequences
rand_weight_type=1, # optional: randomization weight strategy
role_weights=None, # optional: role-specific weights
pair_weights=None, # optional: pairwise weights for averaging
T=1000, # optional: number of randomizations
random_seed=36963, # optional: random seed for reproducibility
replace=True, # optional: sample with replacement
n_jobs=1, # optional: number of parallel workers
verbose=True, # optional: show progress bar
return_df=True, # optional: return DataFrame format
return_merged_seqdata=False # optional: return merged sequence data
)Entry Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
seqlist | ✓ | list[SequenceData] | A list of SequenceData objects, one for each role/position in the polyad. For example, for family analysis: [father_sequences, mother_sequences, child_sequences]. All sequences must have the same number of individuals (polyads) and the same sequence length (time points). |
a | ✗ | int | Randomization type. 1 = resample sequences (keeps sequences intact but shuffles which sequences are grouped together). 2 = resample states within sequences (shuffles states within each sequence). Default = 1. |
method | ✗ | str | Distance measure method to use for computing dissimilarities. Options: "OM" (optimal matching), "HAM" (Hamming distance), "CHI2", etc. Default = "OM". |
distance_parameters | ✗ | dict or None | Dictionary of additional keyword arguments to pass to get_distance_matrix(). Can include parameters like sm, indel, norm, etc. Default = None. |
weights | ✗ | np.ndarray or None | Sampling weights for sequences when generating random polyads. Should be an array of length equal to the number of sequences (polyads). If None, uniform weights are used. Default = None. |
rand_weight_type | ✗ | int | Strategy for computing randomization weights. 1 = uniform weights. 2 = sample-weight-based weights. Default = 1. |
role_weights | ✗ | list[float] or None | Role-specific importance weights for different sequence sources. Should be a list with one weight per role, summing to 1.0. If None, equal weights are assigned. Default = None. |
pair_weights | ✗ | np.ndarray or None | Pairwise weights for distance averaging within polyads. Used when computing the average distance among all pairs within a polyad. If None, all pairs are weighted equally. Default = None. |
T | ✗ | int | Number of randomizations to perform. More randomizations give more reliable results but take longer to compute. Default = 1000. |
random_seed | ✗ | int | Seed for the random number generator to ensure reproducibility. Default = 36963. |
replace | ✗ | bool | Whether to sample with replacement during randomization. If True, the same sequence can appear multiple times in a random polyad. Default = True. |
n_jobs | ✗ | int | Number of parallel workers for randomization. Set to -1 to use all available CPUs, or 1 to run sequentially. Default = 1. |
verbose | ✗ | bool | Whether to display a progress bar during randomization. Default = True. |
return_df | ✗ | bool | If True, returns results as a pandas DataFrame with columns: ObservedDist, U, V, V>0.95. If False, returns a dictionary. At least one of return_df or return_merged_seqdata must be True. Default = False. |
return_merged_seqdata | ✗ | bool | If True, also returns the merged SequenceData object used internally. This merged object contains all sequences from all roles concatenated together, which can be useful for further analysis like clustering or visualization. At least one of return_df or return_merged_seqdata must be True. Default = False. |
What It Does
The function performs the following steps:
Validates input: Checks that all sequences in
seqlisthave the same number of individuals (polyads) and the same sequence length.Constructs merged sequence data: Combines sequences from all roles into a single sequence object for efficient distance computation. Sequences are tagged with role identifiers (e.g., "R0_1", "R1_1" for role 0, polyad 1 and role 1, polyad 1).
Computes pairwise dissimilarities: Calculates a full distance matrix between all sequences (across all roles and all polyads).
Performs randomizations: Generates
Trandom polyads by either:- Type 1 (a=1): Randomly selecting sequences from each role (preserving sequence structure)
- Type 2 (a=2): Randomly shuffling states within sequences (breaking sequence structure)
Computes observed distances: For each real polyad, calculates the average distance among all pairs of sequences within that polyad.
Computes randomized distances: For each random polyad, calculates the same average distance.
Calculates statistics:
- U: Mean randomized distance minus observed distance (higher = more similarity within polyads)
- V: Proportion of randomizations where observed distance < randomized distance (higher = more significant)
- V>0.95: Binary indicator of whether V exceeds 0.95 (conventionally used as a significance threshold)
Returns results: Depending on parameter settings, returns either a DataFrame, a dictionary, or a tuple containing results and merged sequence data.
Examples
1. Basic family analysis
Analyze whether family members (father, mother, child) follow similar trajectories:
import pandas as pd
from sequenzo.define_sequence_data import SequenceData
from sequenzo.multidomain.linked_polyad import linked_polyadic_sequence_analysis
# Load sequence data for each family member
# Assume you have three SequenceData objects already prepared
father_sequences = SequenceData(...) # Sequences for fathers
mother_sequences = SequenceData(...) # Sequences for mothers
child_sequences = SequenceData(...) # Sequences for children
# Each should have the same number of families and same time points
seqlist = [father_sequences, mother_sequences, child_sequences]
# Run analysis
result_df = linked_polyadic_sequence_analysis(
seqlist,
method="OM",
sm="TRATE",
indel="auto",
T=1000,
return_df=True
)
print(result_df.head())Output:
ObservedDist U V V>0.95
PolyadID
1 2.45 -1.23 0.85 0
2 1.98 -2.10 0.98 1
3 3.12 -0.45 0.62 0
...2. Couple analysis with custom distance parameters
Analyze couple synchronization:
# Prepare couple sequences
husband_sequences = SequenceData(...)
wife_sequences = SequenceData(...)
seqlist = [husband_sequences, wife_sequences]
# Use custom distance parameters
distance_params = {
"sm": "CONSTANT",
"indel": 1,
"norm": "maxlength"
}
result_df = linked_polyadic_sequence_analysis(
seqlist,
method="OM",
distance_parameters=distance_params,
T=2000, # More randomizations for better precision
n_jobs=-1, # Use all available CPUs
return_df=True
)
# Find couples with significant synchronization (V > 0.95)
significant_couples = result_df[result_df["V>0.95"] == 1]
print(f"Significant couples: {len(significant_couples)} / {len(result_df)}")3. Using state-level randomization
To test whether sequences are more similar than expected even after accounting for state distributions:
result_df = linked_polyadic_sequence_analysis(
seqlist,
a=2, # Resample states within sequences
method="OM",
T=1000,
return_df=True
)Type 2 randomization (a=2) preserves the overall distribution of states but breaks the sequential structure, testing whether the observed similarity is due to shared state distributions rather than shared sequences.
4. Getting merged sequence data for further analysis
If you want to perform clustering or visualization on the merged sequences:
result_df, merged_seqdata = linked_polyadic_sequence_analysis(
seqlist,
method="OM",
return_df=True,
return_merged_seqdata=True
)
# Use merged_seqdata for clustering
from sequenzo.clustering.hierarchical_clustering import Cluster
from sequenzo.dissimilarity_measures import get_distance_matrix
distance_matrix = get_distance_matrix(merged_seqdata, method="OM")
cluster_result = Cluster(matrix=distance_matrix, entity_ids=merged_seqdata.ids)
labels = cluster_result.get_cluster_labels(num_clusters=5)
# Or for visualization
from sequenzo.visualization import plot_index_plot
plot_index_plot(merged_seqdata, group=labels)5. Using weighted sampling
If your sample includes survey weights:
import numpy as np
# Assume you have survey weights for each family
family_weights = np.array([1.5, 0.8, 1.2, ...]) # One weight per family
result_df = linked_polyadic_sequence_analysis(
seqlist,
weights=family_weights,
rand_weight_type=2, # Use sample-weight-based randomization weights
return_df=True
)6. Using role weights
If different roles should have different importance:
# Give more weight to parent sequences than child sequences
role_weights = [0.4, 0.4, 0.2] # Father, Mother, Child
result_df = linked_polyadic_sequence_analysis(
seqlist,
role_weights=role_weights,
return_df=True
)7. Using pair weights
To weight some pairs within polyads more heavily:
import numpy as np
# For 3 roles, there are 3*(3-1)/2 = 3 pairs
# Weight parent-parent pairs more than parent-child pairs
pair_weights = np.array([0.5, 0.3, 0.2]) # Weights for pairs (R0-R1, R0-R2, R1-R2)
result_df = linked_polyadic_sequence_analysis(
seqlist,
pair_weights=pair_weights,
return_df=True
)Understanding the Output
DataFrame Format (when return_df=True)
The DataFrame contains one row per polyad with the following columns:
ObservedDist: The average distance among all pairs of sequences within this polyadU: Mean randomized distance minus observed distance. Positive values indicate that observed distances are smaller than expected by chance (more similarity within polyads)V: Proportion of randomizations where the observed distance was smaller than the randomized distance. Values range from 0 to 1. Values close to 1 indicate strong evidence that the observed pattern is unlikely to occur by chanceV>0.95: Binary indicator (0 or 1) showing whether V exceeds 0.95, which is often used as a significance threshold
Dictionary Format (when return_df=False)
The dictionary contains:
mean.dist: Dictionary with keys"Obs"and"Rand"showing mean observed and mean randomized distancesU: Array of U values (one per polyad)V: Array of V values (one per polyad)V.95: Binary array indicating V > 0.95 (one per polyad)observed.dist: Array of observed distances (one per polyad)random.dist: Array of randomized distances (one per randomization, length = T)
Interpreting Results
U statistic:
- Positive U: Observed distances are smaller than random expectations → more similarity within polyads than expected
- Negative U: Observed distances are larger than random expectations → less similarity within polyads than expected
- Larger positive U values indicate stronger within-polyad similarity
V statistic:
- V = 0.95 means that in 95% of randomizations, the observed distance was smaller than the randomized distance
- V values near 1 indicate strong evidence that the observed pattern is unlikely to occur by chance
- Conventionally, V > 0.95 is considered significant
Mean distances:
- Compare mean observed distance to mean randomized distance
- If mean observed < mean randomized, polyads are more similar than expected by chance
Important Notes
Data structure requirements:
- All sequences in
seqlistmust have the same number of individuals (same number of polyads) - All sequences must have the same length (same number of time points)
- The order of individuals must match across all roles (family 1 in father data should correspond to family 1 in mother data, etc.)
- All sequences in
Randomization type selection:
- a=1 (resample sequences): Tests whether observed polyads are more similar than random groupings. Preserves sequence structure but breaks real groupings.
- a=2 (resample states): Tests whether observed similarity is due to sequential patterns (beyond just state distributions). Breaks sequence structure but preserves state distributions.
Number of randomizations (T):
- More randomizations (higher T) give more reliable results but take longer to compute
- T=1000 is usually sufficient for most purposes
- For publication-quality results, consider T=5000 or more
Parallel processing:
- Set
n_jobs=-1to use all available CPUs for faster computation - Be aware that some IDEs (like PyCharm) may show errors related to multiprocessing; these are harmless and don't affect results
- Set
Reproducibility:
- Always set
random_seedif you need reproducible results - Different seeds will produce slightly different randomized distances (but conclusions should be similar)
- Always set
Return format:
- At least one of
return_dforreturn_merged_seqdatamust beTrue - Use
return_merged_seqdata=Trueif you plan to do further analysis (clustering, visualization) on the merged sequences
- At least one of
Computational considerations:
- This function can be computationally intensive, especially with many polyads and many randomizations
- Consider starting with smaller T (e.g., 100) for testing, then increasing for final analysis
- Using
n_jobs=-1significantly speeds up computation on multi-core machines
Author
Code: Yuqi Liang
Documentation: Yuqi Liang
References
Liao, T. F. (2021). Using sequence analysis to quantify how strongly life courses are linked. Sociological Science, 8, 48-72.