Working with Weighted Data
In some social science datasets, you will find a column called weight. This column provides survey or design weights, which adjust the contribution of each individual to the analysis. If your dataset does not include such a variable, you can safely ignore this feature. However, understanding how weights work will give you a deeper grasp of social sequence analysis.
What are weights?
In many datasets in social sciences, each row does not represent the same fraction of the population. For example:
- In a survey dataset, some respondents may carry higher sampling weights to correct for unequal probabilities of selection.
Example: In the pairfam-family dataset, the variable weight40 is provided so that trajectories at age 40 are representative of the German population. If you ignore it, some social groups may be over- or under-represented.
For instance:
id = 111000
weight40 = 0.343964278697968This does not mean the person is “one-third of a person.” Instead, Pairfam used a complex sampling design (stratification, unequal selection probabilities), and over time there is also attrition (dropout).
To make the sample representative of the German population at age 40, the survey team provides a calibration weight. The value tells us how much this person should count when computing population estimates.
Thus, if weight40 < 1 (like 0.34), this respondent comes from a group that is overrepresented in the raw sample, so their influence must be down-weighted.
If weight40 > 1 (like 1.7), this respondent comes from a group that is underrepresented, so their influence must be boosted.
In practice, using weights means that descriptive statistics and sequence distributions reflect the population structure, not just the raw sample.
In an administrative dataset, some individuals may represent many others (e.g., when rows are aggregated by groups).
- Example: If each row corresponds to a municipality’s unemployment trajectory, and the dataset also contains the size of the municipality, using that size as a weight ensures large municipalities count more than small ones.
In a simulated dataset, you may want to emphasize certain cases more than others.
- Example: Suppose you simulate 10 “rare” career paths but want them to contribute as much as 100 “common” ones. You can give the rare paths weight = 10 each, so that they have equal influence when computing descriptive statistics or substitution costs.
Overall, a weight is simply a non-negative number attached to each sequence that tells you how much that sequence should count in descriptive statistics and cost estimation.
By default, Sequenzo assumes all sequences are equally weighted (weight = 1 in SequenceData()).
Why use weights?
Weights allow your results to reflect the true population structure, not just the raw sample. For example:
- Without weights: A small oversampled subgroup may dominate the transition rates used to compute substitution costs.
- With weights: Substitution costs reflect the intended population frequencies.
This is especially important if you are analyzing survey data or data that have been stratified, clustered, or aggregated.
How to provide weights with Sequenzo
When you define your dataset, pass an array of weights:
from sequenzo import SequenceData
seq = SequenceData(
data=df,
time=list(df.columns)[1:], # Sequence columns
states=["A", "B", "C"],
id_col="ID",
weights=df["survey_weight"].values # One weight per sequence
)If no weights are given, Sequenzo automatically sets them to 1.
How weights change the sequence analysis workflow
Most of the workflow is unchanged: you still compute distances, cluster, and visualize as usual. But weights are integrated throughout Sequenzo in four main areas. Understanding how weights work in each area will help you use them effectively:
1. SequenceData Objects
How weights are stored:
The SequenceData class stores weights as an attribute that can be accessed and used throughout the analysis:
from sequenzo import SequenceData
import numpy as np
import pandas as pd
# Create sequence data with weights
data = pd.DataFrame({
'ID': [1, 2, 3],
'T1': ['A', 'B', 'A'],
'T2': ['B', 'B', 'A']
})
weights = np.array([1.5, 2.0, 0.5]) # Different weights for each sequence
seqdata = SequenceData(
data=data,
time=['T1', 'T2'],
states=['A', 'B'],
weights=weights # Pass weights during initialization
)
# Access weights
print(seqdata.weights) # [1.5, 2.0, 0.5]
# Flatten weights for time-point level analysis
flattened = seqdata.flatten_weights() # Repeats each weight for each time pointWhere weights are used in SequenceData:
- Initialization: Weights can be provided when creating a
SequenceDataobject - Property access:
seqdata.weightsreturns the weight array - Flattening:
flatten_weights()method repeats weights across sequence length for time-point level analysis - Descriptive statistics and cross-tabs:
get_xtabs(other, weighted=True)uses weights when counting co-occurrences. Weighted counts affect transition probabilities and derived summaries. - Uniqueness statistics:
check_uniqueness_rate(weighted=True)reports uniqueness rates after accounting for weights. If a rare sequence has a very large weight, it contributes proportionally.
Example:
# Create weighted sequence data
seqdata = SequenceData(
data=df,
time=['Year1', 'Year2', 'Year3'],
states=['Employed', 'Unemployed'],
weights=np.array([1.2, 0.8, 1.5, 1.0]) # 4 sequences with different weights
)
# Check weighted uniqueness
stats = seqdata.check_uniqueness_rate(weighted=True)
print(f"Weighted uniqueness rate: {stats['weighted_uniqueness_rate']}")2. Visualization Functions
How weights are used in visualization:
Most visualization functions in Sequenzo support weights through the weights="auto" parameter, which automatically uses seqdata.weights if available:
Functions with weight support:
plot_mean_time()- Weighted mean time spent in each stateplot_state_distribution()- Weighted state distributions over timeplot_modal_state()- Weighted modal state at each time pointplot_most_frequent_sequences()- Weighted sequence frequenciesplot_relative_frequency()- Weighted representative sequencesplot_sequence_index()- Sorting and grouping by weightsplot_transition_matrix()- Weighted transition rates
Usage pattern:
from sequenzo import plot_mean_time, plot_state_distribution
# Automatic weight detection (recommended)
plot_mean_time(seqdata, weights="auto") # Uses seqdata.weights if available
# Explicit weights
custom_weights = np.array([1.0, 2.0, 1.5])
plot_mean_time(seqdata, weights=custom_weights)
# No weights (unweighted)
plot_mean_time(seqdata, weights=None)How weights work in visualization:
- Plots that summarize frequencies or means (state-distribution, modal-state, most-frequent sequences, mean time, transition matrix, relative frequency plot) compute weighted tallies and divide by the total weight.
- Pure “row-rendering” plots (sequence index) do not replicate rows by weight; instead, we offer weight-aware sorting and labeling.
- Error bars for proportions use an effective sample size
to approximate .
This keeps pictures faithful to population structure when weights are present, while preserving backward-compatible behavior when weights are absent.
Downstream analysis
- Pairwise distances (OM, HAM, DHD, etc.) are not directly weighted per sequence.
- Instead, weights shape the cost matrix, which then affects distances indirectly.
For K-Medoids / PAM / PAMonce, weights are used explicitly in the objective: each point’s distance to its medoid is multiplied by its weight when computing the total cost.
In other words, with weights, the chosen medoids and the final cluster assignments can differ from the equal-weight case; with extreme weights, a few sequences may even dominate the solution.
Unlike K-Medoids, the sequence weights of hierarchical clustering are not applied during the linkage construction itself, but only in the evaluation of cluster quality. Details follow below.
Example:
# Create sequence data with weights
seqdata = SequenceData(
data=df,
time=['T1', 'T2', 'T3'],
states=['A', 'B', 'C'],
weights=np.array([1.0, 2.0, 1.5, 0.5]) # 4 sequences
)
# Visualizations automatically use weights
plot_state_distribution(seqdata, weights="auto") # Weighted distributions
plot_mean_time(seqdata, weights="auto") # Weighted mean times
plot_most_frequent_sequences(seqdata, weights="auto") # Weighted frequencies3. Distance Matrix Computation
How weights are used in distance calculations:
The get_distance_matrix() function uses weights in two main ways:
- Substitution cost matrix calculation: When
weighted=True(default), state distributions used to compute substitution costs account for sequence weights - CHI2 and EUCLID methods: These methods use weighted state distributions
Usage:
from sequenzo import get_distance_matrix
# Create sequence data with weights
seqdata = SequenceData(
data=df,
time=['T1', 'T2', 'T3'],
states=['A', 'B', 'C'],
weights=np.array([1.0, 2.0, 1.5])
)
# Compute distance matrix with weighted substitution costs
distance_matrix = get_distance_matrix(
seqdata=seqdata,
method="OM",
sm="TRATE", # Transition rates computed with weights
weighted=True # Use weights in state distribution calculations
)
# For CHI2 method, weights affect state distributions
chi2_distances = get_distance_matrix(
seqdata=seqdata,
method="CHI2",
weighted=True # Weighted chi-square distances
)How weights affect distance calculations:
Substitution cost matrices (
get_substitution_cost_matrix()):- If you choose
sm="TRATE"(transition-rate based), Sequenzo uses weighted transition rates. - This means substitution costs reflect population-weighted transitions rather than raw frequencies.
- With
sm="CONSTANT", weights have no effect.
- If you choose
Insertion/deletion costs (
indel):- When
indel="auto", its value may be derived from the substitution matrix. - If the substitution matrix was built using weights, the indel cost is indirectly weight-dependent.
- When
CHI2 distance: Uses weighted state distributions to compute chi-square statistics
EUCLID distance: Uses weighted state distributions for Euclidean distance calculation
Note: Pairwise distances (OM, HAM, DHD, etc.) are not directly weighted per sequence. Instead, weights shape the cost matrix, which then affects distances indirectly.
Example:
# Weighted substitution cost matrix
from sequenzo import get_substitution_cost_matrix
sm = get_substitution_cost_matrix(
seqdata=seqdata,
method="TRATE",
weighted=True # Transition rates account for weights
)
# Weighted distance matrix
dist = get_distance_matrix(
seqdata=seqdata,
method="OM",
sm=sm,
weighted=True
)4. Tree Analysis and Clustering
How weights are used in tree analysis:
Sequence trees (
build_sequence_tree()):- Extracts weights from
seqdata.weightswhenweighted=True - Uses weights in distance association tests and tree splitting
- Extracts weights from
Distance trees (
build_distance_tree()):- Accepts weights as parameter
- Uses weights in pseudo-variance calculations and permutation tests
Distance association (
compute_distance_association()):- Computes weighted inertia (sum of weighted squared distances)
- Formula:
SCtot = sum(w_i * w_j * d_ij)for all pairs (i,j)
Example:
from sequenzo.tree_analysis import build_sequence_tree
# Build tree with weighted analysis
tree_result = build_sequence_tree(
seqdata=seqdata,
predictors=predictors_df,
weighted=True # Uses seqdata.weights automatically
)How weights work in clustering:
- Cluster quality: Weighted cluster quality indices (ASWw, etc.)
- Medoid selection: Weighted medoid computation considers sequence weights
- Pseudo-variance: Weighted pseudo-variance for distance matrices:python
# Formula: sum(w_i * w_j * d_ij) / (sum(weights))^2
For K-Medoids / PAM / PAMonce, weights are used explicitly in the objective: each point's distance to its medoid is multiplied by its weight when computing the total cost.
In other words, with weights, the chosen medoids and the final cluster assignments can differ from the equal-weight case; with extreme weights, a few sequences may even dominate the solution.
Summary: Weight Usage Patterns
Throughout Sequenzo, weights are used in consistent patterns:
Pattern 1: Automatic weight detection (weights="auto")
# Most visualization and analysis functions support this
function(seqdata, weights="auto") # Uses seqdata.weights if availablePattern 2: Explicit weights
# Pass weights directly
function(seqdata, weights=np.array([1.0, 2.0, 1.5]))Pattern 3: Weighted parameter
# Boolean flag to enable/disable weighting
function(seqdata, weighted=True) # Uses seqdata.weights when TruePattern 4: No weights (unweighted)
# Explicitly disable weights
function(seqdata, weights=None)
# or
function(seqdata, weighted=False)Best Practices
- Store weights in SequenceData: Always provide weights when creating
SequenceDataobjects, then useweights="auto"in functions - Consistent weighting: Use the same weights throughout your analysis pipeline
- Weight normalization: Weights don't need to sum to 1, but be aware of how this affects interpretations
- Missing weights: If weights are not provided, functions default to equal weights (all sequences weighted equally)
Weighted Statistics Utilities
Now that we understand how weights are used throughout Sequenzo, let's examine the utility functions that standardize weighted statistical calculations.
Background: Previous Implementation
Before the creation of this module, weighted statistical calculations were implemented inline within various functions throughout Sequenzo. For example:
In sequenzo/sequence_characteristics/cross_sectional_indicators.py:
- The
get_mean_time_in_states()function calculated weighted means directly:python# Line 82: Inline weighted mean calculation mtime = np.sum(istatd_values * weights[:, np.newaxis], axis=0) / wtot # Lines 94-95: Inline weighted variance calculation var = np.sum(weights[:, np.newaxis] * (vcent ** 2), axis=0) * wtot / (wtot ** 2 - w2tot)
In sequenzo/visualization/plot_mean_time.py:
- The
_compute_mean_time()function usednp.average()with weights for weighted calculations
In other modules:
- Various functions implemented weighted calculations using NumPy operations directly
- Each implementation had slight variations and potential inconsistencies
- No centralized place to update or maintain weighted calculation logic
Current Implementation: Unified Utility Functions
The weighted_stats.py module provides standardized, reusable functions that match TraMineR's implementation exactly. These functions can be used throughout Sequenzo to ensure consistency and maintainability.
Why this matters:
Given that weights are used extensively across SequenceData, visualization, distance matrices, and tree analysis (as described above), having standardized weighted statistics functions ensures:
- Consistency: All weighted calculations use the same implementation
- Maintainability: Changes to weighted calculation logic only need to be made in one place
- TraMineR Compatibility: These functions match TraMineR's implementation exactly
- Code Reusability: Avoid duplicating weighted calculation code throughout the codebase
Available Functions
1. weighted_mean()
Computes the weighted mean of a vector.
Corresponds to R function: wtd.mean() in TraMineR-wtd-stats.R
Usage:
from sequenzo import weighted_mean
import numpy as np
x = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 2, 1, 2, 1])
mean = weighted_mean(x, weights=weights)Parameters:
x: Input vector of valuesweights: Optional weights for each observation. If None, computes unweighted mean.normwt: Normalization flag (kept for API compatibility, but ignored)na_rm: If True, remove NA/NaN values before computation (default: True)
Returns: Weighted mean as a float
2. weighted_variance()
Computes the weighted variance of a vector.
Corresponds to R function: wtd.var() in TraMineR-wtd-stats.R
Usage:
from sequenzo import weighted_variance
import numpy as np
x = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 2, 1, 2, 1])
variance = weighted_variance(x, weights=weights, method='unbiased')Parameters:
x: Input vector of valuesweights: Optional weights for each observation. If None, computes unweighted variance.normwt: If True, normalize weights so they sum to length(x) (default: False)na_rm: If True, remove NA/NaN values before computation (default: True)method: Method for variance calculation:'unbiased': Unbiased frequency weights (uses n-1 denominator) - default'ML': Maximum likelihood (uses n denominator)
Returns: Weighted variance as a float
3. weighted_five_number_summary()
Computes the weighted five-number summary (minimum, Q1, median, Q3, maximum).
Corresponds to R function: wtd.fivenum.tmr() in TraMineR-wtd-stats.R
Usage:
from sequenzo import weighted_five_number_summary
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
weights = np.array([1, 1, 2, 2, 1, 1, 2, 2, 1, 1])
fivenum = weighted_five_number_summary(x, weights=weights)
# Returns: [min, Q1, median, Q3, max]Parameters:
x: Input vector of valuesweights: Optional weights for each observation. If None, uses equal weights.na_rm: If True, remove NA/NaN values before computation (default: True)
Returns: NumPy array of length 5 containing [min, Q1, median, Q3, max]
Why Use These Functions?
Benefits
- Consistency: All weighted calculations use the same implementation, ensuring consistent results across Sequenzo
- Maintainability: Changes to weighted calculation logic only need to be made in one place
- TraMineR Compatibility: These functions match TraMineR's implementation exactly, ensuring compatibility with R-based workflows
- Code Reusability: Avoid duplicating weighted calculation code throughout the codebase
- Documentation: Centralized documentation makes it easier for users to understand weighted calculations
Migration from Inline Calculations
If you're working with code that uses inline weighted calculations, consider refactoring to use these utility functions:
Before (inline):
# Inline weighted mean
wtot = np.sum(weights)
mtime = np.sum(values * weights[:, np.newaxis], axis=0) / wtotAfter (using utility function):
from sequenzo.utils import weighted_mean
mtime = weighted_mean(values, weights=weights)TraMineR Reference
These functions are based on TraMineR's weighted statistics implementation:
- Source File:
TraMineR-wtd-stats.R - Original Package: Based on Hmisc package functions (included in TraMineR to avoid dependencies)
- GitHub: https://github.com/cran/TraMineR/blob/master/R/TraMineR-wtd-stats.R
The Python implementations match the R functions' behavior exactly, including:
- Handling of missing values
- Weight normalization options
- Variance calculation methods
- Five-number summary interpolation for unequal weights
Usage Examples
Example 1: Basic Weighted Mean
from sequenzo import weighted_mean
import numpy as np
# Sample data
values = np.array([10, 20, 30, 40, 50])
weights = np.array([1, 2, 3, 2, 1])
# Calculate weighted mean
result = weighted_mean(values, weights=weights)
print(f"Weighted mean: {result}")
# Output: Weighted mean: 30.0Example 2: Weighted Variance with Different Methods
from sequenzo import weighted_variance
import numpy as np
values = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 2, 1, 2, 1])
# Unbiased variance (default)
var_unbiased = weighted_variance(values, weights=weights, method='unbiased')
# Maximum likelihood variance
var_ml = weighted_variance(values, weights=weights, method='ML')
print(f"Unbiased variance: {var_unbiased}")
print(f"ML variance: {var_ml}")Example 3: Weighted Five-Number Summary
from sequenzo import weighted_five_number_summary
import numpy as np
# Sample data with unequal weights
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
weights = np.array([1, 1, 2, 2, 1, 1, 2, 2, 1, 1])
# Calculate five-number summary
summary = weighted_five_number_summary(values, weights=weights)
print(f"Five-number summary: {summary}")
print(f"Min: {summary[0]}, Q1: {summary[1]}, Median: {summary[2]}, Q3: {summary[3]}, Max: {summary[4]}")Example 4: Handling Missing Values
from sequenzo import weighted_mean
import numpy as np
# Data with missing values
values = np.array([1, 2, np.nan, 4, 5])
weights = np.array([1, 2, 1, 2, 1])
# Automatically handles NaN (na_rm=True by default)
result = weighted_mean(values, weights=weights, na_rm=True)
print(f"Weighted mean (NaN removed): {result}")
# Or keep NaN
result_with_nan = weighted_mean(values, weights=weights, na_rm=False)
print(f"Weighted mean (with NaN): {result_with_nan}")Import Options
You can import these functions in several ways:
# Option 1: Direct import from sequenzo (recommended)
from sequenzo import weighted_mean, weighted_variance, weighted_five_number_summary
# Option 2: Import from utils submodule
from sequenzo.utils import weighted_mean, weighted_variance, weighted_five_number_summary
# Option 3: Wildcard import
from sequenzo import *Implementation Details
Weighted Mean Formula
The weighted mean is calculated as:
weighted_mean = sum(weights * x) / sum(weights)Weighted Variance Formula
For unbiased frequency weights:
xbar = sum(weights * x) / sum(weights)
variance = sum(weights * (x - xbar)^2) / (sum(weights) - 1)For maximum likelihood:
variance = sum(weights * (x - xbar)^2) / sum(weights)Weighted Five-Number Summary
The five-number summary uses weighted quantile interpolation for unequal weights:
- For equal weights: Uses standard fivenum positions
- For unequal weights: Uses interpolated index calculation based on cumulative weights
Notes
All functions handle edge cases such as:
- Zero weights
- All weights equal to zero
- Single observation
- Missing values (when
na_rm=True)
The functions are designed to match TraMineR's behavior exactly, ensuring compatibility with R-based sequence analysis workflows.
These functions are used internally throughout Sequenzo, but are also available for direct use by users who need weighted statistical calculations.
See Also
sequenzo/sequence_characteristics/cross_sectional_indicators.py- Example of functions that could use these utilitiessequenzo/visualization/plot_mean_time.py- Example of weighted calculations in visualizationsequenzo/define_sequence_data.py- SequenceData class with weights supportsequenzo/dissimilarity_measures/get_distance_matrix.py- Weighted distance calculationssequenzo/dissimilarity_measures/get_substitution_cost_matrix.py- Weighted substitution costs- TraMineR documentation: https://cran.r-project.org/package=TraMineR
Weighted data in hierarchical clustering
Why weights are applied to cluster quality indicators (CQIs) but not to the linkage function
In Sequenzo, weights are supported in clustering quality indicators (e.g., ASWw, R², HG) and in K-Medoids / PAM variants, but not in hierarchical linkage computation (which is under the hood of the method=ward parameter for Cluster()). This is intentional and follows both methodological and practical considerations.
1. Quality metrics are naturally weightable.
Most evaluation statistics are defined as averages over cases (e.g., average silhouette width, explained variance). Extending these to weighted versions is straightforward: each case contributes proportionally to its weight. This ensures that the assessment of clustering solutions correctly reflects sample design or population importance.
2. Hierarchical linkage is more complicated.
Many popular linkage rules, such as Ward (frequently used in sequence analysis), centroid, or median, are rooted in geometric interpretations that only hold in Euclidean feature spaces.
When applied to non-Euclidean dissimilarities such as Optimal Matching or other sequence dissimilarity distances, these formulas are already approximations. Adding weights on top does not have a clear or widely accepted definition, and may produce unstable or misleading dendrograms.
3. Limited applicability of weighted linkage.
Among the common linkage methods, only average linkage (UPGMA) can be safely extended to incorporate weights: the distance between clusters is defined as the weighted average of all pairwise dissimilarities.
For single or complete linkage, weights are irrelevant (they always use min/max). For Ward and centroid-based methods, a valid weighted version requires access to Euclidean coordinates, which are not available for general sequence distances.
4. Design choice in Sequenzo.
Because of these limitations, Sequenzo does not attempt to provide weighted linkage. Instead, users who require weighted clustering should rely on:
- evaluating hierarchical partitions using the weighted quality metrics, or
- the weighted K-Medoids / PAM / PAMonce implementations.
This approach ensures that weights are respected where the theory is sound, while avoiding methods that would give a false sense of rigor.
Weighted data in K-Medoids and its variants
Sequenzo implements K-Medoids clustering along with two common variants: PAM (Partitioning Around Medoids) and PAMonce. All three methods group sequences by minimizing the total dissimilarity between each sequence and its assigned medoid (the most representative sequence of the cluster). Unlike hierarchical clustering, which produces a full tree, K-Medoids directly returns a flat partition of the data.
How the algorithms differ
KMedoids (standard)
This version follows the classic algorithm: it repeatedly swaps candidate medoids with non-medoids until no further improvement is possible. It guarantees the optimal partition given enough passes, but can be computationally demanding on large datasets.
PAM (Partitioning Around Medoids)
PAM also seeks optimal medoids but typically explores a broader set of swaps and refinements in each iteration. It tends to give more stable results than KMedoids, but at higher computational cost.
PAMonce
This is a faster, approximate variant. Instead of repeatedly updating medoids, it performs only one refinement pass from the initial selection. PAMonce is much faster on large datasets and often yields results close to the full PAM algorithm, though sometimes at the expense of cluster quality.
Role of weights
All three methods accept a weights argument. By default, each sequence has equal weight (weights = 1). When you supply a weight vector, the clustering objective changes:
- Distances from each sequence to its assigned medoid are multiplied by its weight.
- Heavier weights make a sequence more influential when choosing medoids and forming clusters.
- In extreme cases, a single sequence with a very large weight can dominate cluster formation.
This makes weighted K-Medoids especially useful when working with survey data, stratified samples, or aggregated sequences where some units represent larger populations than others.
Practical guidance
- If you want population-representative clusters, always pass the survey or design weights.
- If you only care about the structure of your sample, you can ignore weights and treat all sequences equally.
- For large datasets, try PAMonce first for speed, and switch to PAM if you need more precise partitions.
- Use the
npassparameter to control how many random restarts are attempted. More passes increase stability at the cost of runtime.
K-Medoids / PAM / PAMonce: Weighted vs. unweighted (toy example)
Below is a minimal, self-contained example that you can run without computing sequence distances first (we use a small hand-crafted dissimilarity matrix). It demonstrates three things:
- How to call KMedoids, PAM, and PAMonce
- How to pass a weight vector
- How weights can change the chosen medoids and assignments
Setup
import numpy as np
from sequenzo.clustering.KMedoids import KMedoids
# For reproducible random initializations
np.random.seed(42)A tiny distance matrix with two natural groups
We build an 8×8 distance matrix with two tight groups:
- Group A: indices 0–3
- Group B: indices 4–7
And we place a mild cross-group separation so that, without weights, both groups are equally plausible.
# 8 objects, 2 natural groups of size 4
n = 8
D = np.zeros((n, n), dtype=float)
# Intra-group distances are small
for i in range(4):
for j in range(4):
if i != j:
D[i, j] = 0.3 # Group A (0-3)
for i in range(4, 8):
for j in range(4, 8):
if i != j:
D[i, j] = 0.3 # Group B (4-7)
# Cross-group distances are larger, but not huge
for i in range(4):
for j in range(4, 8):
D[i, j] = D[j, i] = 1.0
# Sanity: zero diagonal
np.fill_diagonal(D, 0.0)
print("Distance matrix D:\n", np.round(D, 2))Case 1: PAMonce (fast), unweighted
k = 2
labels_unweighted = KMedoids(
diss=D,
k=k,
method="PAMonce", # fast one-pass refinement
npass=10, # 10 random restarts; increase for extra stability
weights=None # None -> all ones (equal weights)
)
print("\n[PAMonce | unweighted] cluster labels:", labels_unweighted)Case 2: PAMonce, weighted
Now we give one object a very large weight, making it represent a much bigger share of the population. This often changes which medoids are optimal and can flip some assignments.
# Everyone weight=1, except object #6 (index 6) is very heavy
w = np.ones(n, dtype=float)
w[6] = 50.0
labels_weighted = KMedoids(
diss=D,
k=k,
method="PAMonce",
npass=10,
weights=w
)
print("[PAMonce | weighted] cluster labels:", labels_weighted)
print("Weights:", w)What to expect:
- Unweighted: two balanced clusters (0–3) vs. (4–7).
- Weighted: the algorithm prefers solutions that keep the heavy object (#6) close to its medoid; the chosen medoids and the final partition can shift accordingly (e.g., the medoid inside 4–7 might move closer to #6, or #6 can “pull” a borderline case across the split).
Case 3: Full PAM (more thorough), weighted
PAM explores more swaps/refinements than PAMonce. It’s slower but can yield higher-quality partitions. If you have time, compare:
labels_pam_weighted = KMedoids(
diss=D,
k=k,
method="PAM", # more thorough than PAMonce
npass=5, # a few restarts (increase for large datasets)
weights=w
)
print("[PAM | weighted] cluster labels:", labels_pam_weighted)Case 4: Classic KMedoids, weighted
This is the textbook variant; it iterates swaps until no improvement is possible (given the current run), typically between PAMonce and PAM in cost/speed profile depending on implementation details.
labels_kmedoids_weighted = KMedoids(
diss=D,
k=k,
method="KMedoids",
npass=5,
weights=w
)
print("[KMedoids | weighted] cluster labels:", labels_kmedoids_weighted)Interpreting results
- All three variants accept weights and use them in the objective (each object’s dissimilarity to its medoid is multiplied by its weight).
- If you set one or a few very large weights, those objects will strongly influence the medoid choice and the final partition.
- For large datasets, start with PAMonce for speed. If needed, validate with PAM on a subset or with fewer passes.
- If your downstream analysis requires population-representative clusters (e.g., survey data), pass the survey/design weights here. If you care only about the sample’s internal structure, you can safely omit weights.
Tip: For reproducibility, set
np.random.seed(...)and use a non-zeronpass(multiple restarts) to reduce sensitivity to initialization.
Using initial medoids (optional)
If you already have good candidate medoids (e.g., from domain knowledge or a warm start), you can pass them via initialclust:
# Suppose we want to start from medoids at indices [1, 6]
labels_from_init = KMedoids(
diss=D,
k=2,
method="PAMonce",
initialclust=[1, 6], # medoid indices (0-based)
npass=0, # skip random restarts when you trust the init
weights=w
)
print("[PAMonce | weighted | init=[1,6]] labels:", labels_from_init)When initialclust is provided as a membership vector (one label per object) or as a hierarchical linkage matrix, Sequenzo will convert it to starting medoids internally (see the docstring for details). This is handy when you want to “lift” a solution found on a subsample to the full dataset.
Author: Yuqi Liang, Translate: Sizhu Qu