KMedoids()

Function for medoid-based clustering (the center point of each cluster is an actual data point). It supports three algorithms: KMedoids, PAM (Partitioning Around Medoids), and PAMonce (PAM with a once-only swap pass). Their principles and differences are detailed in the document KMedoids v.s. PAM (PAMonce) (to be added). All parameters of KMedoids follow WeightedCluster::KMedoids.

Sequenzo also implements hierarchical clustering. For related learning materials, see the video How to interpret a hierarchical clustering dendrogram? How to combine sequence analysis with regression analysis?.

Function Usage

A minimal example with only the required parameters (sufficient for most use cases):

python
clustering = KMedoids(
   diss=om,      # dissimilarity matrix (DataFrame or ndarray)
   k=5           # number of clusters
)

A complete example with all available parameters (for advanced customization):

python
clustering = KMedoids(
    diss=om,                # dissimilarity matrix, from get_distance_matrix()
    k=5,                    # number of clusters
    method='PAMonce',       # clustering algorithm: 'KMedoids', 'PAM', or 'PAMonce'
    initialclust=None,      # optional: medoid IDs / membership matrix / linkage matrix
    npass=1,                # number of passes (repeat optimization loops)
    weights=None            # optional: weight for each sequence (defaults to 1)
)

Entry parameters

diss

That is, the "dissimilarity matrix". Required. The allowed data types are DataFrame and ndarray: a distance matrix, as calculated by get_distance_matrix().

k

Required. The allowed data type is int: the number of clusters to form.

method

Optional. The allowed data type is str; the default value is 'PAMonce'. It selects the clustering algorithm to use.

Three methods are supported: KMedoids, PAM (Partitioning Around Medoids), and PAMonce (PAM with a once-only swap pass). The cluster centers produced by all three methods are actual data points. PAM and PAMonce are variants of KMedoids; the differences between them are described in the document [to be added].

initialclust

That is, the "initial_cluster_assignment" (an initialized cluster structure or an initial cluster allocation scheme). Optional. The allowed data types are list, numpy.ndarray, and linkage matrix. The default value is None.

  • Initialized cluster structure: specifies the cluster centers. Since there are generally ≥ 2 cluster centers, a list or numpy.ndarray is passed in, indicating how many clusters there are and which points serve as their initial centers.

  • Initial cluster allocation scheme: a membership matrix, indicating which cluster each data point is assigned to. It is passed in as a list/numpy.ndarray or a linkage matrix.

Before a clustering algorithm starts, the initial center points must be determined by some strategy. If the initialclust parameter is provided, its data are used directly or indirectly to obtain the initial centers.

  • If set to None, the program randomly selects k initial medoid points from the range [1, number_of_elements].

  • If set to list/numpy.ndarray, it can be either: (1) a user-defined list of initial cluster centers (each element must be a valid data point id); the specified points are used as the initial medoids, and in this case the number of elements must equal k; or (2) a membership matrix (the number of elements must match the number of data points); the program then uses the weights to select one initial center from each given cluster. For how this choice is made, see Supplementary Details.

  • If set to linkage matrix, the program first uses hierarchical clustering via cut_tree() to obtain a membership matrix, and then selects initial medoids from the resulting clusters based on weights. See Supplementary Details for more information.

The selection scenarios for the three parameter types are as follows:

| Parameter | Scenario |
| --- | --- |
| None | The user wants to leave initialization entirely to the program (by default the program randomly selects k data points as medoids; there is no default membership matrix). |
| list/numpy.ndarray | Scenario 1 (initial cluster structure, i.e., specifying medoids): the user wants to manually specify the medoids as the initial centers instead of letting the program select them randomly. Scenario 2 (initial cluster assignment, i.e., a membership matrix): the user already has a good membership matrix and wants the program to use it to find suitable initial medoids. |
| linkage matrix | The user has a linkage matrix representing the merging process of hierarchical clustering and wants the program to use it to find suitable initial medoids. |

We found that specified medoids are most often represented using list, while membership matrices are more commonly represented using numpy.ndarray.
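As an illustration of the linkage-matrix path, the sketch below (assuming SciPy is available; the toy data and Ward linkage are made-up placeholders, not Sequenzo internals) shows how a linkage matrix can be reduced to a membership vector with cut_tree(), which is the same intermediate form the program derives before selecting initial medoids:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
points = rng.normal(size=(12, 2))   # toy data: 12 points in 2D

# Build a linkage matrix via hierarchical clustering (Ward's method here).
Z = linkage(points, method="ward")

# cut_tree() turns the linkage matrix into a membership vector for k clusters.
k = 3
membership = cut_tree(Z, n_clusters=k).ravel()

print(membership.shape)        # one cluster label per point
print(np.unique(membership))   # cluster ids 0 .. k-1
```

From this membership vector, one initial medoid per cluster would then be chosen using the weights, as described in Supplementary Details.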

npass

This refers to the "number of passes". Optional. The accepted data type is int, with a default value of 1. It determines how many times the data point assignment process is repeated.

One data assignment process (taking PAM as an example) refers to: iterating through each cluster and attempting to find a better medoid (see Supplementary Details for the evaluation criterion of "better"). A full traversal of all clusters constitutes one round of data point reassignment.

npass specifies how many times this process will be repeated.

weights

This parameter is optional and accepts input of type list/numpy.ndarray; the default is None. It represents the weight of each data point.

If set to None, all sequences are assigned a default weight of 1.

Return Value

pandas.Series: the cluster membership for each ID, indicating which cluster each sequence belongs to.
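For downstream use, the returned labels can be handled like any pandas Series. A minimal sketch (the label values and IDs below are made up for illustration, not real output):

```python
import pandas as pd

# Hypothetical clustering result: one cluster label per sequence ID.
clustering = pd.Series([1, 1, 2, 3, 2, 1],
                       index=["A", "B", "C", "D", "E", "F"],
                       name="cluster")

# Cluster sizes.
sizes = clustering.value_counts().sort_index()
print(sizes)

# IDs belonging to a given cluster.
members_of_1 = clustering[clustering == 1].index.tolist()
print(members_of_1)   # ['A', 'B', 'F']
```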

Supplementary Details

The program uses the following process to optimize the initial medoids based on weights:

  1. Based on the user-provided input, cut_tree is used to divide the data into clusters, resulting in an initial clustering result.

  2. For each data point within a cluster, the weighted sum of distances to all other points in that cluster is calculated:

     $D(x) = \sum_{i \in N} w_i \, \mathrm{diss}(x_i, x)$

  3. The point with the smallest total distance is selected as the medoid of that cluster.

  4. This process is repeated for all clusters to determine their medoids.
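The per-cluster medoid selection above can be sketched with plain NumPy (a toy example under the stated formula, not the library's internal code):

```python
import numpy as np

def weighted_medoid(diss, members, weights):
    """Return the member whose weighted sum of distances to the
    other members of the same cluster is smallest."""
    sub = diss[np.ix_(members, members)]   # intra-cluster distances
    w = weights[members]
    totals = sub @ w                       # D(x) = sum_i w_i * diss(x_i, x)
    return int(members[int(np.argmin(totals))])

# Toy symmetric dissimilarity matrix for 5 points.
diss = np.array([[0, 1, 4, 7, 8],
                 [1, 0, 3, 6, 7],
                 [4, 3, 0, 2, 3],
                 [7, 6, 2, 0, 1],
                 [8, 7, 3, 1, 0]], dtype=float)
weights = np.ones(5)

# For the cluster {2, 3, 4}, point 3 has the smallest weighted distance sum.
medoid = weighted_medoid(diss, np.array([2, 3, 4]), weights)
print(medoid)   # -> 3
```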


In the actual execution of the algorithm, “better” refers to the following:

  1. Iterate through all data points, attempting to treat each one as a potential new medoid;

  2. Compare the new candidate medoid with the current cluster medoid: if replacing the old medoid with the new one results in a lower cost, the replacement is made.

     $\mathrm{Loss}(x) = \sum_{i \in C_j} w_i \left( \mathrm{diss}(i, x) - \mathrm{diss}(i, M_j) \right)$

Here, $x$ denotes the new candidate medoid and $M_j$ the current medoid of cluster $C_j$.

  3. If Loss < 0, the cost is negative, indicating that the weighted intra-cluster distance has decreased compared to the previous iteration. If Loss > 0, the cost is positive, indicating that it has increased. The search stops when the loss becomes greater than 0, and the loop terminates.
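The swap test above can be sketched in a few lines of NumPy (again a toy illustration of the Loss formula, not the actual implementation):

```python
import numpy as np

def swap_loss(diss, members, weights, old_medoid, candidate):
    """Change in weighted intra-cluster cost if `candidate` replaces
    `old_medoid` as the cluster's medoid (negative = improvement)."""
    w = weights[members]
    return float(np.sum(w * (diss[members, candidate]
                             - diss[members, old_medoid])))

# Toy symmetric dissimilarity matrix for 5 points.
diss = np.array([[0, 1, 4, 7, 8],
                 [1, 0, 3, 6, 7],
                 [4, 3, 0, 2, 3],
                 [7, 6, 2, 0, 1],
                 [8, 7, 3, 1, 0]], dtype=float)
weights = np.ones(5)
members = np.array([0, 1, 2])   # cluster C_j = {0, 1, 2}

# Replacing medoid 0 with candidate 1 lowers the cost, so the swap is made.
loss = swap_loss(diss, members, weights, old_medoid=0, candidate=1)
print(loss)   # -> -1.0 (negative => swap accepted)
```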

Examples

The following is a usage example. The dataset comes from Gapminder and contains CO₂ emissions data for 194 countries from 1800 to 2022. This dataset is built into Sequenzo. For more details, click here

We perform clustering on the dataset and visualize the output using t-SNE to provide a more intuitive view of the clustering results.

python
from sequenzo import *
from sequenzo.dissimilarity_measures import get_distance_matrix
from sequenzo.clustering.KMedoids import KMedoids

df = load_dataset('country_co2_emissions')

time = list(df.columns)[1:]
states = ['Very Low', 'Low', 'Middle', 'High', 'Very High']
sequence_data = SequenceData(df, time=time, id_col="country", states=states)
om = get_distance_matrix(sequence_data, method="OM", sm="TRATE", indel="auto")

centroid_indices = [0, 50, 100, 150, 190]
n_pass = 10

# Example 1: KMedoids algorithm without specifying the center point
clustering = KMedoids(diss=om,
                      k=5,
                      method='KMedoids',
                      npass=n_pass,
                      weights=None)

# Example 2: PAM algorithm with a specified center point
clustering = KMedoids(diss=om,
                      k=5,
                      method='PAM',
                      initialclust=centroid_indices,
                      npass=n_pass,
                      weights=None)

# Example 3: PAMonce algorithm with default parameters
clustering = KMedoids(diss=om,
                      k=5,
                      method='PAMonce',
                      npass=n_pass,
                      weights=None)

Output

  1. Example 1:
python
[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 193
[>] Number of time points: 223
[>] Min/Max sequence length: 223 / 223
[>] States: ['Very Low', 'Low', 'Middle', 'High', 'Very High']
[>] Labels: ['Very Low', 'Low', 'Middle', 'High', 'Very High']
[>] Processing 193 sequences with 5 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High]
[>] Indel cost generated.

[>] Identified 175 unique sequences.
[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.
[>] Starting KMedoids...
[>] Computed successfully.
  2. Example 2:
python
[>] Starting Partitioning Around Medoids (PAM)...
  - PAM loop over pass number  1
[>] Computed successfully.
  3. Example 3:
python
[>] Starting Partitioning Around Medoids with a Once-Only Swap Pass (PAMonce)...
[>] Computed successfully.

Authors

Code: Xinyi Li, Cheng Deng

Documentation: Xinyi Li, Sizhu Qu

Edited by: Yuqi Liang

Translation and testing by: Sizhu Qu

Released under the BSD-3-Clause License.