`ClusterQuality()`: Choose the number of clusters `k` based on cluster quality indicators

Hierarchical clustering is rarely about producing “the one true tree.” The key practical question is: where should we cut the tree, i.e., how many clusters k should we retain?

ClusterQuality() provides a systematic way to answer this. It computes a panel of widely used cluster quality indicators (CQIs), such as silhouette scores (ASW and ASWw), homogeneity (HG), point-biserial correlation (PBC), pseudo R², pseudo Calinski–Harabasz (CH) and others, and helps you compare their suggestions side by side.

With these tools, you can move beyond intuition or visual inspection of dendrograms and make a more evidence-based choice of k.

Function usage

A minimal example with only the required parameter (sufficient for most use cases):

python

cluster_quality = ClusterQuality(cluster)
cluster_quality.compute_cluster_quality_scores()
cluster_quality.plot_cqi_scores()

A complete example with all available parameters (for advanced customization):

python

from sequenzo.clustering.hierarchical_clustering import Cluster, ClusterQuality

# Step 1: Fit a hierarchical cluster model
cluster = Cluster(
  matrix=distance_matrix,
  entity_ids=ids,
  clustering_method="ward"
)

# Step 2: Evaluate cluster quality
cluster_quality = ClusterQuality(
  matrix_or_cluster=cluster,   # or a square-form matrix directly
  max_clusters=20,             # evaluate up to k=20
  clustering_method="ward"     # only used if passing a matrix directly
)

# Step 3: Compute, inspect, and visualize CQIs
cluster_quality.compute_cluster_quality_scores()
table = cluster_quality.get_cqi_table()
cluster_quality.plot_cqi_scores(
  metrics_list=["ASW", "PBC", "CH"],   # optional: specify which metrics to plot
  norm="zscore",                       # z-score, range, or none
  save_as="quality.png",
  style="whitegrid"
)

Entry parameters

Parameter	Required	Type	Description
`matrix_or_cluster`	✓	Cluster or array/DataFrame	Either a `Cluster` instance (highly recommended) or an `n×n` square-form distance matrix.
`max_clusters`	✗	int	Largest `k` to evaluate (loops from `2` to `k`). Default 20.
`clustering_method`	✗	str or None	Only used when you pass a matrix directly. If None, inherit `clustering_method` from the `Cluster` instance.

What it does

Accepts either a Cluster object or a full square distance matrix.
- If you pass a Cluster (which is highly recommended), it pulls full_matrix, clustering_method, and the precomputed linkage_matrix directly.
- If you pass a matrix/DataFrame, it stores it as self.matrix and sets self.clustering_method (default "ward").
Validates that the distance matrix is square.
Prepares the input so that multiple cluster quality indicators (CQIs) can be computed and compared directly.

Returned object

A ClusterQuality instance with:

matrix: the validated full square distance matrix (NumPy array).
clustering_method: a Python str (string) indicating the linkage method.
linkage_matrix: the hierarchical linkage (present when constructed from Cluster).
max_clusters: maximum k to evaluate.
scores: a dictionary that saves the results of all quality metrics.

For each metric (ASW, PBC, etc.), it keeps a list of values, e.g., one value for k=2, one for k=3, and so on, up to max_clusters.

ASW: how well points fit within their assigned clusters (silhouette score)
ASWw: like ASW, but gives more weight to larger clusters
HG: whether clusters are balanced in size
PBC: correlation between distances and cluster labels
CH: compares separation between clusters vs. compactness within clusters
R2: proportion of overall variation explained by the clustering
HC: consistency of cluster splits in the hierarchy

How to choose the number of clusters (k)?

Prioritize ASW (silhouette) as the main indicator.
If uncertain, also inspect the raw numbers (not standardized) to understand scale. You can do so by (1) checking the legend in a normalized CQI plot as it contains raw mean and standard deviation for each CQI, or (2) setting norm="none" in plot_cqi_scores() to have a purely raw-number-based plot in details.
When you read CQI tables or plots to choose the optimal number of clusters, don’t put too much weight on those numbers.
Anchor decisions in your research questions and theories. It is very important to visualize several plausible clusters (e.g., compare how clusters look when k = 3, 4, 5 if you think that these three options are plausible) using the state distribution plot or index plot.

CQIs are guides — not the final verdict.

Details of each metric are explained in the separate CQI tutorial.

Function method 1: `compute_cluster_quality_scores()`

Compute all CQIs for k = 2 … max_clusters.

Function usage

python

cluster_quality = ClusterQuality(cluster, max_clusters=20)
cluster_quality.compute_cluster_quality_scores()

What it does

For each k, obtains labels via fcluster(linkage_matrix, k, "maxclust").
Computes and appends each CQI into self.scores.

Returns

None. Instead, the results are stored in the object itself (under self.scores), so you can access them later.

Notes

Requires self.linkage_matrix. This is automatically set when you pass a Cluster to the constructor. This means that you don't need to worry about passing this into ClusterQuality() by youself as long as you have a Cluster() instance as one of the parameters.
If you pass a distance matrix directly, you must ensure self.linkage_matrix exists before calling this method (e.g., by building it elsewhere and assigning it). Otherwise, label extraction cannot proceed.

Function method 2: `get_cqi_table()`

Summarize each metric’s optimal k and its normalized values.

Function usage

python

cluster_quality.compute_cluster_quality_scores()
table = cluster_quality.get_cqi_table()
print(table)

What it does

Keeps a temporary copy of raw scores.
Computes per-metric z-score and min-max normalizations (without overwriting raw values).
For each metric, looks across k = 2, 3, …, max_clusters and picks the k that gives the optimal score.
Returns a tidy table:

Column	Meaning
`Metric`	Metric name (ASW, ASWw, HG, PBC, CH, R2, HC).
`Opt. Clusters`	`k` that maximizes the raw statistic for that metric.
`Opt. Value`	The raw optimal value at that cluster `k`.
`Z-Score Norm.`	The z-score at that `k` (computed across the metric’s full `k` range). It is more frequently used than `Min-Max Norm`.
`Min-Max Norm.`	The [0,1] range-normalized value at that `k`.

Returns

pandas.DataFrame

Function method 3: `plot_cqi_scores(...)`

Plot multiple CQIs across k on the same chart, with normalized y-values but legend showing raw mean/std for context.

Function usage

python

fig = cluster_quality.plot_cqi_scores(
    metrics_list=None,   # or ["ASW", "PBC", "CH"]
    norm="zscore",       # "zscore" | "range" | "none"
    palette="husl",
    line_width=2,
    style="whitegrid",
    title=None,
    xlabel="Number of Clusters",
    ylabel="Normalized Score",
    grid=True,
    save_as=None,        # e.g., "quality.png"
    dpi=200,
    figsize=(12, 8),
    show=True
)

Entry parameters

Parameter	Required	Type	Description
`metrics_list`	✗	list[str] or None	Which metrics to plot. Default = all metrics present in `scores` (e.g., `["ASW","PBC","CH","R2","ASWw","HG","HC"]`).
`norm`	✗	str	Normalization applied to plotted values: `"zscore"` or `"range"` rescales lines; `"none"` plots raw values. Default = `"zscore"`.
`palette`	✗	str	Seaborn palette name used to color lines (e.g., `"husl"`, `"tab10"`, `"deep"`). Default = `"husl"`.
`line_width`	✗	int or float	Sets the stroke width of each metric line in the plot. Default = `2`.
`style`	✗	str	Seaborn style theme (`"whitegrid"`, `"darkgrid"`, `"white"`, `"dark"`, `"ticks"`). Default = `"whitegrid"`.
`title`	✗	str or None	Figure title. Default = `"Cluster Quality Metrics"`.
`xlabel`	✗	str	X-axis label. Default = `"Number of Clusters"`.
`ylabel`	✗	str	Y-axis label. Default = `"Normalized Score"`.
`grid`	✗	bool	Show grid lines on the axes. Overrides the style’s default grid behavior. Default = `True`.
`save_as`	✗	str or None	File path to save the figure (e.g., `"quality.png"`). If `None`, the plot is not saved.
`dpi`	✗	int	Resolution used when saving to file. Default = `200`.
`figsize`	✗	tuple(float,float)	Figure size in inches. Default = `(12, 8)`.
`show`	✗	bool	Whether to display the figure. If saving only, set `show=False`. Default = `True`.

Notes

The legend shows raw mean/std for each metric (computed before normalization), so readers keep scale intuition even when norm is applied.
If metrics_list is None, the method plots every metric found in self.scores.
grid takes precedence over the grid behavior implied by style.

What it does

Computes raw per-metric mean/std from unnormalized scores and uses them in the legend (so readers retain scale intuition).
Optionally standardizes each CQI’s values across different k before plotting to make visual comparison easier.
Produces a single figure and optionally writes it to disk.

Returns

The Matplotlib figure object.

Examples

1) Pick `k` with a `Cluster` object

python

cluster = Cluster(distance_matrix, ids, "ward")

cluster_quality = ClusterQuality(cluster, max_clusters=20)

cluster_quality.compute_cluster_quality_scores()
print(cluster_quality.get_cqi_table())

cluster_quality.plot_cqi_scores(
    metrics_list=["ASW", "PBC", "CH"],    #we only selected three metrics here.
    norm="zscore", 
    save_as="cqi.png"
)

2) Compare several normalizations

python

cluster_quality.plot_cqi_scores(norm="zscore", title="CQIs (z-score)")
cluster_quality.plot_cqi_scores(norm="range",  title="CQIs (min–max)")
cluster_quality.plot_cqi_scores(norm="none",   title="CQIs (raw)")

3) Matrix-only workflow (rarely used)

python

# If you already computed a linkage elsewhere:
cluster_quality = ClusterQuality(
    distance_matrix, 
    max_clusters=12, 
    clustering_method="ward"
)

cluster_quality.linkage_matrix = precomputed_linkage  # You must provide this

cluster_quality.compute_cluster_quality_scores()

cluster_quality.plot_cqi_scores(save_as="quality_avg.png")

Notes and warnings

Always prefer constructing ClusterQuality from a Cluster instance. It guarantees linkage_matrix is present.
If you pass a matrix directly, compute_cluster_quality_scores() requires self.linkage_matrix. Provide it before calling.
Silhouette/PBC assume smaller distances = greater similarity. Ensure your distance measure follows this convention.
Pseudo CH and pseudo R² are distance-based approximations; use them comparatively across k rather than as absolute benchmarks.
For very large n in data, computing labels for many k can be time-consuming. Consider narrowing max_clusters or evaluating a subset of k if needed.

Authors

Code: Yuqi Liang

Documentation: Yuqi Liang

Edited by: Yuqi Liang, Sizhu Qu

ClusterQuality(): Choose the number of clusters k based on cluster quality indicators ​

Function usage ​

Entry parameters ​

What it does ​

Returned object ​

Function method 1: compute_cluster_quality_scores() ​

Function usage ​

What it does ​

Returns ​

Notes ​

Function method 2: get_cqi_table() ​

Function usage ​

What it does ​

Returns ​

Function method 3: plot_cqi_scores(...) ​

Function usage ​

Entry parameters ​

What it does ​

Returns ​

Examples ​

1) Pick k with a Cluster object ​

2) Compare several normalizations ​

3) Matrix-only workflow (rarely used) ​

Notes and warnings ​

Authors ​

`ClusterQuality()`: Choose the number of clusters `k` based on cluster quality indicators

Function usage

Entry parameters

What it does

Returned object

Function method 1: `compute_cluster_quality_scores()`

Function usage

What it does

Returns

Notes

Function method 2: `get_cqi_table()`

Function usage

What it does

Returns

Function method 3: `plot_cqi_scores(...)`

Function usage

Entry parameters

What it does

Returns

Examples

1) Pick `k` with a `Cluster` object

2) Compare several normalizations

3) Matrix-only workflow (rarely used)

Notes and warnings

Authors