
pseudoclass_regression()

pseudoclass_regression() implements Helske-style pseudoclass assignment: draw M hard cluster labels from a membership matrix, fit a regression in each replication, and combine coefficients with Rubin's multiple-imputation rules.

Function Usage

```python
pseudoclass_regression(
    y,
    U,
    *,
    X_fixed=None,
    M=20,
    reference=0,
    random_state=None,
    model_type="ols",
    add_intercept=True,
)
```

R / Literature Parameter Mapping

| Sequenzo | R / packages | Notes |
| --- | --- | --- |
| Membership draws from U | Helske et al. (2024) pseudoclass step | Categorical assignment per replication |
| Rubin combination | Rubin (2004) multiple imputation pooling | beta_combined, se_combined, cov_combined |
| model_type="ols" | stats::lm | Continuous outcome |
| model_type="logit" | stats::glm(..., family=binomial) | Binary outcome |

Entry Parameters

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| y | Yes | ndarray | Outcome vector of length n. |
| U | Yes | ndarray | Membership matrix (n, K) with rows summing to 1. Requires K >= 2. |
| X_fixed | No | ndarray / None | Optional fixed covariates. May be 1D (n,) or 2D (n, p); a 1D array is reshaped to (n, 1). Appended before cluster dummies. |
| M | No | int | Number of pseudoclass replications. Default 20. Must be >= 1. |
| reference | No | int | Reference cluster index (0-based) omitted when building dummies. |
| random_state | No | int / None | Seed for numpy.random.Generator when drawing cluster labels. |
| model_type | No | "ols" / "logit" | Regression model. Default "ols". |
| add_intercept | No | bool | If True, prepend an intercept via statsmodels.add_constant. Default True. |
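
To make the roles of U, reference, X_fixed, and add_intercept concrete, the following is a minimal NumPy sketch of how the design matrix for a single replication could be assembled. It is not Sequenzo's implementation: the helper names draw_labels and build_design are hypothetical, and inverse-CDF sampling is only one possible way to draw categorical labels from the rows of U.

```python
import numpy as np


def draw_labels(U, rng):
    """Draw one hard cluster label per row of U (hypothetical helper)."""
    U = np.asarray(U, dtype=float)
    cum = np.cumsum(U, axis=1)                 # row-wise cumulative membership
    r = rng.random(U.shape[0])                 # one uniform draw per row
    labels = (r[:, None] > cum).sum(axis=1)    # inverse-CDF lookup
    # Guard against rows whose cumulative sum falls just short of 1.
    return np.minimum(labels, U.shape[1] - 1)


def build_design(labels, K, reference=0, X_fixed=None, add_intercept=True):
    """Stack [intercept, X_fixed, cluster dummies] (hypothetical helper)."""
    n = labels.shape[0]
    # Every cluster except the reference gets its own 0/1 column.
    keep = np.array([k for k in range(K) if k != reference])
    dummies = (labels[:, None] == keep[None, :]).astype(float)
    blocks = []
    if add_intercept:
        blocks.append(np.ones((n, 1)))
    if X_fixed is not None:
        X_fixed = np.asarray(X_fixed, dtype=float)
        if X_fixed.ndim == 1:                  # 1D covariate -> (n, 1)
            X_fixed = X_fixed.reshape(-1, 1)
        blocks.append(X_fixed)
    blocks.append(dummies)
    return np.hstack(blocks)


rng = np.random.default_rng(42)
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.2, 0.5, 0.3]])
labels = draw_labels(U, rng)
X = build_design(labels, K=3, reference=0, X_fixed=np.arange(4.0))
print(labels)
print(X.shape)
```

Rows of U that put all mass on one cluster map to that cluster in every draw, while the last row can change label between replications. That variation is what the between-replication term in the pooled covariance picks up.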

What It Returns

A dict with keys:

| Key | Type | Description |
| --- | --- | --- |
| beta_combined | ndarray | Pooled coefficient vector (Rubin rules). |
| se_combined | ndarray | Pooled standard errors (sqrt(diag(cov_combined))). |
| cov_combined | ndarray | Pooled covariance matrix T = W + (1 + 1/m_eff) B, where W is the average within-replication covariance and B is the between-replication covariance of the coefficient estimates. |
| beta_list | list | Coefficient vector from each successful replication. |
| m_eff | int | Number of successful fits. |
| failed | int | M - m_eff replications skipped due to rank-deficient design matrices, logit non-convergence, perfect separation, or other model-fitting errors. |

Replications that fail are skipped and counted in failed. If every replication fails, a RuntimeError is raised.
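
The pooling formula T = W + (1 + 1/m_eff) B can be sketched in a few lines of NumPy. This is an illustrative version, not the library's code: pool_rubin is a hypothetical name, B is computed with the usual m - 1 denominator, and setting B to zero when only one replication succeeds is an assumption of this sketch.

```python
import numpy as np


def pool_rubin(betas, covs):
    """Pool per-replication estimates with Rubin's rules (illustrative)."""
    betas = np.asarray(betas, dtype=float)   # shape (m, p)
    covs = np.asarray(covs, dtype=float)     # shape (m, p, p)
    m = betas.shape[0]
    beta_bar = betas.mean(axis=0)            # pooled point estimate
    W = covs.mean(axis=0)                    # average within-replication covariance
    if m > 1:
        dev = betas - beta_bar
        B = dev.T @ dev / (m - 1)            # between-replication covariance
    else:
        # Assumption: a single fit carries no between-replication spread.
        B = np.zeros_like(W)
    T = W + (1.0 + 1.0 / m) * B              # total covariance
    return beta_bar, np.sqrt(np.diag(T)), T


# Two toy replications with identity covariance matrices.
betas = [[1.0, 2.0], [3.0, 4.0]]
covs = [np.eye(2), np.eye(2)]
beta_combined, se_combined, cov_combined = pool_rubin(betas, covs)
print(beta_combined)
print(se_combined)
print(cov_combined)
```

In this toy case the coefficients differ by 2 across replications, so B contributes more to the pooled variance than W does. When B dominates like this, the pseudoclass draws disagree strongly about the cluster effects.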

Example

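```python
from sequenzo import fanny_membership, pseudoclass_regression

# diss, income, and controls are assumed to be defined earlier
# (e.g. a dissimilarity matrix, an outcome vector, and a covariate array).
U, _ = fanny_membership(diss, k=5, m=1.4)

result = pseudoclass_regression(
    y=income,
    U=U,
    X_fixed=controls,
    M=20,
    reference=0,
    model_type="ols",
    random_state=42,
)

# Successful versus skipped replications
print(result["m_eff"], result["failed"])
# Pooled coefficients and their standard errors
print(result["beta_combined"])
print(result["se_combined"])
```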
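
One possible way to read the pooled output is a Wald-style summary built from beta_combined and se_combined. The sketch below is an assumption-laden illustration: wald_summary is a hypothetical helper, the input numbers are placeholders rather than real results, and it uses a normal approximation (critical value 1.96). Rubin's rules are usually paired with a t reference distribution, so intervals from this sketch may be somewhat too narrow when M is small.

```python
import numpy as np


def wald_summary(beta, se, z_crit=1.96):
    """Return columns [beta, se, z, lower, upper] (hypothetical helper)."""
    beta = np.asarray(beta, dtype=float)
    se = np.asarray(se, dtype=float)
    z = beta / se                      # Wald statistic per coefficient
    lower = beta - z_crit * se         # approximate 95% interval bounds
    upper = beta + z_crit * se
    return np.column_stack([beta, se, z, lower, upper])


# Placeholder values standing in for result["beta_combined"] and
# result["se_combined"]; they are not output from a real model.
beta = np.array([2.0, -0.5])
se = np.array([0.5, 0.25])
table = wald_summary(beta, se)
print(table)
```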

R Counterpart

  • Closest R workflow: manual pseudoclass draws + separate models + Rubin pooling.
  • Mapping note: Not exported by WeightedCluster or TraMineR. Requires Python statsmodels.

Notes

  • Dependency: statsmodels must be installed.
  • U must have at least two cluster columns (K >= 2).
  • y and U must have the same number of rows; X_fixed must match if provided.
  • Helske et al. (2024) report that pseudoclass assignment often underperforms soft classification and representativeness in their simulations; treat it as a sensitivity check rather than the default.
  • In each replication, cluster dummies are built directly from the drawn labels, with the reference column omitted, rather than via cluster_labels_to_dummies().

Authors

Code: Yuqi Liang

Documentation: Yuqi Liang

References

Helske, S., Helske, J., & Chihaya, G. K. (2024). From sequences to variables: Rethinking the relationship between sequences and outcomes. Sociological Methodology, 54(1), 27–51.

Rubin, D. B. (2004). Multiple Imputation for Nonresponse in Surveys. Wiley.