pseudoclass_regression()
pseudoclass_regression() implements Helske-style pseudoclass assignment: draw M hard cluster labels from a membership matrix, fit a regression in each replication, and combine coefficients with Rubin's multiple-imputation rules.
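The label-drawing step can be illustrated with a small NumPy sketch. This is not the library's internal code: the helper name `draw_pseudoclass_labels` and the inverse-CDF sampling approach are assumptions made for illustration. Each row of `U` is treated as a categorical distribution over the K clusters.

```python
import numpy as np


def draw_pseudoclass_labels(U, M=20, random_state=None):
    """Hypothetical helper: draw M hard-label vectors from membership matrix U.

    Returns an integer array of shape (M, n) with 0-based cluster labels.
    """
    U = np.asarray(U, dtype=float)
    if U.ndim != 2 or U.shape[1] < 2:
        raise ValueError("U must be a 2D array with K >= 2 columns")
    if not np.allclose(U.sum(axis=1), 1.0):
        raise ValueError("rows of U must sum to 1")

    rng = np.random.default_rng(random_state)
    n, K = U.shape
    cum = np.cumsum(U, axis=1)              # (n, K) cumulative probabilities
    u = rng.random((M, n, 1))               # one uniform draw per unit and replication
    # Inverse-CDF sampling: the label is the number of thresholds at or below u.
    labels = (u >= cum[None, :, :]).sum(axis=2)
    # Guard against floating-point error in the last cumulative sum.
    return np.minimum(labels, K - 1)
```

Passing the same `random_state` reproduces the same draws, which mirrors the role of the `random_state` argument described in the parameter table.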
Function Usage
```python
pseudoclass_regression(
    y,
    U,
    *,
    X_fixed=None,
    M=20,
    reference=0,
    random_state=None,
    model_type="ols",
    add_intercept=True,
)
```
R / Literature Parameter Mapping
| Sequenzo | R / packages | Notes |
|---|---|---|
| Membership draws from U | Helske et al. (2024) pseudoclass step | Categorical assignment per replication |
| Rubin combination | Rubin (2004) multiple imputation pooling | beta_combined, se_combined, cov_combined |
| model_type="ols" | stats::lm | Continuous outcome |
| model_type="logit" | stats::glm(..., family=binomial) | Binary outcome |
Entry Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
| y | ✓ | ndarray | Outcome vector of length n. |
| U | ✓ | ndarray | Membership matrix of shape (n, K) with rows summing to 1. Requires K >= 2. |
| X_fixed | ✗ | ndarray / None | Optional fixed covariates. May be 1D (n,) or 2D (n, p); a 1D array is reshaped to (n, 1). Placed before the cluster dummies in the design matrix. |
| M | ✗ | int | Number of pseudoclass replications. Default 20. Must be >= 1. |
| reference | ✗ | int | Reference cluster index (0-based) omitted when building dummies. Default 0. |
| random_state | ✗ | int / None | Seed for numpy.random.Generator when drawing cluster labels. Default None. |
| model_type | ✗ | "ols" / "logit" | Regression model. Default "ols". |
| add_intercept | ✗ | bool | If True, prepend an intercept via statsmodels.api.add_constant. Default True. |
What It Returns
A dict with keys:
| Key | Type | Description |
|---|---|---|
| beta_combined | ndarray | Pooled coefficient vector (Rubin rules). |
| se_combined | ndarray | Pooled standard errors, sqrt(diag(cov_combined)). |
| cov_combined | ndarray | Pooled covariance matrix T = W + (1 + 1/m_eff) B, where W is the average within-replication covariance and B is the between-replication covariance of the coefficient estimates. |
| beta_list | list | Coefficient vector from each successful replication. |
| m_eff | int | Number of successful fits. |
| failed | int | Number of replications skipped (M - m_eff) because of rank-deficient design matrices, logit non-convergence, perfect separation, or other model-fitting errors. |
Replications that fail are skipped and counted in failed. If every replication fails, a RuntimeError is raised.
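The pooling formula in the table above can be sketched in a few lines of NumPy. The function name `rubin_pool` and the handling of the single-replication case (setting B to zero) are assumptions for illustration, not a description of the library's internals.

```python
import numpy as np


def rubin_pool(beta_list, cov_list):
    """Hypothetical helper: pool per-replication estimates with Rubin's rules."""
    m_eff = len(beta_list)
    if m_eff == 0:
        # Mirrors the documented behaviour when every replication fails.
        raise RuntimeError("all replications failed; nothing to pool")

    betas = np.vstack(beta_list)                 # (m_eff, p)
    beta_combined = betas.mean(axis=0)           # pooled point estimate
    W = np.mean(np.stack(cov_list), axis=0)      # average within-replication covariance

    if m_eff > 1:
        dev = betas - beta_combined
        B = dev.T @ dev / (m_eff - 1)            # between-replication covariance
    else:
        B = np.zeros_like(W)                     # assumption: no between term with one fit

    cov_combined = W + (1.0 + 1.0 / m_eff) * B   # T = W + (1 + 1/m_eff) B
    se_combined = np.sqrt(np.diag(cov_combined))
    return {
        "beta_combined": beta_combined,
        "se_combined": se_combined,
        "cov_combined": cov_combined,
        "m_eff": m_eff,
    }
```

The between-replication term B is what carries the uncertainty of the cluster assignment into the pooled standard errors; with crisp memberships every draw is identical and B vanishes.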
Example
```python
from sequenzo import fanny_membership, pseudoclass_regression

# Assumed to exist already: diss (dissimilarity matrix),
# income (outcome vector of length n), controls (covariates with n rows).
U, _ = fanny_membership(diss, k=5, m=1.4)

result = pseudoclass_regression(
    y=income,
    U=U,
    X_fixed=controls,
    M=20,
    reference=0,
    model_type="ols",
    random_state=42,
)

print(result["m_eff"], result["failed"])
print(result["beta_combined"])
print(result["se_combined"])
```
R Counterpart
- Closest R workflow: manual pseudoclass draws + separate models + Rubin pooling.
- Mapping note: Not exported by WeightedCluster or TraMineR. Requires Python statsmodels.
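For the "separate models" step of the manual workflow, a single OLS replication might be fitted as in the sketch below. It uses plain NumPy rather than statsmodels so that it stays self-contained; the helper name `fit_ols_replication` and the convention of returning `None` for a skipped replication are assumptions for illustration.

```python
import numpy as np


def fit_ols_replication(X, y):
    """Hypothetical helper: fit OLS for one replication.

    Returns (beta, cov), or None when the design matrix is rank-deficient
    or leaves no residual degrees of freedom, so the caller can count a failure.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    if n <= p or np.linalg.matrix_rank(X) < p:
        return None                               # skip this replication

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)         # classical OLS covariance
    return beta, cov
```

A rank-deficient design typically arises when a drawn label vector leaves one non-reference cluster empty, which produces an all-zero dummy column.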
Notes
- Dependency: statsmodels must be installed.
- U must have at least two cluster columns (K >= 2).
- y and U must have the same number of rows; X_fixed must match if provided.
- Helske et al. (2024) report that pseudoclass assignment often underperforms soft classification and representativeness in their simulations; treat it as a sensitivity check rather than the default.
- In each replication, cluster dummies are built directly from the drawn labels with the reference column omitted, not via cluster_labels_to_dummies().
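The design-matrix construction described in the last note can be sketched as follows. The helper name `build_design` and the column order (intercept, then fixed covariates, then cluster dummies) are assumptions inferred from the parameter descriptions, not taken from the library source.

```python
import numpy as np


def build_design(labels, K, reference=0, X_fixed=None, add_intercept=True):
    """Hypothetical helper: design matrix for one replication's drawn labels."""
    labels = np.asarray(labels)
    n = labels.shape[0]

    # One dummy column per cluster except the reference cluster.
    keep = np.array([k for k in range(K) if k != reference])
    dummies = (labels[:, None] == keep[None, :]).astype(float)

    blocks = []
    if add_intercept:
        blocks.append(np.ones((n, 1)))
    if X_fixed is not None:
        X_fixed = np.asarray(X_fixed, dtype=float)
        if X_fixed.ndim == 1:
            X_fixed = X_fixed.reshape(-1, 1)      # 1D covariate becomes (n, 1)
        if X_fixed.shape[0] != n:
            raise ValueError("X_fixed must have the same number of rows as the labels")
        blocks.append(X_fixed)
    blocks.append(dummies)
    return np.hstack(blocks)
```

With this layout, each dummy coefficient is read as the difference between that cluster and the reference cluster, holding the fixed covariates constant.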
Authors
Code: Yuqi Liang
Documentation: Yuqi Liang
References
Helske, S., Helske, J., & Chihaya, G. K. (2024). From sequences to variables: Rethinking the relationship between sequences and outcomes. Sociological Methodology, 54(1), 27–51.
Rubin, D. B. (2004). Multiple Imputation for Nonresponse in Surveys. Wiley.