run_feature_extraction_and_selection_pipeline()

run_feature_extraction_and_selection_pipeline() runs the full feature extraction and selection (FES) workflow: spell-based feature extraction, optional residualization of the outcome on controls, Boruta selection, and an optional exploratory final model.

Function Usage

```python
run_feature_extraction_and_selection_pipeline(
    seqdata,
    outcome,
    *,
    controls=None,
    sample_weights=None,
    state_groups=None,
    problem_type=None,
    config=None,
    preset=None,
    ids=None,
    fit_final_model=None,
    verbose=True,
)
```

get_feature_extraction_and_selection_config_preset()

```python
get_feature_extraction_and_selection_config_preset(preset)
```

Returns a frozen FeatureExtractionAndSelectionConfig for a named settings bundle. Currently supported: "unterlerchner2023" (Unterlerchner et al. 2023 defaults).
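
Because the returned config is frozen, a customized variant has to be built as a new object rather than edited in place. The sketch below illustrates that pattern with a hypothetical stand-in class, ConfigSketch, which mirrors three documented fields. It assumes the real FeatureExtractionAndSelectionConfig behaves like a frozen dataclass, which this page does not confirm.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


# Hypothetical stand-in, NOT the real Sequenzo class: it only mirrors
# three documented fields and their documented defaults.
@dataclass(frozen=True)
class ConfigSketch:
    end_time_mode: str = "last_observed"
    boruta_n_iter: int = 50
    residualize_target_with_controls: bool = True


# Stand-in for a preset-like object; "exit_time" is the end_time_mode
# value this page documents for the unterlerchner2023 preset.
preset_like = ConfigSketch(end_time_mode="exit_time")

# A frozen instance rejects attribute assignment...
try:
    preset_like.boruta_n_iter = 100
    mutated = True
except FrozenInstanceError:
    mutated = False

# ...so a modified copy is derived instead, leaving the original untouched.
custom = replace(preset_like, residualize_target_with_controls=False)
```

If the real class is a dataclass, the same replace() call would apply to the object returned by the preset getter, and the result could then be passed as config.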

R / Literature Parameter Mapping

| Sequenzo | R / packages |
| --- | --- |
| Feature extraction step | WeightedCluster::seqpropclust(..., prop.only=TRUE) |
| Boruta step | Boruta::Boruta(residuals(confounder_model) ~ ., data=features) |
| Optional final model | lm() / glm() on selected features |
| preset="unterlerchner2023" | Unterlerchner et al. (2023) parameterization |
| Residualization | OLS/WLS (regression); binomial deviance residuals (binary classification) |
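
To make the residualization entry concrete for the regression case: the outcome is regressed on the controls, and the residuals (the part of the outcome the controls do not explain) become the target for Boruta. The following is a minimal plain-Python sketch with a single control and an intercept. It is not Sequenzo's implementation, and the numbers are invented for illustration.

```python
from statistics import fmean


def ols_residuals(y, x):
    """Residuals of y after a simple OLS fit on one control x (with intercept)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Toy values, made up for illustration only.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [2.1, 3.9, 6.2, 7.8, 10.0]

# These residuals would play the role of the Boruta target.
residuals = ols_residuals(outcome, control)
```

By construction the residuals sum to zero and are uncorrelated with the control, so any feature Boruta confirms against them is explaining variation the control could not.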

Entry Parameters

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| seqdata | Yes | SequenceData | Input sequences; row count must match outcome. |
| outcome | Yes | array-like | 1D outcome vector (n elements). |
| controls | No | DataFrame / ndarray / None | Control covariates (n rows). Used for residualization and the optional final model. |
| sample_weights | No | array-like / None | 1D weights of length n for WLS/GLM. |
| state_groups | No | dict / None | State grouping for feature builders. |
| problem_type | No | "regression" / "classification" / "auto" / None | If omitted or "auto", numeric outcomes default to regression. |
| config | No | FeatureExtractionAndSelectionConfig / None | Full configuration object. Mutually exclusive with preset. |
| preset | No | str / None | Named settings bundle from a published workflow (e.g. "unterlerchner2023" loads Unterlerchner et al. 2023 defaults). Mutually exclusive with config. |
| ids | No | sequence / None | Index for feature DataFrames in the result. |
| fit_final_model | No | bool / None | Overrides config.fit_final_model. Defaults to False in presets. |
| verbose | No | bool | Print progress messages. |

FeatureExtractionAndSelectionConfig fields

| Field | Default | Description |
| --- | --- | --- |
| sequencing_max_k | 3 | Max subsequence length. |
| sequencing_min_support | 0.05 | Minimum subsequence support. |
| sequencing_top_mined_subsequences | 1000 | Cap on mined subsequences. |
| sequencing_count_method | "presence" | Sequencing count method. |
| sequencing_event_label_mode | "state" | Event label mode. |
| timing_bin_width | 12.0 | Bin width in seqdata.time units. |
| time_unit_hint | "same_as_labels" | Metadata stored in results for reproducibility and self-documentation; does not change bins. |
| timing_include_start | True | Include start timing features. |
| timing_include_end | True | Include end timing features. |
| timing_count_method | "any" | Timing count method. |
| timing_bin_include_left | True | Left-inclusive timing bins. |
| end_time_mode | "last_observed" | "exit_time" in the unterlerchner2023 preset. |
| boruta_n_iter | 50 | Boruta outer iterations. |
| boruta_perc | 100.0 | Boruta percentile threshold. |
| boruta_alpha | 0.01 | Boruta alpha (R pValue analogue). |
| boruta_two_step | False | BorutaPy two-step mode (off for R-style path). |
| residualize_target_with_controls | True | Residualize outcome on controls before Boruta. |
| include_controls_in_final_model | True | Add controls to optional final model design matrix. |
| fit_final_model | False | Fit exploratory OLS/WLS or logistic model after selection. |
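
As an illustration of how timing_bin_width and timing_bin_include_left could interact, the sketch below maps a time point to the edges of its bin. The helper timing_bin is hypothetical, not a Sequenzo function. It assumes bins start at 0 and that turning off left-inclusion yields right-inclusive bins; neither detail is stated on this page.

```python
import math


def timing_bin(t, bin_width=12.0, include_left=True):
    """Return the (lower, upper) edges of the bin containing time t.

    Assumption: bins start at 0 and tile the time axis in steps of
    bin_width. The origin Sequenzo actually uses is not documented here.
    """
    if include_left:
        # Left-inclusive bins: [lower, upper)
        k = math.floor(t / bin_width)
    else:
        # Right-inclusive bins: (lower, upper]
        k = math.ceil(t / bin_width) - 1
    return (k * bin_width, (k + 1) * bin_width)


# A spell starting exactly on a bin edge is where the two conventions differ.
edge_left = timing_bin(12.0)                       # falls in [12, 24)
edge_right = timing_bin(12.0, include_left=False)  # falls in (0, 12]
```

With monthly data and the default width of 12.0, each bin would correspond to one year under these assumptions.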

What It Returns

A dict including:

| Key | Description |
| --- | --- |
| problem_type | Resolved problem type. |
| n | Sample size. |
| time_unit_hint, timing_bin_width, end_time_mode | Extraction settings used. |
| all_feature_names | Full candidate feature names. |
| selected_feature_names, selected_mask, selected_indices | Boruta confirmed features. |
| tentative_feature_names, tentative_mask, tentative_indices | Boruta tentative features. |
| boruta_ranking | Boruta ranking vector (ranking_ from BorutaPy). |
| hit_counts | Reserved for Boruta implementations that expose hit counts; currently None with BorutaPy. |
| shadow_hit_counts | Reserved for Boruta implementations that expose shadow hit counts; currently None with BorutaPy. |
| X_duration, X_timing, X_sequencing, X_full | Feature matrices as DataFrames. |
| X_selected | NumPy array of confirmed features only. |
| fit_final_model, final_model_fitted, final_model_is_exploratory | Whether a final model was requested/fitted. |
| final_model, y_pred, r2, bic | Present if fit_final_model=True and regression. |
| final_model, y_pred, accuracy | Present if fit_final_model=True and classification. |

Raises RuntimeError if Boruta confirms zero features.
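
A short sketch of how the name-based keys might be consumed downstream follows. The helper split_boruta_decisions is hypothetical, and the mock dict uses invented feature names in place of a real pipeline result. It assumes the name keys hold sequences of strings drawn from all_feature_names.

```python
def split_boruta_decisions(result):
    """Group candidate feature names by Boruta decision using the documented keys."""
    confirmed = set(result["selected_feature_names"])
    tentative = set(result["tentative_feature_names"])
    # Anything neither confirmed nor tentative is treated as rejected.
    rejected = [
        name
        for name in result["all_feature_names"]
        if name not in confirmed and name not in tentative
    ]
    return {
        "confirmed": sorted(confirmed),
        "tentative": sorted(tentative),
        "rejected": rejected,
    }


# Mock result with made-up feature names, standing in for a real run.
mock_result = {
    "all_feature_names": ["dur_A", "dur_B", "start_A_0", "seq_A>B"],
    "selected_feature_names": ["dur_A"],
    "tentative_feature_names": ["seq_A>B"],
}
decisions = split_boruta_decisions(mock_result)
```

Since the pipeline raises RuntimeError when nothing is confirmed, a caller that wants to continue in that case can wrap the pipeline call in try/except RuntimeError before reaching a helper like this one.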

Example

Unterlerchner et al. (2023) style

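```python
from sequenzo import run_feature_extraction_and_selection_pipeline

# seqdata (SequenceData), outcome, and controls are assumed to be prepared beforehand.
result = run_feature_extraction_and_selection_pipeline(
    seqdata=seqdata,
    outcome=outcome,
    controls=controls,
    preset="unterlerchner2023",
)

# Names of the features Boruta confirmed
print(result["selected_feature_names"])
```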

Optional exploratory final model

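```python
from sequenzo import run_feature_extraction_and_selection_pipeline

# fit_final_model=True overrides the preset default (False).
result = run_feature_extraction_and_selection_pipeline(
    seqdata=seqdata,
    outcome=outcome,
    controls=controls,
    preset="unterlerchner2023",
    fit_final_model=True,
)

# r2 is present because a final model was fitted on a regression outcome.
print(result["r2"])
```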

Binary outcome with classification residualization

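```python
from sequenzo import run_feature_extraction_and_selection_pipeline

# binary_outcome is assumed to be a 1D vector with two classes.
result = run_feature_extraction_and_selection_pipeline(
    seqdata=seqdata,
    outcome=binary_outcome,
    controls=controls,
    problem_type="classification",
    preset="unterlerchner2023",
)
```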

R Counterpart

  • Closest R workflow: seqpropclust(..., prop.only=TRUE) + Boruta() on residualized outcomes + optional lm() / glm()
  • Mapping note: There is no single R function for this workflow; published papers script these steps individually. The set of features Boruta confirms may differ from R because Sequenzo uses BorutaPy (see Conceptual Guide).
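
For readers unfamiliar with the Boruta step referenced above, its core idea is to compare each real feature's importance against a threshold taken from shuffled "shadow" features. A percentile setting such as boruta_perc picks that threshold, and 100 corresponds to the maximum shadow importance. The sketch below shows the hit rule for a single iteration in plain Python. It is a simplified illustration with made-up importances, not BorutaPy's or R Boruta's code.

```python
def percentile(values, perc):
    """Linear-interpolation percentile of a non-empty list (perc in 0..100)."""
    s = sorted(values)
    pos = (len(s) - 1) * perc / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def boruta_hits(real_importances, shadow_importances, perc=100.0):
    """Mark a hit for every real feature that beats the shadow threshold."""
    threshold = percentile(shadow_importances, perc)
    return {name: imp > threshold for name, imp in real_importances.items()}


# Made-up importances for one iteration, for illustration only.
real = {"dur_A": 0.30, "dur_B": 0.05, "seq_AB": 0.12}
shadow = [0.04, 0.10, 0.08]

# With perc=100 the threshold is the largest shadow importance (0.10).
hits = boruta_hits(real, shadow, perc=100.0)
```

Hits accumulated over many iterations are what a statistical test at level boruta_alpha then evaluates, so small implementation differences in importances or testing can change which features end up confirmed.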

Notes

  • Provide either config or preset, not both.
  • Multi-class classification: set residualize_target_with_controls=False in a custom config.
  • Papers often cluster correlated features before interpreting a final regression. Use cluster_correlated_features() for that step rather than relying on fit_final_model=True alone.
  • Requires the PyPI package boruta, which is installed with pip install sequenzo.
  • With BorutaPy, hit_counts and shadow_hit_counts in the result are None; use boruta_ranking, selected_*, and tentative_* instead.

Authors

Code: Yuqi Liang

Documentation: Yuqi Liang

References

Bolano, D., & Studer, M. (2020). The link between previous life trajectories and a later life outcome: A feature selection approach.

Unterlerchner, L., Studer, M., & Gomensoro, A. (2023). Back to the features. Investigating the relationship between educational pathways and income using sequence analysis and feature extraction and selection approach. Swiss Journal of Sociology, 49(2), 417-446.