
extract_sequence_features()

extract_sequence_features() builds duration, timing, and sequencing feature matrices from SequenceData. It does not run Boruta or fit a predictive model.

Function Usage

```python
extract_sequence_features(
    seqdata,
    *,
    state_groups=None,
    timing_bin_width=12.0,
    time_unit_hint="same_as_labels",
    timing_include_start=True,
    timing_include_end=True,
    timing_count_method="any",
    timing_bin_include_left=True,
    end_time_mode="last_observed",
    sequencing_max_k=3,
    sequencing_min_support=0.05,
    sequencing_top_mined_subsequences=1000,
    sequencing_count_method="presence",
    sequencing_event_label_mode="state",
    sequencing_weighted=False,
    ids=None,
)
```

R / Literature Parameter Mapping

| Sequenzo | R / packages | Notes |
| --- | --- | --- |
| Duration block | `seqistatd()`; `seqpropclust(..., properties="duration")` | Spell-step totals |
| Timing block | Custom bins on spell start/end (Unterlerchner et al., 2023) | No single TraMineR equivalent |
| Sequencing block | `seqecreate` → `seqefsub` → `seqeapplysub`; `properties="pattern"` | DSS spell path |
| Spell conversion | `seqdss()`, `seqdur()`, `seqformat(..., to="SPELL")` | Via `convert_seqdata_to_spells` |
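The spell conversion row can be illustrated with a small standalone sketch. It is not `convert_seqdata_to_spells` itself; `to_spells` is a hypothetical helper that only shows what the DSS (distinct successive states) and the spell durations look like for one toy sequence, assuming time labels start at 1.

```python
from itertools import groupby


def to_spells(states, start_time=1):
    """Collapse a state sequence into (state, start, duration) spells.

    Hypothetical helper for illustration; not part of Sequenzo.
    """
    spells = []
    t = start_time
    for state, run in groupby(states):
        duration = len(list(run))       # length of the run of equal states
        spells.append((state, t, duration))
        t += duration                   # next spell starts where this one ends
    return spells


sequence = ["edu", "edu", "edu", "work", "work", "edu"]
spells = to_spells(sequence)

dss = [state for state, _, _ in spells]      # analogous to seqdss()
durations = [dur for _, _, dur in spells]    # analogous to seqdur()

print(spells)
print(dss)
print(durations)
```

Here the duration block would be built from `durations`, while the sequencing block would work on `dss`.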

Entry Parameters

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| `seqdata` | Yes | SequenceData | Input sequences. |
| `state_groups` | No | dict / None | Map group labels to lists of states. Default: one group per state. |
| `timing_bin_width` | No | float | Bin width in the same unit as `seqdata.time` (e.g. 12.0 for monthly grids, 1.0 for yearly age labels). |
| `time_unit_hint` | No | "month" / "year" / "same_as_labels" | Metadata only; stored in results for reproducibility and self-documentation. Does not change the bins; set `timing_bin_width` explicitly. |
| `timing_include_start` | No | bool | Include `START_*` timing features. |
| `timing_include_end` | No | bool | Include `END_*` timing features. |
| `timing_count_method` | No | str | How to count events per bin (default "any"). |
| `timing_bin_include_left` | No | bool | Left-inclusive bin edges for timing. |
| `end_time_mode` | No | "last_observed" / "exit_time" | How spell end times are defined. |
| `sequencing_max_k` | No | int | Maximum subsequence length to mine. |
| `sequencing_min_support` | No | int / float | Minimum support for mined subsequences. |
| `sequencing_top_mined_subsequences` | No | int / None | Cap on number of mined subsequences (default 1000). |
| `sequencing_count_method` | No | str | Subsequence count method (default "presence"). |
| `sequencing_event_label_mode` | No | str | Event label mode (default "state"). |
| `sequencing_weighted` | No | bool | Weighted mining; currently raises NotImplementedError. |
| `ids` | No | list / None | Row index for output DataFrames. |
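To make the timing parameters more concrete, here is a minimal sketch of how a bin width and a left-inclusive edge rule could map spell start times to bins. It is an illustration under stated assumptions, not Sequenzo's implementation: `bin_index` and `start_indicators` are hypothetical helpers, bins are assumed to start at time label 0, and "any" is assumed to mean a 0/1 indicator per bin.

```python
import math


def bin_index(time_label, bin_width, include_left=True):
    """Zero-based bin for a time label (hypothetical helper).

    include_left=True  -> bins are [a, b)
    include_left=False -> bins are (a, b]
    """
    offset = time_label / bin_width
    if include_left:
        return math.floor(offset)
    return math.ceil(offset) - 1


def start_indicators(start_times, bin_width, n_bins):
    """One row of 0/1 flags: does any spell start fall in each bin?"""
    row = [0] * n_bins
    for t in start_times:
        b = bin_index(t, bin_width)
        if 0 <= b < n_bins:
            row[b] = 1      # assumed "any" semantics: at least one start
    return row


# Monthly grid, 12-unit bins: starts at months 0, 5 and 30.
monthly_row = start_indicators([0, 5, 30], bin_width=12.0, n_bins=3)
print(monthly_row)

# The edge rule only matters for labels that sit exactly on a boundary.
print(bin_index(12, 12.0, include_left=True))
print(bin_index(12, 12.0, include_left=False))
```

With yearly age labels the same logic applies with `bin_width=1.0`, which is why the width has to be set explicitly rather than inferred from `time_unit_hint`.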

What It Returns

A dict with:

| Key | Type | Description |
| --- | --- | --- |
| `time_unit_hint` | str | Echo of the hint argument. |
| `timing_bin_width` | float | Bin width used. |
| `end_time_mode` | str | Spell end mode used. |
| `X_duration` | DataFrame | Duration features only. |
| `X_timing` | DataFrame | Timing features only. |
| `X_sequencing` | DataFrame | Sequencing features only. |
| `X_full` | DataFrame | Horizontally stacked full matrix. |
| `all_feature_names` | list | Column names for `X_full`. |
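The relationship between the three blocks and `X_full` can be sketched with toy DataFrames. The column names below are invented for illustration (only the `START_*` / `END_*` prefixes appear in this page), and the sketch assumes that "horizontally stacked" corresponds to a column-wise concatenation on a shared row index.

```python
import pandas as pd

ids = ["a", "b"]  # plays the role of the ids argument

# Toy stand-ins for the three feature blocks; column names are made up.
X_duration = pd.DataFrame({"DUR_edu": [3, 1], "DUR_work": [2, 4]}, index=ids)
X_timing = pd.DataFrame({"START_work_bin0": [1, 0]}, index=ids)
X_sequencing = pd.DataFrame({"SEQ_edu>work": [1, 1]}, index=ids)

# Column-wise stack: same rows, blocks placed side by side.
X_full = pd.concat([X_duration, X_timing, X_sequencing], axis=1)
all_feature_names = list(X_full.columns)

print(X_full.shape)
print(all_feature_names)
```

Keeping the blocks separate in the result makes it straightforward to fit a model on one block at a time before moving to the full matrix.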

Example

Monthly grid (Unterlerchner-style timing)

```python
from sequenzo import extract_sequence_features

# seqdata is an existing SequenceData object on a monthly time grid.
features = extract_sequence_features(
    seqdata,
    timing_bin_width=12.0,
    time_unit_hint="month",
    timing_include_end=True,
    end_time_mode="exit_time",
    sequencing_max_k=3,
    sequencing_min_support=0.05,
)

print(features["X_full"].shape)
print(features["all_feature_names"][:5])
```

Yearly age labels

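```python
from sequenzo import extract_sequence_features

# seqdata is an existing SequenceData object with yearly age labels.
features = extract_sequence_features(
    seqdata,
    timing_bin_width=1.0,
    time_unit_hint="year",
    timing_include_end=True,
    end_time_mode="exit_time",
)
```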

R Counterpart

  • Closest R bundle: WeightedCluster::seqpropclust(..., prop.only=TRUE)
  • Mapping note: Sequenzo implements duration, timing bins, and sequencing explicitly in Python rather than calling a single R function.

Notes

  • Sequencing is mined on the spell-state sequence (DSS), not the raw panel.
  • timing_bin_width=12.0 means twelve time-label units per bin. That equals twelve calendar months only if your grid is monthly.
  • Weighted sequencing (sequencing_weighted=True) is not supported through this entry point; passing it currently raises NotImplementedError.
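The first note, together with `sequencing_max_k`, `sequencing_min_support`, and the "presence" count method, can be sketched as follows. This is a simplified illustration, not Sequenzo's miner: `mine_subsequences` is a hypothetical function, it treats `min_support` as a share of sequences (the parameter also accepts an int, whose handling is not shown), and it ignores the `sequencing_top_mined_subsequences` cap.

```python
from collections import Counter
from itertools import combinations


def mine_subsequences(dss_sequences, max_k=3, min_support=0.05):
    """Return {subsequence: support share} for ordered subsequences.

    Hypothetical sketch: works on DSS lists, counts presence only.
    """
    counts = Counter()
    for dss in dss_sequences:
        seen = set()
        for k in range(1, max_k + 1):
            # combinations() keeps the original order, so each tuple is an
            # ordered, not necessarily contiguous, subsequence of the DSS.
            seen.update(combinations(dss, k))
        counts.update(seen)     # presence: each sequence counts at most once
    n = len(dss_sequences)
    return {sub: c / n for sub, c in counts.items() if c / n >= min_support}


# DSS input: repeated states are already collapsed into spells.
dss_sequences = [
    ["edu", "work"],
    ["edu", "work", "edu"],
    ["work"],
]
support = mine_subsequences(dss_sequences, max_k=2, min_support=0.5)

for sub, share in sorted(support.items()):
    print(sub, round(share, 3))
```

Because the input is the DSS rather than the raw panel, a long spell in one state contributes a single element, so support reflects the order of transitions and not how long each state lasted.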

Authors

Code: Yuqi Liang

Documentation: Yuqi Liang

References

Bolano, D., & Studer, M. (2020). The link between previous life trajectories and a later life outcome: A feature selection approach.

Unterlerchner, L., Studer, M., & Gomensoro, A. (2023). Back to the features. Investigating the relationship between educational pathways and income using sequence analysis and feature extraction and selection approach. Swiss Journal of Sociology, 49(2), 417-446.