
SequenceData()

Once your sequence dataset has been prepared with the preprocessing functions (a step you can skip if your data are already clean), the next stage is to formally define the sequence data structure. SequenceData is the canonical entry point for representing sequences in Sequenzo.

You might ask: why is this step necessary? Think of it as similar to how pandas (a Python package for data analysis) uses a DataFrame: before you can analyze tabular data efficiently, you first need a consistent container that standardizes how rows, columns, and metadata are stored.

In the same way, SequenceData() creates a SequenceData object, a dedicated data structure that represents sequences for social sequence analysis.

By doing so, it ensures that your sequences are stored in a unified format with:

  • consistent state definitions and ordering,
  • reproducible numeric encoding and color mapping,
  • built-in methods to summarize, validate, and visualize the dataset.

This formal definition is what allows all subsequent steps, such as distance computation, clustering, and visualization, to work reliably across different datasets and projects.

Typical Workflow

  1. Ensure your table has one row per entity (e.g., individual / firm / region / organization) and one column per time point.
    Example input DataFrame:
| Entity ID | Y1 | Y2 | Y3 | Y4 |
|---|---|---|---|---|
| 1 | EDU | EDU | FT | FT |
| 2 | EDU | UNEMP | UNEMP | FT |
| 3 | FT | FT | FT | FT |

Each row represents one entity (a sequence), and each column represents a time point.

Note:

It is recommended to clean column names during preprocessing so that time points are plain numbers (1, 2, 3, 4) rather than Y1 to Y4.
Otherwise, visualizations whose x-axis represents time will show labels such as Y1, Y2, Y3, Y4, which look less clean and less intuitive than 1 to 4. For instructions on cleaning the time columns in your DataFrame, see Clean time columns.

  2. Provide the full, ordered list of states in the exact order you want them to appear in encodings and legends (e.g., 'Low', 'Medium', 'High', not shuffled).
  3. Optionally provide an ID column for stable indexing and clustering.
  4. Initialize SequenceData, then use values / to_numeric() for downstream algorithms or get_legend() / get_colormap() for plotting.
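The column-name cleanup recommended in the note above can be sketched in plain Python. This is only an illustration: the helper `strip_time_prefix` is hypothetical (it is not part of Sequenzo), and it assumes time columns consist of an optional alphabetic prefix followed by digits.

```python
import re

def strip_time_prefix(columns, id_col):
    """Map labels such as 'Y1' to '1'; the ID column is left untouched."""
    mapping = {}
    for name in columns:
        if name == id_col:
            continue
        # Optional letters/underscores followed by the numeric time point
        match = re.fullmatch(r"[A-Za-z_]*(\d+)", name)
        if match:
            mapping[name] = match.group(1)
    return mapping

columns = ["Entity ID", "Y1", "Y2", "Y3", "Y4"]
mapping = strip_time_prefix(columns, id_col="Entity ID")
time = [mapping[c] for c in columns if c in mapping]

print(mapping)  # {'Y1': '1', 'Y2': '2', 'Y3': '3', 'Y4': '4'}
print(time)     # ['1', '2', '3', '4']

# With pandas, the mapping would then be applied as:
#     df = df.rename(columns=mapping)
```

The resulting `time` list can be passed to the time parameter once the DataFrame columns have been renamed accordingly.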

Function Usage

A minimal example with the core parameters (sufficient for most use cases; labels is optional):

python
from sequenzo import SequenceData

sequence = SequenceData(
    data=df,
    time=['1','2','3', ...],        # ordered time columns
    states=['EDU','FT','UNEMP'],    # full, ordered state space
    # Optional display labels, matched one-to-one with states by position;
    # if labels are not set, legends fall back to the state names.
    labels=['Education','Full-time','Unemployed'],
    id_col='Entity ID',             # ID column; if missing, create one with assign_unique_ids
)

A complete example with all available parameters (for advanced customization):

python
from sequenzo import SequenceData

sequence = SequenceData(
    data=df,
    time=['1','2','3', ...],        # ordered time columns
    states=['EDU','FT','UNEMP'],    # full, ordered state space
    # Optional display labels, matched one-to-one with states by position;
    # if labels are not set, legends fall back to the state names.
    labels=['Education','Full-time','Unemployed'],
    id_col='Entity ID',             # ID column; if missing, create one with assign_unique_ids
    weights=None,                   # optional (defaults to 1 per row)
    start=1,                        # start index used in summaries
    custom_colors=None              # optional list of colors
)

Entry Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| data | Yes | DataFrame | Input dataset with rows = entities, cols = time points. |
| time | Yes | list | Ordered list of time column names. |
| states | Yes | list | Ordered state space. Controls encoding & colors. |
| labels | No | list | Human-readable names, same length as states. |
| id_col | | str | Column name containing unique sequence IDs. If your data lacks such a column, create one with assign_unique_ids prior to defining the sequence data. |
| weights | No | ndarray | Row weights. Default = all ones. |
| start | No | int | Starting index in summaries. Default = 1. |
| custom_colors | No | list | User-specified color list. Must match states. |

Note

  1. Instead of slicing arrays to indirectly construct the time list (e.g., time=list(df.columns)[1:]), we recommend explicitly specifying it (e.g., time=[str(y) for y in range(1800, 2023)]). This makes the intended time span unambiguous and prevents indexing errors.

  2. Here, "summaries" (the term used in the start parameter description) refers to the dataset overview produced after initializing SequenceData, typically printed via describe() (and related methods). It includes state distributions, a missing-value overview, sequence lengths, and so on; see Examples below for concrete output. The start parameter sets the starting index shown in these summaries (e.g., 1 rather than 0).
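The first note can be illustrated with a small sketch that contrasts the explicit and the sliced time list. The column layout (one ID column named "country" followed by yearly columns 1800 to 2022) and the helper `check_time_columns` are assumptions made for this illustration; neither comes from Sequenzo.

```python
def check_time_columns(columns, time):
    """Return the requested time columns that are absent from the table."""
    available = set(columns)
    return [t for t in time if t not in available]

# Hypothetical header: one ID column followed by yearly columns 1800..2022
columns = ["country"] + [str(y) for y in range(1800, 2023)]

explicit = [str(y) for y in range(1800, 2023)]  # the intended span is visible
sliced = columns[1:]                            # depends on column position

print(len(explicit))                          # 223
print(explicit == sliced)                     # True
print(check_time_columns(columns, explicit))  # []

# If another column is later inserted after the ID, slicing silently picks
# it up as a "time point", while the explicit list is unaffected.
columns_v2 = ["country", "region"] + explicit
print(columns_v2[1:][0])                      # region
```

Checking the explicit list against the actual columns before constructing SequenceData is a cheap way to surface a mistyped span early.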

Key rules to remember

1. Order is everything

Colors follow the order of states you pass in. If states = [1, 2, 3], the first color maps to state 1, the second to 2, etc. Keep the same states order across your whole project to keep colors consistent.

2. Labels must be strings

Labels are used in legends. If you pass non-string labels, Sequenzo will warn you.

3. Missing values get a fixed light gray by default

If your data contain missing cells and you didn’t include a dedicated “Missing” state, SequenceData will auto-add "Missing" into the list of states, and color it light gray (with the color code #cfcccc).

If you provide custom_colors with one fewer entry than the final number of states (i.e., you only colored the non-missing states), the class will append the gray for you automatically.

4. Length checks

If you provide the custom_colors parameter, its length must match either:

  • the total number of states (including "Missing" if you included it yourself), or
  • the number of non-missing states (SequenceData will then append gray for Missing).
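Rules 3 and 4 can be summarized in a short sketch. The function `resolve_colors` is a hypothetical stand-in written for this page, not Sequenzo's implementation, and it assumes that "Missing" is the last entry in the state list (as it is when it has been auto-added). The hex codes other than #cfcccc are arbitrary examples.

```python
MISSING_GRAY = "#cfcccc"  # default color for "Missing" given in rule 3

def resolve_colors(states, custom_colors):
    """Return one color per state, appending gray when 'Missing' was left uncolored."""
    n_total = len(states)
    n_non_missing = sum(1 for s in states if s != "Missing")
    if len(custom_colors) == n_total:
        return list(custom_colors)                   # every state was colored
    if len(custom_colors) == n_non_missing:
        return list(custom_colors) + [MISSING_GRAY]  # gray appended for Missing
    raise ValueError(
        f"expected {n_total} or {n_non_missing} colors, got {len(custom_colors)}"
    )

states = ["EDU", "FT", "UNEMP", "Missing"]
print(resolve_colors(states, ["#BD462D", "#1f77b4", "#2ca02c"]))
# ['#BD462D', '#1f77b4', '#2ca02c', '#cfcccc']
```

Any other length is ambiguous, which is why the sketch raises an error instead of guessing which state was left uncolored.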

5. A long list of states

By default, ≤20 states use a Spectral palette (reversed for better readability), 21-40 states use viridis, and >40 states use a combined palette (viridis + Set3 + tab20). You can override any of these by supplying custom_colors.

For further details about coloring, please refer to this documentation.

Key Features

Validation

  • Ensures all states exist in the data.
  • Confirms id_col uniqueness (if provided).
  • Checks labels length and type.
  • Validates weights length; defaults to 1 if omitted.
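As a rough illustration of the ID, label, and weight checks listed above, the following sketch collects problems in a list. `validate_inputs` is a hypothetical helper written for this page; how Sequenzo actually reports each problem (error or warning) is not shown here.

```python
def validate_inputs(ids, states, labels, weights, n_rows):
    """Collect validation problems; return them with the (possibly defaulted) weights."""
    problems = []
    # IDs must be unique when an ID column is provided
    if ids is not None and len(set(ids)) != len(ids):
        problems.append("id_col contains duplicate IDs")
    # Labels must match states in length and be strings
    if labels is not None:
        if len(labels) != len(states):
            problems.append("labels and states differ in length")
        if not all(isinstance(label, str) for label in labels):
            problems.append("labels must be strings")
    # Weights default to 1 per row; otherwise one weight per row is required
    if weights is None:
        weights = [1.0] * n_rows
    elif len(weights) != n_rows:
        problems.append("weights length does not match the number of rows")
    return problems, weights

problems, weights = validate_inputs(
    ids=[1, 2, 2],
    states=["EDU", "FT"],
    labels=["Education", 2],
    weights=None,
    n_rows=3,
)
print(problems)  # ['id_col contains duplicate IDs', 'labels must be strings']
print(weights)   # [1.0, 1.0, 1.0]
```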

Missing Values

  • Detects NA cells automatically.
  • If your states list doesn’t include "Missing" but the data contains missing values, Sequenzo will auto-add "Missing" so those cells are clearly labeled as missing.
  • Maps missing cells to the last integer code.
  • Recommendation: explicitly include "Missing" in your states and labels.
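The auto-add behavior described above can be sketched as follows. The helpers `is_missing` and `add_missing_state` are hypothetical and use plain lists in place of a DataFrame; they illustrate the rule and are not Sequenzo's code.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def add_missing_state(states, labels, rows):
    """Append 'Missing' to states and labels when the data contain missing cells."""
    has_missing = any(is_missing(v) for row in rows for v in row)
    if has_missing and "Missing" not in states:
        return states + ["Missing"], labels + ["Missing"]
    return list(states), list(labels)

rows = [
    ["EDU", "EDU", "FT", "FT"],
    ["EDU", None, "UNEMP", "FT"],  # one missing cell
]
states, labels = add_missing_state(
    ["EDU", "FT", "UNEMP"],
    ["Education", "Full-time", "Unemployed"],
    rows,
)
print(states)  # ['EDU', 'FT', 'UNEMP', 'Missing']
print(labels)  # ['Education', 'Full-time', 'Unemployed', 'Missing']
```

Because "Missing" is appended at the end, it receives the last integer code, which matches the mapping described in the list above.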

Encoding & Colors

  • States are mapped in user-provided order → integer codes 1...N (where N is the total number of states).

  • This order controls:

    • integer encoding
    • colormap assignment
    • legend order
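To make the role of the order concrete, here is a minimal sketch of the 1...N encoding in plain Python. The function `build_encoding` is a hypothetical illustration of the mapping described above, not a Sequenzo function.

```python
def build_encoding(states):
    """Map each state to an integer code 1...N in the order given."""
    return {state: code for code, state in enumerate(states, start=1)}

states = ["EDU", "FT", "UNEMP"]
encoding = build_encoding(states)

sequence = ["EDU", "UNEMP", "UNEMP", "FT"]
codes = [encoding[s] for s in sequence]
print(encoding)  # {'EDU': 1, 'FT': 2, 'UNEMP': 3}
print(codes)     # [1, 3, 3, 2]

# The same data under a different state order yields different codes,
# and therefore different colors and a different legend order.
shuffled = build_encoding(["FT", "EDU", "UNEMP"])
print([shuffled[s] for s in sequence])  # [2, 3, 3, 1]
```

This is why keeping one fixed states list across a project matters: any comparison of encoded sequences or plots assumes the codes mean the same thing everywhere.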

Color Management

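  • If custom_colors is given, its length must match the number of states, or the number of non-missing states (see "Length checks" above); custom_colors is a list whose elements can be hex color codes (e.g., #BD462D).
  • Otherwise, the default is the seaborn "Spectral" palette for ≤20 states; larger state spaces use the palettes listed under "A long list of states" above.
  • The Spectral palette is reversed by default for better contrast and readability.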

Core Attributes

  • seq.states, seq.labels → canonical state space.
  • seq.ids → entity IDs.
  • seq.n_sequences → number of sequences.
  • seq.n_steps → sequence length.
  • seq.weights → row weights (NumPy array).

Key Methods

| Method | Returns | Description |
|---|---|---|
| get_colormap() | ListedColormap | Colormap aligned to codes 1...N. |
| get_legend() | (handles, labels) | Prebuilt legend for plotting. |
| describe() | print | Dataset summary with missing overview. |
| plot_legend() | figure | Renders or saves the state legend. |

Note
We list key attributes and key methods here, but you’ll rarely need to call them directly. After you initialize SequenceData(), a dataset summary is printed automatically. These are mainly for:

  1. inspecting details (e.g., missingness, encodings, weights);
  2. plotting utilities (e.g., get_legend() / get_colormap());
  3. exporting data for downstream algorithms via to_numeric() / to_dataframe().

Examples

1. Minimal Construction (with missing values)

python
from sequenzo import SequenceData

# Define the time span: every column after the first
# (this assumes the first column holds the IDs)
time_list = list(df.columns)[1:]

# Decile states; the two end points carry descriptive names
# ('D1 (Very Low)', 'D10 (Very High)') for readability and interpretation.
# states = ['Very Low', 'Low', 'Middle', 'High', 'Very High']
states = ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9']

# Create the SequenceData object; the state names double as legend labels
sequence_data = SequenceData(df,
                             time=time_list,
                             id_col="country",
                             states=states,
                             labels=states)

sequence_data

Output:

python
[!] Detected missing values (empty cells) in the sequence data.
    → Automatically added 'Missing' to `states` and `labels` for compatibility.
    However, it's strongly recommended to manually include it when defining `states` and `labels`.
    For example:

        states = ['At Home', 'Left Home', 'Missing']
        labels = ['At Home', 'Left Home', 'Missing']

    This ensures consistent color mapping and avoids unexpected visualization errors.

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 194
[>] Number of time points: 223
[>] Min/Max sequence length: 216 / 223
[>] There are 7 missing values across 1 sequences.
    First few missing sequence IDs: ['Panama'] ...
[>] Top sequences with the most missing time points:
    (Each row shows a sequence ID and its number of missing values)

             Missing Count
Sequence ID               
Panama                   7
[>] States: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing']
[>] Labels: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing']
SequenceData(194 sequences, States: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing'])

2. Add IDs If Missing

python
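from sequenzo import SequenceData
from sequenzo.utils import assign_unique_ids

# Add an 'Entity ID' column to a table that has no ID column
df = assign_unique_ids(df, id_col_name='Entity ID')

# year_cols (the ordered time columns) and states are assumed to be defined beforehand
sequence = SequenceData(
    df,
    time=year_cols,
    states=states,
    id_col='Entity ID'
)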

Author(s)

Code: Yuqi Liang

Documentation: Yuqi Liang

Edited by: Yuqi Liang, Yukun Ming, Liangxingyun He

Released under the BSD-3-Clause License.