`SequenceData()`

After preparing your sequence dataset with the preprocessing functions (you may skip that step if your data are already clean), the next stage is to formally define the sequence data structure. SequenceData is the canonical entry point for representing sequences in Sequenzo.

You might ask: why is this step necessary? Think of it as similar to how pandas (a Python package for data analysis) uses a DataFrame: before you can analyze tabular data efficiently, you first need a consistent container that standardizes how rows, columns, and metadata are stored.

In the same way, SequenceData() creates a SequenceData object, a dedicated data structure that represents sequences for social sequence analysis.

By doing so, it ensures that your sequences are stored in a unified format with:

consistent state definitions and ordering,
reproducible numeric encoding and color mapping,
built-in methods to summarize, validate, and visualize the dataset.

This formal definition is what allows all subsequent steps, such as distance computation, clustering, and visualization, to work reliably across different datasets and projects.

Typical Workflow

Ensure your table has one row per entity (e.g., individual / firm / region / organization) and one column per time point.
Example input DataFrame:

Entity ID	Y1	Y2	Y3	Y4
1	EDU	EDU	FT	FT
2	EDU	UNEMP	UNEMP	FT
3	FT	FT	FT	FT

Each row represents one entity (a sequence), and each column represents a time point.

Note:

It is recommended to clean column names during preprocessing so that time points are plain numbers (1, 2, 3, 4) rather than Y1–Y4.
Otherwise, visualizations whose x-axis represents time will show labels such as Y1, Y2, Y3, Y4, which look less clean and less intuitive than 1–4. For instructions on cleaning the time columns of your DataFrame, see Clean time columns.

Provide the full, ordered list of states in the exact order you want them to appear in encodings and legends (e.g., 'Low', 'Medium', 'High' — not shuffled).
Provide an ID column for stable indexing and clustering; if your data lacks one, create it with assign_unique_ids first.
Initialize SequenceData, then use values / to_numeric() for downstream algorithms or get_legend() / get_colormap() for plotting.
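The column-name advice in the note above can be sketched in plain Python. This is only an illustration: the helper name `clean_time_label` is hypothetical (not a Sequenzo function), and the `Y` prefix is taken from the example table. The resulting mapping could then be passed to pandas `DataFrame.rename(columns=...)`.

```python
import re


def clean_time_label(column_name):
    """Return the numeric part of a time-column name, e.g. 'Y1' -> '1'.

    Hypothetical helper for illustration; not part of Sequenzo.
    """
    match = re.search(r"\d+", column_name)
    if match is None:
        raise ValueError(f"No digits found in column name: {column_name!r}")
    return match.group()


# Time columns from the example table above (the ID column is left untouched).
time_columns = ["Y1", "Y2", "Y3", "Y4"]

# Map each old column name to its cleaned, purely numeric name.
rename_map = {name: clean_time_label(name) for name in time_columns}
cleaned_time = list(rename_map.values())

print(rename_map)
print(cleaned_time)
```

The cleaned names stay strings, which matches the quoted time labels used in the usage examples on this page.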

Function Usage

A minimal example with the required parameters plus optional display labels (sufficient for most use cases):

python

sequence = SequenceData(
    data=df,
    time=['1','2','3', ...],        # ordered time columns
    states=['EDU','FT','UNEMP'],    # full, ordered state space
    # Optional display labels, in one-to-one order with states;
    # if labels are not set, legends fall back to the state names.
    labels=['Education','Full-time','Unemployed'],
    id_col='Entity ID',             # ID column; if missing, create one with assign_unique_ids
)

A complete example with all available parameters (for advanced customization):

python

sequence = SequenceData(
    data=df,
    time=['1','2','3', ...],        # ordered time columns
    states=['EDU','FT','UNEMP'],    # full, ordered state space
    # Optional display labels, in one-to-one order with states;
    # if labels are not set, legends fall back to the state names.
    labels=['Education','Full-time','Unemployed'],
    id_col='Entity ID',             # ID column; if missing, create one with assign_unique_ids
    weights=None,                   # optional (defaults to 1 per row)
    start=1,                        # start index used in summaries
    custom_colors=None              # optional list of colors
)

Entry Parameters

Parameter	Required	Type	Description
`data`	✓	DataFrame	Input dataset with rows = entities, cols = time points.
`time`	✓	list	Ordered list of time column names.
`states`	✓	list	Ordered state space. Controls encoding & colors.
`labels`	✗	list	Human-readable names, same length as `states`.
`id_col`	✓	str	Column name containing unique sequence IDs. If your data lacks such a column, create one with `assign_unique_ids` prior to defining the sequence data.
`weights`	✗	ndarray	Row weights. Default = all ones.
`start`	✗	int	Starting index in summaries. Default = 1.
`custom_colors`	✗	list	User-specified color list. Must match `states`.

Note
Instead of constructing the time list indirectly by slicing the column index (e.g., time=list(df.columns)[1:]), we recommend specifying it explicitly (e.g., time=[str(y) for y in range(1800, 2023)]). This makes the intended time span unambiguous and prevents indexing errors.
The "summaries" mentioned in the description of the start parameter are the dataset overview produced after initializing SequenceData, typically printed via describe() and related methods. The overview includes state distributions, a missing-value overview, sequence lengths, and similar statistics. See Examples below for concrete output. The start parameter sets the first index shown in these summaries (e.g., starting at 1 rather than 0).
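As a small sketch of the explicit approach recommended in this note, the snippet below builds the time list for 1800 through 2022 and checks it against a list of column names. The `columns` list is a hypothetical stand-in for `df.columns`, and the check itself is an illustration rather than something Sequenzo is documented to do.

```python
# Explicit time span: every year from 1800 through 2022, as strings.
time = [str(year) for year in range(1800, 2023)]

# Hypothetical column list standing in for df.columns (ID column first).
columns = ["country"] + time

# Verify that every requested time column is actually present.
missing_columns = [t for t in time if t not in columns]
if missing_columns:
    raise KeyError(f"Time columns not found in the data: {missing_columns}")

print(len(time), time[0], time[-1])  # 223 1800 2022
```

Because the span is written out, a reader can see the first and last time point without inspecting the DataFrame.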

Key rules to remember

1. Order is everything

Colors follow the order of states you pass in. If states = [1, 2, 3], the first color maps to state 1, the second to 2, etc. Keep the same states order across your whole project to keep colors consistent.
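The effect of ordering can be illustrated with a plain dictionary. This is a sketch of the positional pairing described above, not Sequenzo's internal code, and the three hex colors are arbitrary examples.

```python
palette = ["#BD462D", "#2D8FBD", "#6ABD2D"]  # arbitrary example colors

# Colors are paired with states by position.
states = [1, 2, 3]
color_of = dict(zip(states, palette))

# The same states in a different order receive different colors.
reordered_states = [3, 1, 2]
color_of_reordered = dict(zip(reordered_states, palette))

print(color_of[1])            # first color in the palette
print(color_of_reordered[1])  # second color in the palette
```

Keeping one fixed `states` list for the whole project avoids this kind of silent color swap between figures.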

2. Labels must be strings

Labels are used in legends. If you pass non-string labels, Sequenzo will warn you.

3. Missing values get a fixed light gray by default

If your data contain missing cells and you didn’t include a dedicated “Missing” state, SequenceData will auto-add "Missing" into the list of states, and color it light gray (with the color code #cfcccc).

If you provide custom_colors with one fewer entry than the final number of states (i.e., you only colored the non-missing states), the class will append the gray for you automatically.

4. Length checks

If you provide the custom_colors parameter, its length must match either:

the total number of states (including "Missing" if you included it yourself), or
the number of non-missing states (SequenceData will then append gray for Missing).
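The two accepted lengths can be expressed as a short check. The function below is a hypothetical helper that mirrors the rules stated above; it is not Sequenzo's own implementation, and it assumes "Missing" is the last entry of the state list.

```python
MISSING_COLOR = "#cfcccc"  # light gray quoted in the text above


def resolve_colors(states, custom_colors, missing_label="Missing"):
    """Return one color per state, appending gray for Missing when needed.

    Hypothetical helper for illustration; assumes Missing is the last state.
    """
    n_states = len(states)
    n_colors = len(custom_colors)

    # Case 1: one color for every state, Missing included.
    if n_colors == n_states:
        return list(custom_colors)

    # Case 2: colors for the non-missing states only; gray is appended.
    missing_is_last = n_states > 0 and states[-1] == missing_label
    if missing_is_last and n_colors == n_states - 1:
        return list(custom_colors) + [MISSING_COLOR]

    raise ValueError(
        f"Got {n_colors} colors for {n_states} states; expected "
        f"{n_states} or {n_states - 1} (when the last state is Missing)."
    )


states = ["EDU", "FT", "UNEMP", "Missing"]
colors = resolve_colors(states, ["#BD462D", "#2D8FBD", "#6ABD2D"])
print(colors)
```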

5. A long list of states

By default, ≤20 states use a Spectral palette (reversed for better readability), 21-40 states use viridis, and >40 states use a combined palette (viridis + Set3 + tab20). You can override any of these by supplying custom_colors.
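The thresholds above can be written as a simple lookup. This sketch only returns the palette names given in the text; the function name is hypothetical, and how Sequenzo actually builds the colors is not shown here.

```python
def default_palette_family(n_states):
    """Return the default palette described above for a given state count.

    Hypothetical helper for illustration; not a Sequenzo function.
    """
    if n_states <= 20:
        return "Spectral (reversed)"
    if n_states <= 40:
        return "viridis"
    return "viridis + Set3 + tab20"


for n in (5, 20, 21, 40, 41):
    print(n, default_palette_family(n))
```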

For further details about coloring, please refer to this documentation.

Key Features

Validation

Ensures all states exist in the data.
Confirms id_col uniqueness (if provided).
Checks labels length and type.
Validates weights length; defaults to 1 if omitted.
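A rough sketch of the ID, label, and weight checks listed above follows. The function name and messages are hypothetical, and it collects problems in a list rather than raising or warning as Sequenzo itself might; the check that states exist in the data is left out.

```python
def check_inputs(ids, states, labels, weights=None):
    """Return (problems, weights) for a set of SequenceData-style inputs.

    Hypothetical helper for illustration; not Sequenzo's validation code.
    """
    problems = []

    # IDs must identify each sequence uniquely.
    if len(set(ids)) != len(ids):
        problems.append("id_col values are not unique")

    # Labels must pair one-to-one with states and be strings.
    if len(labels) != len(states):
        problems.append("labels and states differ in length")
    if not all(isinstance(label, str) for label in labels):
        problems.append("labels should be strings")

    # Weights default to 1 per row; otherwise one weight per row is expected.
    if weights is None:
        weights = [1.0] * len(ids)
    elif len(weights) != len(ids):
        problems.append("weights length does not match the number of rows")

    return problems, weights


problems, weights = check_inputs(
    ids=[1, 2, 3],
    states=["EDU", "FT", "UNEMP"],
    labels=["Education", "Full-time", "Unemployed"],
)
print(problems, weights)
```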

Missing Values

Detects NA cells automatically.
If your states list doesn’t include "Missing" but the data contains missing values, Sequenzo will auto-add "Missing" so those cells are clearly labeled as missing.
Maps missing cells to the last integer code.
Recommendation: explicitly include "Missing" in your states and labels.
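The behavior described above can be sketched on a toy dataset. The snippet uses `None` as the missing marker and plain dictionaries instead of a DataFrame; both are simplifying assumptions, and the logic is an illustration rather than Sequenzo's actual code.

```python
# Toy sequences keyed by ID; None marks a missing cell (an assumption here).
rows = {
    "A": ["EDU", "EDU", None, "FT"],
    "B": ["FT", "FT", "FT", "FT"],
}
states = ["EDU", "FT"]

# Count missing cells per sequence.
missing_counts = {
    seq_id: sum(value is None for value in values)
    for seq_id, values in rows.items()
}
has_missing = any(count > 0 for count in missing_counts.values())

# Auto-add "Missing" at the end of the state list when it is needed.
if has_missing and "Missing" not in states:
    states = states + ["Missing"]

# With codes 1...N in list order, "Missing" takes the last integer code.
missing_code = states.index("Missing") + 1 if "Missing" in states else None

print(missing_counts)
print(states, missing_code)
```

Listing "Missing" explicitly in `states` and `labels` from the start makes this step unnecessary and keeps the state order under your control.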

Encoding & Colors

States are mapped in user-provided order → integer codes 1...N (where N is the total number of states).
This order controls:
- integer encoding
- colormap assignment
- legend order
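The 1...N encoding can be sketched with the example table from the Typical Workflow section. This is a plain-Python illustration of the mapping described above, not a call into Sequenzo; the variable names are arbitrary.

```python
# Ordered state space: position in this list determines the integer code.
states = ["EDU", "FT", "UNEMP"]
code_of = {state: code for code, state in enumerate(states, start=1)}

# The example table, keyed by Entity ID.
table = {
    1: ["EDU", "EDU", "FT", "FT"],
    2: ["EDU", "UNEMP", "UNEMP", "FT"],
    3: ["FT", "FT", "FT", "FT"],
}

# Replace each state with its integer code.
encoded = {
    entity_id: [code_of[state] for state in sequence]
    for entity_id, sequence in table.items()
}

print(code_of)
print(encoded)
```

Changing the order of `states` changes every code, which is why the same order also drives colors and legend entries.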

Color Management

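If custom_colors is given, its length must satisfy the length checks described under Key rules to remember; custom_colors is a list whose elements can be hex color codes (e.g., #BD462D).
Otherwise, a default palette is chosen according to the number of states, as described under "A long list of states" (for example, Spectral for ≤20 states).
The default Spectral palette is reversed for contrast.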

Core Attributes

seq.states, seq.labels → canonical state space.
seq.ids → entity IDs.
seq.n_sequences → number of sequences.
seq.n_steps → sequence length.
seq.weights → row weights (NumPy array).

Key Methods

Method	Returns	Description
`get_colormap()`	ListedColormap	Colormap aligned to codes 1...N.
`get_legend()`	(handles, labels)	Prebuilt legend for plotting.
`describe()`	print	Dataset summary with missing overview.
`plot_legend()`	figure	Renders or saves the state legend.

Note
We list key attributes and key methods here, but you’ll rarely need to call them directly. After you initialize SequenceData(), a dataset summary is printed automatically. These are mainly for:
inspecting details (e.g., missingness, encodings, weights);
plotting utilities (e.g., get_legend() / get_colormap());
exporting data for downstream algorithms via to_numeric() / to_dataframe().

Examples

1. Minimal Construction (with missing values)

python

# Create a SequenceData object

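# Define the time-span variable (every column after the first).
# Slicing keeps this example short; an explicit list of column names is the recommended form.
time_list = list(df.columns)[1:]

# We use labels such as 'D1 (Very Low)' and 'D10 (Very High)' as the states for readability and interpretation.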
# states = ['Very Low', 'Low', 'Middle', 'High', 'Very High']
states = ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9']

sequence_data = SequenceData(df, 
                             time=time_list,  
                             id_col="country", 
                             states=states,
                             labels=states)

sequence_data

Output:

python

[!] Detected missing values (empty cells) in the sequence data.
    → Automatically added 'Missing' to `states` and `labels` for compatibility.
    However, it's strongly recommended to manually include it when defining `states` and `labels`.
    For example:

        states = ['At Home', 'Left Home', 'Missing']
        labels = ['At Home', 'Left Home', 'Missing']

    This ensures consistent color mapping and avoids unexpected visualization errors.

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 194
[>] Number of time points: 223
[>] Min/Max sequence length: 216 / 223
[>] There are 7 missing values across 1 sequences.
    First few missing sequence IDs: ['Panama'] ...
[>] Top sequences with the most missing time points:
    (Each row shows a sequence ID and its number of missing values)

             Missing Count
Sequence ID               
Panama                   7
[>] States: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing']
[>] Labels: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing']
SequenceData(194 sequences, States: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing'])

2. Add IDs If Missing

python

from sequenzo.utils import assign_unique_ids
df = assign_unique_ids(df, id_col_name='Entity ID')

sequence = SequenceData(
    df,
    time=year_cols,
    states=states,
    id_col='Entity ID'
)

Author(s)

Code: Yuqi Liang

Documentation: Yuqi Liang

Edited by: Yuqi Liang, Yukun Ming, Liangxingyun He
