SequenceData()
After you have prepared your sequence dataset with the preprocessing functions (a step you may skip if your data are already clean), the next stage is to formally define the sequence data structure. In other words, SequenceData is the canonical entry point for representing sequences in Sequenzo.
You might ask: why is this step necessary? Think of it as similar to how pandas (a Python package for data analysis) uses a DataFrame: before you can analyze tabular data efficiently, you first need a consistent container that standardizes how rows, columns, and metadata are stored.
In the same way, SequenceData() creates a SequenceData object, a dedicated data structure for representing sequences in social sequence analysis.
By doing so, it ensures that your sequences are stored in a unified format with:
- consistent state definitions and ordering,
- reproducible numeric encoding and color mapping,
- built-in methods to summarize, validate, and visualize the dataset.
This formal definition is what allows all subsequent steps, such as distance computation, clustering, and visualization, to work reliably across different datasets and projects.
Typical Workflow
- Ensure your table has one row per entity (e.g., individual / firm / region / organization) and one column per time point.
Example input DataFrame:
| Entity ID | Y1 | Y2 | Y3 | Y4 |
|---|---|---|---|---|
| 1 | EDU | EDU | FT | FT |
| 2 | EDU | UNEMP | UNEMP | FT |
| 3 | FT | FT | FT | FT |
Each row represents an individual (a sequence), and each column represents a time point.
Note:
We recommend cleaning column names during preprocessing so that time points are pure numbers (1, 2, 3, 4) instead of Y1–Y4.
Otherwise, in visualizations where the x-axis represents time, the labels Y1, Y2, Y3, Y4 will appear, which look less clean and less intuitive than 1–4. For instructions on cleaning the time columns in your DataFrame, see Clean time columns.
- Provide the full, ordered list of states in the exact order you want them to appear in encodings and legends (e.g., 'Low', 'Medium', 'High' — not shuffled).
- Optionally provide an ID column for stable indexing and clustering.
- Initialize SequenceData, then use values / to_numeric() for downstream algorithms or get_legend() / get_colormap() for plotting.
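The column-name cleanup suggested in the note above can be sketched in plain Python. This is a minimal illustration, not a Sequenzo function: the helper name clean_time_labels is hypothetical, and it assumes each time label contains a single run of digits (for example Y1 or Wave 3).

```python
import re


def clean_time_labels(columns, id_col):
    """Map labels such as 'Y1' to '1' by keeping only their digits.

    Hypothetical helper for illustration; the ID column is left untouched.
    """
    mapping = {}
    for col in columns:
        if col == id_col:
            continue
        digits = re.sub(r"\D", "", str(col))
        if not digits:
            raise ValueError(f"No digits found in time column name: {col!r}")
        mapping[col] = digits
    return mapping


columns = ["Entity ID", "Y1", "Y2", "Y3", "Y4"]
mapping = clean_time_labels(columns, id_col="Entity ID")
time = list(mapping.values())  # ordered list to pass as the time argument

print(mapping)
print(time)
```

With pandas, the resulting mapping can be applied with df.rename(columns=mapping) before constructing SequenceData.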
Function Usage
A minimal example with the required parameters plus optional display labels (sufficient for most use cases):
sequence = SequenceData(
    data=df,
    time=['1','2','3', ...], # ordered time columns
    states=['EDU','FT','UNEMP'], # full, ordered state space
    labels=['Education','Full-time','Unemployed'], # optional display labels; the order must correspond one-to-one with states; if labels are not set, legends will fall back to the state names
    id_col='Entity ID', # ID column; if missing, create one with assign_unique_ids
)
A complete example with all available parameters (for advanced customization):
sequence = SequenceData(
    data=df,
    time=['1','2','3', ...], # ordered time columns
    states=['EDU','FT','UNEMP'], # full, ordered state space
    labels=['Education','Full-time','Unemployed'], # optional display labels; the order must correspond one-to-one with states; if labels are not set, legends will fall back to the state names
    id_col='Entity ID', # ID column; if missing, create one with assign_unique_ids
    weights=None, # optional (defaults to 1 per row)
    start=1, # start index used in summaries
    custom_colors=None # optional list of colors
)
Entry Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
| data | ✓ | DataFrame | Input dataset with rows = entities, cols = time points. |
| time | ✓ | list | Ordered list of time column names. |
| states | ✓ | list | Ordered state space. Controls encoding & colors. |
| labels | ✗ | list | Human-readable names, same length as states. |
| id_col | ✓ | str | Column name containing unique sequence IDs. If your data lacks such a column, create one with assign_unique_ids prior to defining the sequence data. |
| weights | ✗ | ndarray | Row weights. Default = all ones. |
| start | ✗ | int | Starting index in summaries. Default = 1. |
| custom_colors | ✗ | list | User-specified color list. Must match states. |
Note
Instead of slicing arrays to indirectly construct the time list (e.g., time=list(df.columns)[1:]), we recommend explicitly specifying it (e.g., time=[str(y) for y in range(1800, 2023)]). This makes the intended time span unambiguous and prevents indexing errors.
Here, summaries (the same “summaries” mentioned in the start parameter description) refers to the dataset overview produced after initializing SequenceData, typically printed via describe() (and related methods). It includes state distributions, a missing-value overview, sequence length, etc. See Examples below for concrete outputs. The start parameter sets the starting index shown in these summaries (e.g., start at 1 rather than 0).
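To illustrate the recommendation to spell out the time list, the sketch below builds the list explicitly and compares it with the available column names. The helper check_time_columns and the extra "notes" column are hypothetical, used only to show how slicing with [1:] could silently pick up a non-time column.

```python
def check_time_columns(columns, time, id_col):
    """Return (missing, unexpected).

    missing: names in `time` that are not among the columns.
    unexpected: non-ID columns that are not part of `time`.
    Hypothetical helper for illustration only.
    """
    column_set = set(columns)
    time_set = set(time)
    missing = [t for t in time if t not in column_set]
    unexpected = [c for c in columns if c != id_col and c not in time_set]
    return missing, unexpected


# Explicit time span: the years 1800 through 2022.
time = [str(year) for year in range(1800, 2023)]

# Stand-in for list(df.columns); the trailing "notes" column is made up.
columns = ["country"] + time + ["notes"]

missing, unexpected = check_time_columns(columns, time, id_col="country")
print(len(time), time[0], time[-1])  # 223 1800 2022
print(missing, unexpected)           # [] ['notes']
```

Here list(columns)[1:] would have included "notes" as a time point, whereas the explicit list does not.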
Key rules to remember
1. Order is everything
Colors follow the order of states you pass in. If states = [1, 2, 3], the first color maps to state 1, the second to 2, etc. Keep the same states order across your whole project to keep colors consistent.
2. Labels must be strings
Labels are used in legends. If you pass non-string labels, Sequenzo will warn you.
3. Missing values get a fixed light gray by default
If your data contain missing cells and you didn’t include a dedicated “Missing” state, SequenceData will auto-add "Missing" into the list of states, and color it light gray (with the color code #cfcccc).
If you provide custom_colors with one fewer entry than the final number of states (i.e., you only colored the non-missing states), the class will append the gray for you automatically.
4. Length checks
If you provide the custom_colors parameter, its length must match either:
- the total number of states (including "Missing" if you included it yourself), or
- the number of non-missing states (SequenceData will then append gray for Missing).
5. A long list of states
By default, ≤20 states use a Spectral palette (reversed for better readability), 21-40 states use viridis, and >40 states use a combined palette (viridis + Set3 + tab20). You can override any of these by supplying custom_colors.
For further details about coloring, please refer to this documentation.
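The length rule for custom_colors (rule 4 above) can be restated as a small function. This is an illustrative sketch of the rule as described in this section, not Sequenzo's own implementation; resolve_colors and its has_missing argument are hypothetical names.

```python
MISSING_COLOR = "#cfcccc"  # the light gray named in rule 3


def resolve_colors(states, custom_colors, has_missing):
    """Apply the documented length rule and return (final_states, colors).

    Illustrative only: mirrors the rule in the text, not the library code.
    """
    final_states = list(states)
    if has_missing and "Missing" not in final_states:
        final_states.append("Missing")  # auto-added state goes last

    colors = list(custom_colors)
    n_total = len(final_states)
    n_non_missing = sum(1 for s in final_states if s != "Missing")

    if len(colors) == n_total:
        return final_states, colors
    if "Missing" in final_states and len(colors) == n_non_missing:
        # Place the gray at the position of "Missing"; when "Missing" is the
        # last state this is the same as appending it.
        colors.insert(final_states.index("Missing"), MISSING_COLOR)
        return final_states, colors
    raise ValueError(
        f"custom_colors has {len(colors)} entries; expected "
        f"{n_total} or {n_non_missing}"
    )


final_states, colors = resolve_colors(
    ["EDU", "FT", "UNEMP"],
    ["#1b9e77", "#d95f02", "#7570b3"],
    has_missing=True,
)
print(final_states)
print(colors)
```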
Key Features
Validation
- Ensures all states exist in the data.
- Confirms id_col uniqueness (if provided).
- Checks labels length and type.
- Validates weights length; defaults to 1 if omitted.
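The checks listed above can be approximated in a few lines of plain Python. This is a rough sketch under stated assumptions, not the library's validation code: validate_inputs is a hypothetical helper, rows are plain dicts rather than a DataFrame, and the state check is interpreted here as "every observed value must be a declared state".

```python
def validate_inputs(rows, time, states, id_col, labels=None, weights=None):
    """Collect problems instead of raising, so all issues are reported at once.

    rows is a list of dicts (one per entity); None marks a missing cell.
    Hypothetical helper for illustration only.
    """
    problems = []

    # States: every value observed in the time columns should be declared.
    observed = {row[t] for row in rows for t in time if row[t] is not None}
    undeclared = sorted(observed - set(states))
    if undeclared:
        problems.append(f"values not in states: {undeclared}")

    # IDs: each sequence needs a unique identifier.
    ids = [row[id_col] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append(f"duplicate values in {id_col!r}")

    # Labels: same length as states, and every label a string.
    if labels is not None:
        if len(labels) != len(states):
            problems.append("labels and states differ in length")
        if not all(isinstance(label, str) for label in labels):
            problems.append("labels must all be strings")

    # Weights: default to 1 per row, otherwise one weight per row.
    if weights is None:
        weights = [1.0] * len(rows)
    elif len(weights) != len(rows):
        problems.append("weights must have one entry per row")

    return problems, weights


rows = [
    {"Entity ID": 1, "1": "EDU", "2": "EDU", "3": "FT", "4": "FT"},
    {"Entity ID": 2, "1": "EDU", "2": "UNEMP", "3": "UNEMP", "4": "FT"},
    {"Entity ID": 3, "1": "FT", "2": "FT", "3": "FT", "4": "FT"},
]
problems, weights = validate_inputs(
    rows,
    time=["1", "2", "3", "4"],
    states=["EDU", "FT", "UNEMP"],
    id_col="Entity ID",
    labels=["Education", "Full-time", "Unemployed"],
)
print(problems)
print(weights)
```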
Missing Values
- Detects NA cells automatically.
- If your states list doesn’t include "Missing" but the data contains missing values, Sequenzo will auto-add "Missing" so those cells are clearly labeled as missing.
- Maps missing cells to the last integer code.
- Recommendation: explicitly include "Missing" in your states and labels.
Encoding & Colors
States are mapped in user-provided order → integer codes 1...N (where N is the total number of states).
This order controls:
- integer encoding
- colormap assignment
- legend order
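The ordered encoding described above, together with the handling of missing cells, can be sketched as follows. This is an illustration of the mapping rule only, not Sequenzo's internal code; build_encoding and encode are hypothetical names, and None stands in for an NA cell.

```python
def build_encoding(states, has_missing):
    """Map states to integer codes 1...N in the order given by the user.

    If the data has missing cells and "Missing" was not declared, it is
    appended, so it receives the last code. Illustrative sketch only.
    """
    ordered = list(states)
    if has_missing and "Missing" not in ordered:
        ordered.append("Missing")
    return {state: code for code, state in enumerate(ordered, start=1)}


def encode(sequence, mapping):
    """Translate one sequence of state names into integer codes."""
    return [mapping["Missing"] if value is None else mapping[value] for value in sequence]


mapping = build_encoding(["EDU", "FT", "UNEMP"], has_missing=True)
codes = encode(["EDU", None, "UNEMP", "FT"], mapping)

print(mapping)  # {'EDU': 1, 'FT': 2, 'UNEMP': 3, 'Missing': 4}
print(codes)    # [1, 4, 3, 2]
```

Because the codes follow list order, passing the same states list in a different order would change both the integer encoding and the color assigned to each state.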
Color Management
- If custom_colors is given, its length must match states; custom_colors is a list whose elements can be hex color codes (e.g., #BD462D).
- Otherwise, the default option is seaborn "Spectral" (≤20 states) or "cubehelix".
- Colors are reversed by default for contrast.
Core Attributes
- seq.states, seq.labels → canonical state space.
- seq.ids → entity IDs.
- seq.n_sequences → number of sequences.
- seq.n_steps → sequence length.
- seq.weights → row weights (NumPy array).
Key Methods
| Method | Returns | Description |
|---|---|---|
| get_colormap() | ListedColormap | Colormap aligned to codes 1...N. |
| get_legend() | (handles, labels) | Prebuilt legend for plotting. |
| describe() | | Dataset summary with missing overview. |
| plot_legend() | figure | Renders or saves the state legend. |
Note
We list key attributes and key methods here, but you’ll rarely need to call them directly. After you initialize SequenceData(), a dataset summary is printed automatically. These are mainly for:
- inspecting details (e.g., missingness, encodings, weights);
- plotting utilities (e.g., get_legend() / get_colormap());
- exporting data for downstream algorithms via to_numeric() / to_dataframe().
Examples
1. Minimal Construction (with missing values)
# Create a SequenceData object
# Define the time-span variable
time_list = list(df.columns)[1:]
# We choose to use 'D1 (Very Low)', 'D10 (Very High)' as the states for readability and interpretation.
# states = ['Very Low', 'Low', 'Middle', 'High', 'Very High']
states = ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9']
sequence_data = SequenceData(
    df,
    time=time_list,
    id_col="country",
    states=states,
    labels=states,
)
sequence_data
Output:
[!] Detected missing values (empty cells) in the sequence data.
→ Automatically added 'Missing' to `states` and `labels` for compatibility.
However, it's strongly recommended to manually include it when defining `states` and `labels`.
For example:
states = ['At Home', 'Left Home', 'Missing']
labels = ['At Home', 'Left Home', 'Missing']
This ensures consistent color mapping and avoids unexpected visualization errors.
[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 194
[>] Number of time points: 223
[>] Min/Max sequence length: 216 / 223
[>] There are 7 missing values across 1 sequences.
First few missing sequence IDs: ['Panama'] ...
[>] Top sequences with the most missing time points:
(Each row shows a sequence ID and its number of missing values)
Missing Count
Sequence ID
Panama 7
[>] States: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing']
[>] Labels: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing']
SequenceData(194 sequences, States: ['D1 (Very Low)', 'D10 (Very High)', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'Missing'])
2. Add IDs If Missing
from sequenzo.utils import assign_unique_ids
df = assign_unique_ids(df, id_col_name='Entity ID')
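# year_cols (ordered time column names) and states (ordered state list)
# are assumed to have been defined beforehand.
sequence = SequenceData(
    df,
    time=year_cols,
    states=states,
    id_col='Entity ID'
)
Author(s)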
Code: Yuqi Liang
Documentation: Yuqi Liang
Edited by: Yuqi Liang, Yukun Ming, Liangxingyun He