Cleaning time columns to pure numeric labels

Time columns like Y1, status1, pstatus15 are easier to use in Sequenzo when renamed to plain numbers: 1, 2, 15. This guide shows the simplest way first, then an optional manual method.

Sequenzo provides clean_time_columns_auto: it renames each time column to the number inside its name (e.g. status15 becomes 15). Other columns (e.g. id, sex) are left unchanged.

One-line idea:
Import the function, call it on your DataFrame, optionally tell it which column names to treat as time columns (by prefix).

Basic usage

```python
from sequenzo.data_preprocessing import clean_time_columns_auto

# Your data: columns like status1, status2, status3, status4
df_clean = clean_time_columns_auto(df, prefix_patterns=["status"])
# Result: those columns become 1, 2, 3, 4
```

  • df: your DataFrame.
  • prefix_patterns: list of prefixes. Only columns whose name starts with one of these are renamed.
    Example: ["status"] renames status1, status2, … to 1, 2, …
    Example: ["status", "pstatus"] also renames pstatus15 to 15, etc.

If you omit prefix_patterns (or pass None), the function processes every column whose name contains both letters and digits. Pass prefix_patterns when your data also has non-time columns (e.g. id, year_birth), particularly ones whose names contain digits, and you only want to rename time columns like status1, status2.
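To see how the two modes differ before touching your data, the selection rule described above can be mimicked in a few lines. This is only an illustrative sketch, not Sequenzo's implementation: preview_renames is a hypothetical helper, wave2_weight is a made-up column name, and the sketch assumes the new label is whatever digits remain after stripping every non-digit character (the same approach as the manual method in this guide).

```python
import re

def preview_renames(columns, prefix_patterns=None):
    """Hypothetical helper: guess which columns would be renamed, and to what.

    Mirrors the rule described in this guide, not Sequenzo's actual code.
    """
    renames = {}
    for name in columns:
        name = str(name)
        has_letters = re.search(r"[A-Za-z]", name) is not None
        has_digits = re.search(r"\d", name) is not None
        if prefix_patterns is None:
            # Default mode: any name mixing letters and digits is a candidate.
            selected = has_letters and has_digits
        else:
            # Prefix mode: the name must start with a listed prefix and contain digits.
            selected = has_digits and any(name.startswith(p) for p in prefix_patterns)
        if selected:
            # Assumption: the new label is the digits left after removing everything else.
            renames[name] = str(int(re.sub(r"\D+", "", name)))
    return renames

columns = ["id", "wave2_weight", "status1", "status2", "pstatus15"]
print(preview_renames(columns))
print(preview_renames(columns, prefix_patterns=["status"]))
```

Under this sketch, default mode also selects wave2_weight and maps it to the same label as status2, while prefix_patterns=["status"] selects only status1 and status2. Checking a preview like this on list(df.columns) is a cheap way to decide whether you need prefixes.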

Example

```python
import pandas as pd
from sequenzo.data_preprocessing import clean_time_columns_auto

df = pd.DataFrame({
    "id": [1, 2, 3],
    "status1": ["EDU", "EDU", "FT"],
    "status2": ["EDU", "UNEMP", "FT"],
    "status3": ["FT", "UNEMP", "FT"],
    "status4": ["FT", "FT", "FT"]
})

df_clean = clean_time_columns_auto(df, prefix_patterns=["status"])
print(df_clean.columns.tolist())   # ['id', '1', '2', '3', '4']
```

After cleaning, use df_clean with SequenceData and pass the numeric time labels (e.g. time=['1','2','3','4']). See Integrate with Sequenzo below.

Integrate with Sequenzo

After you have numeric time column names (e.g. 1, 2, 3, 4), create your sequence data like this:

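```python
from sequenzo import SequenceData

# Illustrative values matching the example DataFrame; replace with your own.
states = ["EDU", "UNEMP", "FT"]
labels = ["Education", "Unemployed", "Full-time"]

seq = SequenceData(
    data=df_clean,
    time=["1", "2", "3", "4"],   # match your cleaned column names
    states=states,
    labels=labels,
    id_col="id"
)
```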

Use string time labels (e.g. "1", "2") to avoid confusion with pandas column types.
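Rather than typing the time list by hand, it can be derived from the cleaned column names. A minimal sketch, assuming the cleaned labels are digit-only strings as in the example above; numeric_time_labels is a hypothetical helper, not part of Sequenzo.

```python
def numeric_time_labels(columns):
    """Return the digit-only column names as strings, sorted by numeric value."""
    # isdecimal() keeps only names made entirely of decimal digits, so "id" is skipped.
    labels = [str(c) for c in columns if str(c).isdecimal()]
    # Sort numerically so "10" comes after "9" rather than after "1".
    return sorted(labels, key=int)

time_labels = numeric_time_labels(["id", "1", "2", "10", "3"])
print(time_labels)  # ['1', '2', '3', '10']
```

With a DataFrame, you would pass df_clean.columns and use the result as the time argument. Sorting by int rather than as plain strings matters once you have ten or more time points.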

Optional: manual method (explicit column list)

If you prefer to specify exactly which columns are time columns (e.g. ["Y1","Y2","Y3","Y4"]) and rename them yourself:

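```python
import re
import pandas as pd

# List only the columns that hold time points.
time_cols = ["Y1", "Y2", "Y3", "Y4"]
rename_map = {}
for c in time_cols:
    # Strip every non-digit character, leaving e.g. "1" from "Y1".
    digits = re.sub(r"\D+", "", c)
    # int() normalizes the label (drops leading zeros); keep it as a string.
    rename_map[c] = str(int(digits))

# df is your existing DataFrame; columns not in rename_map are untouched.
df_clean = df.rename(columns=rename_map)
```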

This turns Y1 into 1, Y2 into 2, etc. Do not include non-time columns (e.g. Entity ID) in time_cols.
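The manual loop fails with a fairly opaque int() error if a listed name has no digits, and it maps two columns to the same label without warning if their digits coincide. A slightly more defensive variant is sketched below; build_rename_map is a hypothetical helper name, not a Sequenzo function.

```python
import re

def build_rename_map(time_cols):
    """Map each listed time column to its digits, e.g. 'Y1' -> '1'."""
    rename_map = {}
    seen = {}  # label -> original column, used to detect collisions
    for col in time_cols:
        digits = re.sub(r"\D+", "", str(col))
        if not digits:
            raise ValueError(f"No digits found in column name {col!r}")
        label = str(int(digits))  # int() drops leading zeros: 'Y01' -> '1'
        if label in seen:
            raise ValueError(f"{seen[label]!r} and {col!r} both map to {label!r}")
        seen[label] = col
        rename_map[col] = label
    return rename_map

rename_map = build_rename_map(["Y1", "Y2", "Y3", "Y4"])
print(rename_map)  # {'Y1': '1', 'Y2': '2', 'Y3': '3', 'Y4': '4'}
```

The resulting dictionary can be passed to df.rename(columns=rename_map) exactly as in the manual method.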


Code Author: Yuqi Liang

Document Author: Yuqi Liang, Liangxingyun He