create_idcd_sequence_from_csvs()
create_idcd_sequence_from_csvs() creates a multidomain sequence dataset using the IDCD (Individual Domain Combined Data) strategy. This approach combines sequence data from multiple CSV files (each representing one domain) into a single multidomain sequence object by creating composite states from observed combinations.
Unlike CAT or DAT, which work with separate domain sequences and combine them during distance computation, IDCD creates the combined sequences upfront. Each time point in the resulting sequences represents a combination of states from all domains (e.g., "Employed+Married" when combining employment and family domains).
This is useful when you want to work with the combined sequences directly, or when your data comes from separate CSV files that you need to merge into a unified sequence format.
Function Usage
A minimal example with only the required parameters:
sequence_data = create_idcd_sequence_from_csvs(csv_paths, time_cols)
A complete example with all available parameters:
sequence_data = create_idcd_sequence_from_csvs(
    csv_paths,                 # required: list of CSV file paths
    time_cols,                 # required: list of time column names
    id_col="id",               # optional: ID column name
    domain_state_labels=None   # optional: label mappings for each domain
)
Entry Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
| csv_paths | ✓ | list[str] | A list of file paths to CSV files, one for each domain. Each CSV should contain the same ID column and the same time columns. The number of rows (individuals) must be the same across all CSVs, and IDs must match across files. |
| time_cols | ✓ | list[str] | A list of column names representing time points. These columns must exist in all CSV files specified in csv_paths. The order in this list determines the order of time points in the resulting sequences. |
| id_col | ✗ | str | The name of the ID column used to align individuals across CSV files. Must exist in all CSV files. Default = "id". |
| domain_state_labels | ✗ | list[dict] or None | A list of dictionaries, one for each domain, that maps raw state values to human-readable labels. Each dictionary maps the state values found in that domain's CSV to their labels. If None, raw state values are used as labels. Default = None. |
What It Does
The function performs the following steps:
Loads CSV files: Reads each CSV file specified in csv_paths and validates that required columns exist.
Aligns data: Sorts all dataframes by the ID column to ensure individuals are aligned in the same order across domains.
Creates composite states: For each individual and each time point, combines states from all domains using the "+" separator. For example:
- If domain 1 has state "A" and domain 2 has state "B" at time T1, the composite state becomes "A+B"
- If using numeric codes like 0 and 1, the composite might be "0+1"
Identifies observed states: Keeps only the composite states that actually appear in your data. Unlike CAT, which creates all possible combinations, IDCD uses only the combinations that are observed.
Builds labels (optional): If domain_state_labels is provided, creates human-readable labels for each composite state. For example, "0+1" might become "Employed + Married".
Creates SequenceData object: Returns a SequenceData object containing:
- Combined sequences with composite states
- An alphabet consisting only of observed composite states
- Labels for these states (if provided)
- Individual IDs
Prints frequency table: Displays a table showing how frequently each observed composite state appears in your data.
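The steps above can be sketched with the standard library alone. This is an illustrative reconstruction under stated assumptions, not the library's implementation: the helper name `combine_domains` and the in-memory CSV text are hypothetical, and the real function returns a SequenceData object rather than a plain dictionary.

```python
import csv
import io
from collections import Counter

# Hypothetical in-memory stand-ins for two domain CSV files.
EMPLOYMENT_CSV = """id,Y1,Y2
2,Employed,Unemployed
1,Employed,Employed
"""
FAMILY_CSV = """id,Y1,Y2
1,Single,Married
2,Married,Married
"""


def combine_domains(csv_texts, time_cols, id_col="id", sep="+"):
    """Illustrative only: join per-domain states into composite states."""
    domains = []
    for text in csv_texts:
        rows = list(csv.DictReader(io.StringIO(text)))
        # Validate that the required columns exist in every file.
        missing = [c for c in [id_col, *time_cols] if c not in rows[0]]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        # Sort by ID so individuals line up across domains.
        domains.append(sorted(rows, key=lambda r: r[id_col]))

    if len({len(d) for d in domains}) != 1:
        raise ValueError("Files have different numbers of rows")

    sequences = {}
    for person in zip(*domains):
        ids = {row[id_col] for row in person}
        if len(ids) != 1:
            raise ValueError(f"ID mismatch across files: {sorted(ids)}")
        # One composite state per time point, e.g. "Employed+Single".
        sequences[person[0][id_col]] = [
            sep.join(row[t] for row in person) for t in time_cols
        ]
    return sequences


sequences = combine_domains([EMPLOYMENT_CSV, FAMILY_CSV], ["Y1", "Y2"])
# The alphabet holds only the composite states that were observed.
counts = Counter(state for seq in sequences.values() for state in seq)
alphabet = sorted(counts)
print(sequences)
print(alphabet)
```

Note that "Unemployed+Single" never occurs in this toy data, so it does not enter the alphabet.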
Examples
1. Basic usage with two domains
from sequenzo.multidomain.idcd import create_idcd_sequence_from_csvs
# Paths to your CSV files
csv_paths = [
    "employment_domain.csv",  # Domain 1: Employment
    "family_domain.csv"       # Domain 2: Family
]
# Time columns that exist in both CSVs
time_cols = ["Y1", "Y2", "Y3", "Y4"]
# Create IDCD sequence data
sequence_data = create_idcd_sequence_from_csvs(
    csv_paths=csv_paths,
    time_cols=time_cols,
    id_col="id"
)
Your CSV files should look like this:
employment_domain.csv:
id,Y1,Y2,Y3,Y4
1,Employed,Employed,Unemployed,Employed
2,Employed,Unemployed,Unemployed,Employed
3,Unemployed,Unemployed,Employed,Employed
family_domain.csv:
id,Y1,Y2,Y3,Y4
1,Single,Single,Married,Married
2,Married,Married,Married,Married
3,Single,Married,Married,Married
The resulting sequences will have composite states like "Employed+Single", "Employed+Married", "Unemployed+Single", "Unemployed+Married".
2. Using numeric states with labels
If your CSVs use numeric codes (0, 1, etc.) and you want readable labels:
# Define label mappings for each domain
domain_state_labels = [
    {0: "Employed", 1: "Unemployed"},  # Domain 1 labels
    {0: "Single", 1: "Married"}        # Domain 2 labels
]
sequence_data = create_idcd_sequence_from_csvs(
    csv_paths=csv_paths,
    time_cols=time_cols,
    id_col="id",
    domain_state_labels=domain_state_labels
)
Your CSV files would use numeric codes:
employment_domain.csv:
id,Y1,Y2,Y3,Y4
1,0,0,1,0
2,0,1,1,0
3,1,1,0,0
family_domain.csv:
id,Y1,Y2,Y3,Y4
1,0,0,1,1
2,1,1,1,1
3,0,1,1,1
The function will create composite states like "0+0", "0+1", "1+0", "1+1", but display them with labels like "Employed + Single", "Employed + Married", etc.
3. Three domains example
csv_paths = [
    "employment_domain.csv",
    "family_domain.csv",
    "education_domain.csv"
]
time_cols = ["Y1", "Y2", "Y3"]
domain_state_labels = [
    {0: "Employed", 1: "Unemployed"},
    {0: "Single", 1: "Married"},
    {0: "Low", 1: "High"}
]
sequence_data = create_idcd_sequence_from_csvs(
    csv_paths=csv_paths,
    time_cols=time_cols,
    id_col="id",
    domain_state_labels=domain_state_labels
)
This will create composite states like "0+0+0", "0+1+1", etc., with labels like "Employed + Single + Low", "Employed + Married + High", etc.
4. Understanding the output
The function prints a frequency table showing observed states:
[IDCD] Observed Combined States Frequency Table:
State               Label                 Frequency  Proportion (%)
Employed+Single     Employed + Single            15            12.5
Employed+Married    Employed + Married           45            37.5
Unemployed+Single   Unemployed + Single          10             8.3
Unemployed+Married  Unemployed + Married         50            41.7
This shows:
- Which composite states actually appear in your data
- How frequently each state appears
- The proportion of all observations accounted for by each state
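Each proportion in such a table is the state's count divided by the total number of person-time observations. A minimal sketch of that arithmetic follows, using hypothetical counts chosen to match the example table above; the variable names are illustrative and not part of the library.

```python
from collections import Counter

# Hypothetical counts, matching the example table above.
counts = Counter({
    "Employed+Single": 15,
    "Employed+Married": 45,
    "Unemployed+Single": 10,
    "Unemployed+Married": 50,
})

# Total number of (individual, time point) observations.
total = sum(counts.values())

# Percentage share of each composite state, rounded to one decimal.
proportions = {state: round(100 * n / total, 1) for state, n in counts.items()}

for state, n in counts.items():
    print(f"{state:<20} {n:>5} {proportions[state]:>6.1f}")
```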
5. Using the resulting SequenceData
After creating the IDCD sequence data, you can use it like any other SequenceData object:
# Create IDCD sequences
sequence_data = create_idcd_sequence_from_csvs(
    csv_paths=csv_paths,
    time_cols=time_cols
)
# Use with distance computation
from sequenzo.dissimilarity_measures import get_distance_matrix
distance_matrix = get_distance_matrix(
    seqdata=sequence_data,
    method="OM",
    sm="TRATE",
    indel="auto"
)
# Use with clustering
from sequenzo.clustering.hierarchical_clustering import Cluster
cluster_result = Cluster(matrix=distance_matrix, entity_ids=sequence_data.ids)
labels = cluster_result.get_cluster_labels(num_clusters=3)
Understanding IDCD vs Other Approaches
IDCD differs from CAT and DAT in important ways:
IDCD (Individual Domain Combined Data):
- Creates combined sequences upfront
- Uses only observed state combinations
- Simpler workflow: combine first, then analyze
- Good when you want to work directly with combined sequences
- Alphabet size depends on what combinations actually occur in your data
CAT (Combined Alphabet Technique):
- Keeps domains separate until distance computation
- Creates all possible state combinations for cost computation
- More complex: computes costs on composite states
- Better when you want explicit control over substitution costs
- Alphabet size is the product of domain alphabet sizes
DAT (Distance Additive Trick):
- Keeps domains separate throughout
- Adds domain-specific distance matrices
- Simplest: no composite states needed
- Most efficient computationally
- Doesn't consider cross-domain state combinations explicitly
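To make the contrast concrete, the sketch below compares a DAT-style additive distance with a distance computed on IDCD-style composite states. It is only an illustration: it uses a simple Hamming distance (count of differing positions) instead of the library's dissimilarity measures, and the data and names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Two individuals, two domains, three time points (hypothetical data).
work = {"p1": ["E", "E", "U"], "p2": ["E", "U", "U"]}
family = {"p1": ["S", "S", "M"], "p2": ["S", "M", "M"]}

# DAT-style: compute one distance per domain, then add them.
dat_distance = hamming(work["p1"], work["p2"]) + hamming(family["p1"], family["p2"])

# IDCD-style: build composite states first, then compute a single distance.
composite = {
    pid: [f"{w}+{f}" for w, f in zip(work[pid], family[pid])] for pid in work
}
idcd_distance = hamming(composite["p1"], composite["p2"])

print(dat_distance, idcd_distance)
```

In this toy case both domains differ at the same time point, so the additive distance counts two mismatches while the composite-state distance counts one.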
When to use IDCD:
- Your data comes from separate CSV files that need to be combined
- You want to work with combined sequences directly
- You prefer a simpler workflow without complex cost calculations
- You only care about observed combinations, not all possible ones
Important Notes
File requirements: All CSV files must have:
- The same ID column (specified by id_col)
- The same time columns (specified by time_cols)
- The same number of rows (individuals)
- Matching IDs across files (individuals must be in the same order after sorting)
State separator: The function uses "+" to join states from different domains. Make sure this character doesn't appear in your actual state values; otherwise composite states become ambiguous and parsing might fail.
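A quick pre-flight check can catch separator conflicts before the function is called. The helper below is a hypothetical sketch, not part of the library; it assumes you have already collected the distinct state values of each domain.

```python
def find_separator_conflicts(domain_states, sep="+"):
    """Return (domain_index, state) pairs whose state contains the separator."""
    return [
        (i, state)
        for i, states in enumerate(domain_states)
        for state in states
        if sep in str(state)
    ]


domain_states = [
    ["Employed", "Unemployed"],
    ["Single", "Married+Kids"],  # hypothetical value containing "+"
]
conflicts = find_separator_conflicts(domain_states)
print(conflicts)
```

Any state reported here would need to be renamed (for example to "MarriedWithKids") before combining domains.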
Observed states only: Unlike other approaches that might consider all possible combinations, IDCD only uses combinations that actually appear in your data. This means:
- If no individual has state combination "A+B+C", it won't be in the alphabet
- Your alphabet size may be smaller than the product of domain alphabet sizes
- This can be more efficient but might miss theoretically possible transitions
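The gap between the observed alphabet and the full product of domain alphabets can be checked directly. The sketch below uses hypothetical domain alphabets and a hypothetical set of observed composite states.

```python
from itertools import product

# Hypothetical domain alphabets.
employment = ["Employed", "Unemployed"]
family = ["Single", "Married"]
education = ["Low", "High"]

# All theoretically possible composite states (product of domain alphabets).
possible = {"+".join(combo) for combo in product(employment, family, education)}

# Hypothetical composite states actually observed in the data.
observed = {
    "Employed+Single+Low",
    "Employed+Married+High",
    "Unemployed+Married+Low",
}

# Combinations that are possible in principle but absent from the data.
unobserved = possible - observed
print(len(possible), len(observed), len(unobserved))
```

Here the full product has 2 × 2 × 2 = 8 states, while an IDCD alphabet built from these observations would contain only 3.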
Label mapping: When providing domain_state_labels:
- The list should have one dictionary per domain, in the same order as csv_paths
- Each dictionary maps raw values (as they appear in the CSV) to labels
- If a state value isn't in the dictionary, the raw value will be used as the label
- Labels are only used for display; the actual states in sequences remain as composite strings
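The fallback rule described above (use the raw value when no label is defined) can be sketched as follows. The helper `composite_label` is hypothetical, and the sketch assumes the raw values have the same type as the dictionary keys; how the library itself matches keys is not confirmed here.

```python
def composite_label(raw_states, domain_state_labels):
    """Build a display label from one raw value per domain.

    raw_states must be in the same order as domain_state_labels.
    Values without a label fall back to the raw value itself.
    """
    return " + ".join(
        str(mapping.get(raw, raw))
        for raw, mapping in zip(raw_states, domain_state_labels)
    )


domain_state_labels = [
    {0: "Employed", 1: "Unemployed"},  # Domain 1 labels
    {0: "Single", 1: "Married"},       # Domain 2 labels
]

labelled = composite_label((0, 1), domain_state_labels)
fallback = composite_label((2, 1), domain_state_labels)  # 2 has no label
print(labelled)
print(fallback)
```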
Data alignment: The function sorts all dataframes by ID before combining. Make sure IDs match across files. If an individual appears in one file but not another, you'll get a mismatch error.
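Checking the ID sets yourself before calling the function makes a mismatch easier to diagnose. The helper below is a hypothetical sketch that reports, for each file, which IDs are missing relative to the union of all files.

```python
def id_mismatches(id_lists):
    """Map file index -> sorted IDs missing from that file.

    Files with no missing IDs are omitted, so an empty dict means
    every file contains exactly the same set of IDs.
    """
    all_ids = set().union(*id_lists)
    report = {}
    for i, ids in enumerate(id_lists):
        missing = all_ids - set(ids)
        if missing:
            report[i] = sorted(missing)
    return report


employment_ids = [1, 2, 3]  # hypothetical IDs from the first CSV
family_ids = [1, 2, 4]      # hypothetical IDs from the second CSV
problems = id_mismatches([employment_ids, family_ids])
print(problems)
```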
Missing values: Handle missing values in your CSV files before using this function. Missing values might be converted to strings like "nan" in composite states, which may not be what you want.
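The sketch below shows how a naive string conversion turns a missing value into "nan", and one possible way to recode gaps beforehand. The marker name "Missing" and the helper `clean_state` are hypothetical; whether missing values should become their own state or be handled differently is an analytic choice.

```python
import math

MISSING = "Missing"  # hypothetical explicit state for gaps


def clean_state(value):
    """Replace None, empty strings, and NaN with an explicit marker."""
    if value is None or value == "":
        return MISSING
    if isinstance(value, float) and math.isnan(value):
        return MISSING
    return value


# Naive conversion: NaN silently becomes the string "nan".
naive = "+".join(str(v) for v in ["Employed", float("nan")])

# Recoding first keeps the composite state interpretable.
row = ["Employed", float("nan"), "", "Unemployed"]
cleaned = [clean_state(v) for v in row]
print(naive)
print(cleaned)
```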
Common Use Cases
Multiple survey waves: When you have separate files for different domains measured across the same time periods.
Data from different sources: When employment, family, and education data come from different databases that need to be merged.
Preliminary exploration: When you want to quickly see what combined state combinations occur in your data before doing more complex multidomain analysis.
Simplified workflow: When you prefer combining domains first and then using standard sequence analysis tools, rather than using specialized multidomain distance computation methods.
Author
Code: Yuqi Liang
Documentation: Yuqi Liang