
Pairfam Family Trajectories Dataset

This dataset contains family formation trajectories for 1,027 individuals in Germany. It is derived from the German Family Panel (pairfam, Release 14.2) and was pre-processed by the authors of Sequence Analysis (Raab & Struffolino, 2022). It provides ready-to-use family formation trajectories for teaching and learning sequence analysis.

We provide two versions of the dataset:

  • Year-level data: 22 yearly observations with state abbreviations
  • Month-level data: 264 monthly observations (ages 18 to 40) with numeric state codes

Important Notes

  • The IDs are different between year-level and month-level data and cannot be directly linked.
  • State encoding differs: Year-level uses text abbreviations (e.g., "S", "LAT"), while month-level uses numeric codes (1–9).
  • The underlying state definitions remain the same across both versions.

Data origin and processing

  • Source: pairfam, a large-scale longitudinal survey on partnership and family dynamics in Germany.

  • Processing by book authors:

    1. Partnership status (single, LAT, cohabiting, married) was combined with parental status (number of children).
    2. For non-married statuses, only the distinction between "with children" vs. "without children" was kept.
    3. For married statuses, an additional distinction between one child vs. two or more children was made.
    4. Rare combinations (e.g., single with 2+ children) were collapsed into the broader "with children" category.
  • Our preprocessing: To make the data more convenient to use, we applied one minor preprocessing step: the columns state1 ... state264 were renamed to 1 ... 264 before the data was added to our prepared dataset.

    The renaming is done by clean_time_columns_auto(). It scans a DataFrame for column names that contain numbers (e.g., state1, wave2, year2023) and reduces each of those names to the number alone (1, 2, 2023). This is particularly useful for time-series or panel data, where it quickly standardizes the names of columns that represent different points in time.

    For more details on how we cleaned and prepared the data, see the data cleaning code repository.

  • Result: A simplified 9-state alphabet.
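
The column renaming described above can be illustrated with a minimal sketch. This is not the implementation of clean_time_columns_auto(); the helper name strip_prefix_to_number and its prefix argument are hypothetical, and the sketch only renames columns that consist of a known prefix followed by digits.

```python
import re

import pandas as pd


def strip_prefix_to_number(columns, prefix="state"):
    """Hypothetical helper: rename 'state12' to '12'; leave every other name unchanged."""
    pattern = re.compile(rf"{re.escape(prefix)}(\d+)")
    renamed = []
    for name in columns:
        match = pattern.fullmatch(str(name))
        renamed.append(match.group(1) if match else name)
    return renamed


# Toy frame that mimics the raw layout (one row, three months).
df = pd.DataFrame(
    [[111000, 0.344, 5, 5, 5]],
    columns=["id", "weight40", "state1", "state2", "state3"],
)
df.columns = strip_prefix_to_number(df.columns)
print(list(df.columns))
```

Matching on an explicit prefix is a design choice of this sketch: covariates such as weight40 also contain digits, and a prefix match leaves them untouched.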

Family states encoding

| Numeric Code | Abbreviation | Description |
|---|---|---|
| 1 | S | Single, no child |
| 2 | LAT | LAT, no child |
| 3 | COH | Cohabiting, no child |
| 4 | MAR | Married, no child |
| 5 | Sc | Single, with child[ren] |
| 6 | LATc | LAT, with child[ren] |
| 7 | COHc | Cohabiting, with child[ren] |
| 8 | MARc1 | Married, 1 child |
| 9 | MARc2+ | Married, 2+ children |
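
Because the two versions of the data use different encodings, it can help to translate the numeric codes into the abbreviations. The sketch below writes the encoding table as a Python dictionary and applies it to a small made-up frame; the names STATE_LABELS and monthly are ours and not part of any library.

```python
import pandas as pd

# Encoding table from this page: numeric code -> abbreviation.
STATE_LABELS = {
    1: "S",
    2: "LAT",
    3: "COH",
    4: "MAR",
    5: "Sc",
    6: "LATc",
    7: "COHc",
    8: "MARc1",
    9: "MARc2+",
}

# Made-up example: two individuals, three monthly columns.
monthly = pd.DataFrame({"1": [5, 1], "2": [5, 1], "3": [5, 2]})

# Map every state column from numeric codes to text abbreviations.
labelled = monthly.apply(lambda col: col.map(STATE_LABELS))
print(labelled)
```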

Year-level data

File: pairfam_family_by_year.csv

This dataset contains 1,029 individuals observed over 22 years. The states are encoded directly as text abbreviations (e.g., "S", "LAT", "COH", "MAR", "Sc", "LATc", "COHc", "MARc1", "MARc2+").

No Covariates

Unlike the month-level data, the year-level data does not include any covariates. The id column contains randomly generated identifiers created during our preprocessing and cannot be linked to other datasets or the month-level data.

Structure

| Column | Description |
|---|---|
| id | Randomly generated integer identifier (e.g., 194, 896, 284, ...); cannot be linked to other datasets |
| 1 ... 22 | Yearly family trajectory states, encoded as abbreviations (e.g., "S", "LAT") |

Sample data

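| id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 194 | COH | MARc1 | LATc | LATc | LATc | LATc | Sc | Sc | Sc | Sc |
| 896 | S | S | S | S | S | S | S | LAT | LAT | LAT |
| 284 | S | S | LAT | LAT | S | S | S | LAT | S | S |
| 886 | LAT | S | LAT | LAT | S | S | LAT | LAT | LAT | LAT |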
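
As a quick illustration of working with this wide layout, the sketch below re-types the four sample rows (first ten years only) as CSV text and counts how many years each individual spends in each state. It assumes the real file has the same header layout (id followed by the year columns); check the file before relying on that.

```python
import io

import pandas as pd

# The four sample rows above, first ten yearly columns only.
sample_csv = """id,1,2,3,4,5,6,7,8,9,10
194,COH,MARc1,LATc,LATc,LATc,LATc,Sc,Sc,Sc,Sc
896,S,S,S,S,S,S,S,LAT,LAT,LAT
284,S,S,LAT,LAT,S,S,S,LAT,S,S
886,LAT,S,LAT,LAT,S,S,LAT,LAT,LAT,LAT
"""

df = pd.read_csv(io.StringIO(sample_csv))
time_cols = [c for c in df.columns if c != "id"]

# Reshape from wide to long, then count years per (id, state).
long_format = df.melt(
    id_vars="id", value_vars=time_cols, var_name="year", value_name="state"
)
years_in_state = (
    long_format.groupby(["id", "state"]).size().unstack(fill_value=0)
)
print(years_in_state)
```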

Month-level data

File: pairfam_family_by_month.csv

This dataset contains 1,027 individuals observed monthly from ages 18 to 40 (264 months). The states are encoded as numeric codes (1–9) according to the encoding table above.

Structure

Besides the state sequences, the dataset includes several covariates:

| Column | Description |
|---|---|
| id | Individual identifier (original pairfam IDs, e.g., 111000, 2931000) |
| weight40 | Survey weight at age 40 (design weight) |
| sex | Sex (1 = male, 0 = female) |
| doby_gen | Year of birth (generation year) |
| dob | Month-year of birth (numerical encoding) |
| ethni | Ethnicity indicator |
| migstatus | Migration background status |
| yeduc | Years of education |
| sat1i4, sat5, sat6 | Selected satisfaction indicators from the survey |
| highschool | High school graduation status |
| church | Church attendance indicator |
| biosib | Number of biological siblings |
| stepsib | Number of step-siblings |
| east | Region indicator (East vs. West Germany) |
| famstructure18 | Family structure at age 18 |
| 1 ... 264 | Monthly family trajectory states, coded 1–9 as above |

Sample data

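| id | weight40 | sex | doby_gen | dob | ethni | migstatus | yeduc | highschool | church | biosib | east | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111000 | 0.344 | 1 | 1971 | 855 | 1 | 1 | 11.5 | 0 | 0 | 1 | 1 | 5 | 5 | 5 | 5 | 5 |
| 2931000 | 1.767 | 0 | 1973 | 881 | 5 | 3 | 10.5 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 3491000 | 0.727 | 1 | 1971 | 857 | 1 | 1 | 18.0 | 1 | 1 | 3 | 0 | 1 | 1 | 1 | 1 | 1 |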

Here columns 1 to 5 show the first five months of the trajectory, coded as 1–9 according to the state table above.
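
Since the month-level file mixes covariates and state columns, a common first step is to separate the two. The sketch below re-types the three sample rows as CSV text (the column split is our reading of the sample table), selects the state columns by their purely numeric names, and, as one possible use of weight40, computes the weighted share of each state in month 1. The variable names are ours.

```python
import io

import pandas as pd

# The three sample rows, with only the first five monthly columns.
sample_csv = """id,weight40,sex,doby_gen,dob,ethni,migstatus,yeduc,highschool,church,biosib,east,1,2,3,4,5
111000,0.344,1,1971,855,1,1,11.5,0,0,1,1,5,5,5,5,5
2931000,1.767,0,1973,881,5,3,10.5,0,1,1,0,1,1,1,1,1
3491000,0.727,1,1971,857,1,1,18.0,1,1,3,0,1,1,1,1,1
"""

df = pd.read_csv(io.StringIO(sample_csv))

# State columns are the ones whose names are plain numbers ("1", "2", ...).
state_cols = [c for c in df.columns if c.isdigit()]
covariate_cols = [c for c in df.columns if c not in state_cols and c != "id"]

sequences = df[["id"] + state_cols]
covariates = df[["id"] + covariate_cols]

# Weighted share of each state in month 1, using the survey weight.
weighted_share = df.groupby("1")["weight40"].sum() / df["weight40"].sum()
print(weighted_share)
```

Names such as weight40 and famstructure18 contain digits but are not purely numeric, so str.isdigit() does not mistake them for state columns.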


Multichannel data (reference only)

File: MultiChannel.csv

This file combines both family and activity trajectories in a single dataset, with columns prefixed by family and activity respectively. This is useful for multichannel sequence analysis.

Note

MultiChannel.csv is not supported by load_dataset() and is provided for reference only. You can download it manually from the month-level data sources repository.

Structure

| Column | Description |
|---|---|
| id | Individual identifier (original pairfam IDs) |
| Covariates | Same as month-level data above |
| family1 ... family264 | Monthly family trajectory states (numeric codes 1–9) |
| activity1 ... activity264 | Monthly activity trajectory states (numeric codes 1–8) |
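
For multichannel analysis the two channels usually need to be separated into one table each. The sketch below shows one way to split the columns by prefix. The helper split_channel is hypothetical, and the toy frame holds placeholder values invented for illustration, not rows from MultiChannel.csv.

```python
import pandas as pd


def split_channel(df, prefix, id_col="id"):
    """Hypothetical helper: keep id plus the '<prefix><number>' columns, renamed to the number."""
    cols = [
        c for c in df.columns
        if c.startswith(prefix) and c[len(prefix):].isdigit()
    ]
    channel = df[cols].copy()
    channel.columns = [c[len(prefix):] for c in cols]
    channel.insert(0, id_col, df[id_col])
    return channel


# Placeholder data: two individuals, two months per channel.
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "family1": [1, 2],
        "family2": [1, 3],
        "activity1": [4, 4],
        "activity2": [4, 5],
    }
)

family = split_channel(toy, "family")
activity = split_channel(toy, "activity")
print(family)
print(activity)
```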

Reference

Raab, M., & Struffolino, E. (2022). Sequence analysis (Vol. 190). Sage Publications.

Author: Yuqi Liang