Pairfam Family Trajectories Dataset

This dataset contains the family formation trajectories of 1,866 individuals in Germany, observed monthly from age 18 to age 40 (264 months). It is derived from the German Family Panel (pairfam, Release 14.2) and was pre-processed by the authors of Sequence Analysis (Raab & Struffolino, 2022). It is intended for teaching and learning sequence analysis, providing ready-to-use monthly trajectories.

Data origin and processing

  • Source: pairfam, a large-scale longitudinal survey on partnership and family dynamics in Germany.

  • Processing by book authors:

    1. Partnership status (single, LAT, cohabiting, married) was combined with parental status (number of children).
    2. For non-married statuses, only the distinction between “with children” vs. “without children” was kept.
    3. For married statuses, an additional distinction between one child vs. two or more children was made.
    4. Rare combinations (e.g., single with 2+ children) were collapsed into the broader “with children” category.
  • Result: A simplified 9-state alphabet, recoded numerically (1–9) in pairfam_family.csv.
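The recoding rules above can be sketched as a small function. This is a minimal illustration, not the book authors' code: the function name `combine_state` and its input labels ("single", "lat", "cohabiting", "married") are hypothetical. Only the rules and the numeric codes listed in this document are assumed.

```python
def combine_state(partnership: str, n_children: int) -> int:
    """Map a partnership status and a number of children to a code 1-9.

    Hypothetical helper: the input labels are assumptions for illustration.
    """
    # Codes for the childless version of each partnership status
    base = {"single": 1, "lat": 2, "cohabiting": 3, "married": 4}
    if partnership not in base:
        raise ValueError(f"Unknown partnership status: {partnership!r}")
    if n_children < 0:
        raise ValueError("n_children must be non-negative")

    if n_children == 0:
        return base[partnership]

    if partnership == "married":
        # Married: distinguish one child (8) from two or more (9)
        return 8 if n_children == 1 else 9

    # Non-married: any number of children collapses into "with children"
    return base[partnership] + 4  # 5, 6 or 7


print(combine_state("single", 2))   # single with children
print(combine_state("married", 3))  # married, 2+ children
```

Keeping the rules in one function makes the collapsing of rare combinations explicit: for non-married statuses, the number of children only matters as zero vs. non-zero.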

Family states (numeric coding)

| Numeric code | Abbreviation | Description |
| --- | --- | --- |
| 1 | S | Single, no child |
| 2 | LAT | LAT, no child |
| 3 | COH | Cohabiting, no child |
| 4 | MAR | Married, no child |
| 5 | Sc | Single, with child[ren] |
| 6 | LATc | LAT, with child[ren] |
| 7 | COHc | Cohabiting, with child[ren] |
| 8 | MARc1 | Married, 1 child |
| 9 | MARc2+ | Married, 2+ children |
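For plots or printed output, the numeric codes are often easier to read as abbreviations. The following sketch decodes them with a plain dictionary. The name `STATE_ABBR` and the toy DataFrame are assumptions for illustration; the code-to-abbreviation pairs are taken from the table above.

```python
import pandas as pd

# Numeric code -> abbreviation, as listed in the state table
STATE_ABBR = {
    1: "S", 2: "LAT", 3: "COH", 4: "MAR",
    5: "Sc", 6: "LATc", 7: "COHc", 8: "MARc1", 9: "MARc2+",
}

# Toy data: three individuals, two months (made-up values)
toy = pd.DataFrame({"state1": [5, 1, 3], "state2": [5, 1, 8]})

# Map every state column from numeric codes to abbreviations
decoded = toy.apply(lambda col: col.map(STATE_ABBR))

print(decoded)
```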

Other columns

Besides the state sequences, the dataset includes several other variables:

| Column | Description |
| --- | --- |
| id | Individual identifier |
| weight40 | Survey weight at age 40 (design weight) |
| sex | Sex (1 = male, 0 = female) |
| doby_gen | Year of birth (generation year) |
| dob | Month-year of birth (numerical encoding) |
| ethni | Ethnicity indicator |
| migstatus | Migration background status |
| yeduc | Years of education |
| sat1i4, sat5, sat6 | Selected satisfaction indicators from the survey |
| highschool | High school graduation status |
| church | Church attendance indicator |
| biosib | Number of biological siblings |
| stepsib | Number of step-siblings |
| east | Region indicator (East vs. West Germany) |
| famstructure18 | Family structure at age 18 |
| state1 … state264 | Monthly family trajectory states from age 18 to age 40, coded 1–9 as above |
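Because covariates and monthly states share one wide table, a common first step is to split them. The sketch below does this with a regular expression on the column names. The toy values are made up, and only the column names follow this document.

```python
import pandas as pd

# Toy frame with made-up values; only the column names follow this document
toy = pd.DataFrame({
    "id": [1, 2],
    "sex": [1, 0],
    "sat1i4": [7, 8],
    "weight40": [1.0, 1.0],
    "state1": [1, 3],
    "state2": [1, 3],
    "state3": [2, 7],
})

# Select only columns named exactly "state<number>"; the anchors ^ and $
# keep covariates such as sat1i4 or weight40 out of the selection
states = toy.filter(regex=r"^state\d+$")
covariates = toy.drop(columns=states.columns)

print(list(states.columns))
print(list(covariates.columns))
```

On the full dataset, the same selection should yield 264 state columns, one per month.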

Sample data

Below is a small extract of the dataset:

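| id | weight40 | sex | doby_gen | dob | ethni | migstatus | yeduc | highschool | church | biosib | east | state1 | state2 | state3 | state4 | state5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 111000 | 0.344 | 1 | 1971 | 855 | 1 | 1 | 11.5 | 0 | 0 | 1 | 1 | 5 | 5 | 5 | 5 | 5 |
| 1624000 | 1.467 | 1 | 1973 | 880 | 1 | 1 | 11.5 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2767000 | 0.464 | 1 | 1971 | 853 | 1 | 1 | 9.0 | 0 | 0 | 3 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2931000 | 1.767 | 0 | 1973 | 881 | 5 | 3 | 10.5 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 3167000 | 0.885 | 1 | 1973 | 883 | 1 | 1 | 11.5 | 0 | 0 | 1 | 0 | 3 | 3 | 3 | 3 | 3 |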

Here, state1 … state5 show the first five months of the trajectory, coded 1–9 according to the state table above.

Data preprocessing

To make the data more convenient to use, we performed one minor preprocessing step before adding it to our prepared dataset: the columns state1 ... state264 were renamed to 1 ... 264.

The preprocessing function is clean_time_columns_auto(). It scans a DataFrame for columns whose names contain digits (e.g., state1, wave2, year2023) and renames each one to the first number in its name (1, 2, 2023). This is particularly useful for time-series or panel data, where it quickly standardizes the names of columns that represent different points in time.

Related parameters:

  • df: The DataFrame you want to process.
  • protect: A list of protected column names. Names listed here (for instance, 'id', 'sex', 'age') are never renamed, even if they contain digits (e.g., 'weight40', 'sat1i4').
  • min_time and max_time (optional): A time range for filtering. Only columns whose number falls within this range are renamed; the others are left unchanged.
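To see why `protect` matters, it helps to look at what the digit-matching step does to individual names. The sketch below applies the same pattern, `(\d+)`, that clean_time_columns_auto() uses; the helper name `first_number` is hypothetical and exists only for this illustration.

```python
import re


def first_number(name: str):
    """Return the first run of digits in `name` as a normalized string.

    Returns None when the name contains no digits.
    """
    m = re.search(r"(\d+)", name)
    # int() strips leading zeros, so "02" becomes "2"
    return str(int(m.group(1))) if m else None


examples = ["state1", "wave02", "year2023", "sat1i4", "weight40", "sex"]
labels = {name: first_number(name) for name in examples}

for name, label in labels.items():
    print(f"{name!r} -> {label!r}")
```

Without protection, sat1i4 would map to "1" and weight40 to "40", the same labels as state1 and state40. Listing such covariates in `protect` avoids these collisions.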

Here are the detailed steps. You can also refer to the tutorial of clean_time_columns.

```python
# Import dependencies
import re
import pandas as pd

# Load the data and preview it
df = pd.read_csv('D:\\sequenzo\\family.csv')

df
```
```python
# List all column names
columns_name_list = df.columns.to_list()

columns_name_list
```
```python
def clean_time_columns_auto(
    df: pd.DataFrame,
    protect=(
        'id', 'weight40', 'sex', 'doby_gen', 'dob', 'ethni', 'migstatus',
        'yeduc', 'sat1i4', 'sat5', 'sat6', 'highschool', 'church',
        'biosib', 'stepsib', 'east', 'famstructure18',
    ),                                   # Keep these column names as they are
    min_time=1, max_time=None            # Time range of the columns to rename
) -> tuple:
    """Rename columns containing digits to the first number in their name.

    Returns the renamed DataFrame and the {old_name: new_name} mapping.
    """
    rename_map = {}
    for c in df.columns:
        if c in protect:
            continue

        m = re.search(r"(\d+)", str(c))
        if not m:
            # No digits found: skip renaming
            continue

        t = int(m.group(1))              # Standardization: "01" --> 1

        # Optional constraints: skip columns outside [min_time, max_time].
        # Each bound is checked on its own, so min_time also applies
        # when max_time is None.
        if min_time is not None and t < min_time:
            continue
        if max_time is not None and t > max_time:
            continue

        rename_map[c] = str(t)

    # Defensive measure: avoid duplicate column names
    if len(set(rename_map.values())) != len(rename_map):
        raise ValueError(
            f"Name collision detected: {rename_map}. "
            f"Please adjust regex or time range."
        )

    return df.rename(columns=rename_map).copy(), rename_map
```
```python
df_clean, rename_map = clean_time_columns_auto(
    df,
    protect=(
        'id', 'weight40', 'sex', 'doby_gen', 'dob', 'ethni', 'migstatus',
        'yeduc', 'sat1i4', 'sat5', 'sat6', 'highschool', 'church',
        'biosib', 'stepsib', 'east', 'famstructure18',
    ),
)

print(df_clean.head())
```

If you need to record the "old column name -> new column name" mapping (for logging or future reproducibility), you can save it as follows:

```python
old_to_new = rename_map.copy()

# Save to a local file
pd.Series(old_to_new).to_csv("time_col_rename_map.csv", header=["new_name"])

print("\nColumn name mapping has been successfully saved to 'time_col_rename_map.csv'!")
```
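A saved mapping can later be read back to restore the original column names. The sketch below is a minimal round trip under stated assumptions: the mapping and DataFrame are toy stand-ins, and a temporary directory replaces the working directory so the example leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the real mapping and the cleaned data
rename_map = {"state1": "1", "state2": "2"}
df_clean = pd.DataFrame({"id": [111000], "1": [5], "2": [5]})

# Write the mapping the same way as above, into a temporary directory
path = os.path.join(tempfile.mkdtemp(), "time_col_rename_map.csv")
pd.Series(rename_map).to_csv(path, header=["new_name"])

# index_col=0: the old names were written as the index.
# dtype=str: keeps the new names as strings ("1"), not integers (1),
# so they match the string column labels of df_clean.
loaded = pd.read_csv(path, index_col=0, dtype=str)["new_name"]

# Invert the mapping and undo the renaming
new_to_old = {new: old for old, new in loaded.items()}
df_restored = df_clean.rename(columns=new_to_old)

print(list(df_restored.columns))
```

Reading with `dtype=str` is the detail most easily missed: without it, pandas parses the new names as integers and the inverse rename silently matches nothing.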

Below is a small extract of the dataset after preprocessing:

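| id | weight40 | sex | doby_gen | dob | ethni | migstatus | yeduc | highschool | church | biosib | east | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 111000 | 0.344 | 1 | 1971 | 855 | 1 | 1 | 11.5 | 0 | 0 | 1 | 1 | 5 | 5 | 5 | 5 | 5 |
| 1624000 | 1.467 | 1 | 1973 | 880 | 1 | 1 | 11.5 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2767000 | 0.464 | 1 | 1971 | 853 | 1 | 1 | 9.0 | 0 | 0 | 3 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2931000 | 1.767 | 0 | 1973 | 881 | 5 | 3 | 10.5 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 3167000 | 0.885 | 1 | 1973 | 883 | 1 | 1 | 11.5 | 0 | 0 | 1 | 0 | 3 | 3 | 3 | 3 | 3 |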
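With purely numeric column labels, the time columns are easy to select and order. As a rough sketch, the code below rebuilds the id and state columns of the five-row extract by hand and counts how many individuals are in each state in each month. The variable names are illustrative, and the covariates are omitted for brevity.

```python
import pandas as pd

# id and state columns of the five-row extract, typed in by hand
df_clean = pd.DataFrame({
    "id": [111000, 1624000, 2767000, 2931000, 3167000],
    "1": [5, 1, 1, 1, 3],
    "2": [5, 1, 1, 1, 3],
    "3": [5, 1, 1, 1, 3],
    "4": [5, 1, 1, 1, 3],
    "5": [5, 1, 1, 1, 3],
})

# Time columns are those whose label consists of digits only;
# sort them numerically so that "10" would come after "9"
time_cols = sorted(
    (c for c in df_clean.columns if str(c).isdigit()), key=int
)

# Rows: state codes; columns: months; cells: number of individuals
counts = (
    df_clean[time_cols]
    .apply(lambda col: col.value_counts())
    .fillna(0)
    .astype(int)
    .sort_index()
)

print(counts)
```

On the full dataset, the same code would produce one column per month, which is the raw material for a state distribution plot.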

Reference

Raab, M., & Struffolino, E. (2022). Sequence analysis (Vol. 190). Sage Publications.

Author: Yuqi Liang

Released under the BSD-3-Clause License.