Skip to content

get_kob_decomposition()

get_kob_decomposition() performs a twofold Kitagawa–Oaxaca–Blinder (KOB) decomposition of a mean outcome gap between two groups. It splits the gap into a composition (explained) part and coefficient (unexplained returns) part, plus an intercept component.

Positive total_gap means group0 has a higher mean outcome than group1. Control the mapping with group0_value and group1_value.

Function Usage

python
get_kob_decomposition(
    y,
    group,
    X,
    variable_names=None,
    term_ids=None,
    reference="group0",
    majority_owner=None,
    coefficient_owner_by_column=None,
    group0_value=None,
    group1_value=None,
    normalize_categorical=False,
    categorical_terms=None,
    category_ids=None,
    n_categories_by_term=None,
    owner_by_category_by_term=None,
    drop_missing=False,
)

Reference Coefficients

referenceMeaning
"group0"Use group 0 coefficients when owner is -1
"group1"Use group 1 coefficients when owner is -1
"pooled"Use coefficients from pooled OLS without a group indicator when owner is -1

These choices correspond to common twofold decomposition references (Jann, 2008). SA–KOB uses them as fallbacks for non-cluster controls and unspecified owners, and adds cluster-specific owners for Rowold et al.'s option III via owner_by_category_by_term. For manual category-level references in generic KOB, set normalize_categorical=True and pass owner_by_category_by_term; see get_sa_kob_decomposition().

Entry Parameters

ParameterRequiredTypeDescription
ynp.ndarray1D continuous outcome.
groupnp.ndarrayBinary grouping variable (exactly two values).
Xnp.ndarray2D design matrix, one row per observation.
variable_nameslist[str]Column labels for reporting. Default: X1, X2, …
term_idsarray-likeGroup columns into terms (e.g., all dummies of one factor share an id). Default: one term per column.
referencestr"group0", "group1", or "pooled". Default: "group0".
coefficient_owner_by_columnarray-likePer-column owner: 0, 1, or -1 (use reference).
majority_ownerarray-likeDeprecated. Per-term owners expanded to columns.
group0_valueanyLabel treated as group 0.
group1_valueanyLabel treated as group 1.
normalize_categoricalboolYun-normalized category contributions. Default: False.
categorical_termslist[int]Which term_id values are categorical. If omitted and normalize_categorical=True, Sequenzo treats terms with more than one column as categorical; pass this explicitly when using non-dummy multi-column terms.
category_idsarray-likeInternal category id per column.
n_categories_by_termdictTotal categories per term (including omitted baseline).
owner_by_category_by_termdictPer-category coefficient owners for categorical terms.
drop_missingboolDrop rows with non-finite y or X. Default: False.

What It Returns

A KOBDecompositionResult:

FieldTypeDescription
total_gapfloatgroup0_mean - group1_mean
explainedfloatComposition effect
unexplained_returnsfloatReturns / coefficient effect
unexplained_interceptfloatIntercept difference
by_columnpd.DataFrameExplained and returns per column
by_termpd.DataFrameAggregated contributions by term_id
by_categorypd.DataFrameCategory detail when normalize_categorical=True
group0_mean, group1_meanfloatOutcome means
group0_label, group1_labelanyResolved labels
gap_directionstre.g. "men minus women"
diagnosticsdictOLS diagnostics and normalization notes

by_variable is an alias for by_column.

Examples

Step 1: Prepare data

python
from sequenzo import load_dataset

df = load_dataset("mvad")
y = df["funemp"].to_numpy()  # example numeric outcome
group = df["male"].map({0: "women", 1: "men"}).to_numpy()
X = df[["fmpr", "emp97"]].to_numpy()

Step 2: Fit KOB decomposition

python
from sequenzo.decomposition import get_kob_decomposition

result = get_kob_decomposition(
    y=y,
    group=group,
    X=X,
    variable_names=["fmpr", "emp97"],
    group0_value="men",
    group1_value="women",
    reference="group0",
)

Step 3: Read components

python
print(result.gap_direction)
print(f"Total gap:    {result.total_gap:.4f}")
print(f"Explained:    {result.explained:.4f}")
print(f"Returns:      {result.unexplained_returns:.4f}")
print(f"Intercept:    {result.unexplained_intercept:.4f}")
print(result.by_column)

Step 4 (optional): Categorical dummies with normalization

python
import pandas as pd

dummies = pd.get_dummies(df["emp97"], prefix="emp", drop_first=True)
X_cat = dummies.to_numpy()
term_ids = [0] * X_cat.shape[1]

result_cat = get_kob_decomposition(
    y=y,
    group=group,
    X=X_cat,
    variable_names=list(dummies.columns),
    term_ids=term_ids,
    normalize_categorical=True,
    categorical_terms=[0],
    n_categories_by_term={0: df["emp97"].nunique()},
    group0_value="men",
    group1_value="women",
)
print(result_cat.by_category)

R / Stata Counterparts

  • Closest R approach: oaxaca package (twofold decomposition)
  • Mapping note: Sequenzo follows the standard twofold decomposition setup discussed by Jann (2008), using separate group-specific OLS fits and an explicit reference-coefficient vector.

Notes

  • group must have exactly two distinct values.
  • Provide both group0_value and group1_value, or neither.
  • by_category is empty unless normalize_categorical=True.
  • Scalar totals keep the raw twofold identity; normalized category tables may sum differently in mixed-reference settings.
  • With normalize_categorical=True and omitted categorical_terms, Sequenzo auto-detects multi-column terms and emits a warning; pass categorical_terms explicitly for non-dummy multi-column terms.
  • For bootstrap uncertainty, use get_kob_decomposition_bootstrap().

Authors

Code: Yuqi Liang

Documentation: Yuqi Liang

References

Jann, B. (2008). The Blinder–Oaxaca decomposition for linear regression models. The Stata Journal, 8(4), 453–479.

Rowold, C., Struffolino, E., & Fasang, A. E. (2025). Life-course-sensitive analysis of group inequalities: Combining sequence analysis with the Kitagawa–Oaxaca–Blinder decomposition. Sociological Methods & Research, 54(2), 646–705.