Converting Numeric Data to Categorical Data

In sequence analysis, we usually work with categorical states (e.g., "Education", "Employment", "Unemployed"). But many datasets contain numeric values (e.g., income, CO₂ emissions, examination scores).

To use such data in sequence analysis, we need to convert numeric values into categories (states).

Why Do We Need Categories?

Sequence analysis is about comparing paths of states.
Numeric values can be too fine-grained: it makes little sense to treat CO₂ emissions of 0.71 and 0.72 as two separate “states.”
Categories make it possible to study categorical patterns, trajectories, and transitions instead of focusing on tiny numeric differences.

What Can We Study Once Numeric Data Is Categorized?

Once numeric values are converted into categorical states, we can treat them as trajectories and ask sequence-based questions such as the following:

1. Income Trajectories

Numeric variable: annual income (in USD).
Categorical conversion: income deciles (bottom 10%, 10–20%, … top 10%).
Research questions:
- How do individuals move across income brackets over their careers?
- Are there “sticky floors” (people remaining in the bottom deciles) or “glass ceilings” (barriers to reaching top deciles)?
- How do income trajectories differ by gender, education, or region?

2. CO₂ Emissions per Capita

Numeric variable: annual CO₂ emissions per capita (metric tons).
Categorical conversion: global deciles (very low → very high).
Research questions:
- How do countries move between emission categories over time?
- Which regions show convergence (becoming similar) vs divergence (becoming more unequal) in their emission trajectories?
- Are there typical “paths” countries follow (e.g., from low → middle → high emissions as they industrialize)?

3. Examination Scores (e.g., students across school years)

Numeric variable: standardized test scores (0–100).
Categorical conversion: performance groups (Low, Average, High).
Research questions:
- What are the typical learning trajectories students follow? (e.g., consistently high, late improvement, early decline)
- Do certain groups of students converge toward similar performance categories over time?
- How do interventions (e.g., tutoring programs) change the sequence of performance states?

Common Methods for Conversion

Fixed Thresholds (Domain Knowledge)
- You set cutoffs based on meaningful criteria.
- Example: Body Mass Index (BMI) → Underweight (<18.5), Normal (18.5–24.9), Overweight (25–29.9), Obese (30+).
- Advantage: Easy to interpret.
- Limitation: Requires expert knowledge; may not reflect the data distribution.
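
As a minimal sketch of fixed-threshold binning, the BMI cutoffs above can be applied with Python's standard `bisect` module. The function name `bmi_state` and the sample values are made up for illustration, and the sketch assumes that a value exactly on a cutoff belongs to the higher category (so 18.5 is Normal and 30 is Obese).

```python
from bisect import bisect_right

# Lower bounds of every category after the first, taken from the BMI example.
BMI_CUTOFFS = [18.5, 25.0, 30.0]
BMI_LABELS = ["Underweight", "Normal", "Overweight", "Obese"]


def bmi_state(bmi: float) -> str:
    """Return the BMI category; a value equal to a cutoff joins the higher bin."""
    # bisect_right counts how many cutoffs are <= bmi, which is the bin index.
    return BMI_LABELS[bisect_right(BMI_CUTOFFS, bmi)]


bmi_values = [17.9, 18.5, 24.9, 27.3, 30.0]  # hypothetical measurements
bmi_states = [bmi_state(v) for v in bmi_values]
print(bmi_states)
```

Because the cutoffs are written down explicitly, the same rule can be applied unchanged to every year of data, which keeps the states comparable over time.
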
Quantiles (Data-Driven Binning)
- Split data into equal-sized groups using percentiles, quartiles, quintiles, or deciles.
- Example: Income → Bottom 20% = “Low”, 20–40% = “Lower-Middle”, …, Top 20% = “High”.
- Advantage: Balanced group sizes; good for comparisons.
- Limitation: Categories are relative to the dataset, not absolute values.
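
A rough sketch of quantile binning using only the standard library: `statistics.quantiles` computes the cut points and `bisect_left` assigns each value to its group. The helper `quantile_states` and the ten incomes are hypothetical. Groups are numbered from 1 (bottom 20%, “Low”) to 5 (top 20%, “High”).

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles


def quantile_states(values, n_groups):
    """Assign each value a group number from 1 (lowest) to n_groups (highest)."""
    # n_groups - 1 cut points, interpolated between the sorted observations.
    cutoffs = quantiles(values, n=n_groups, method="inclusive")
    # A value equal to a cut point stays in the lower group.
    return [bisect_left(cutoffs, v) + 1 for v in values]


# Hypothetical annual incomes in USD (illustration only).
incomes = [18_000, 24_000, 31_000, 36_000, 42_000,
           48_000, 55_000, 67_000, 90_000, 150_000]
income_quintiles = quantile_states(incomes, n_groups=5)
group_sizes = Counter(income_quintiles)
print(income_quintiles)
print(group_sizes)
```

With distinct values, every group ends up the same size. When many values are tied, the groups can no longer be exactly equal.
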
Standard Deviation from Mean (Z-Scores)
- Define categories based on how far a value is from the mean.
- Example: Test Scores → Below −1σ = “Low Performer”, −1σ to +1σ = “Average”, Above +1σ = “High Performer”.
- Advantage: Good when data is approximately normal.
- Limitation: Harder for non-technical audiences to interpret.
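
The z-score rule can be sketched as follows. The function `zscore_states` and the test scores are hypothetical, and the sketch assumes the population standard deviation (`statistics.pstdev`); values exactly at −1σ or +1σ are treated as “Average”.

```python
from statistics import mean, pstdev


def zscore_states(values):
    """Label each value by its distance from the mean in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; statistics.stdev gives the sample SD
    states = []
    for v in values:
        z = (v - mu) / sigma  # assumes the values are not all identical
        if z < -1:
            states.append("Low Performer")
        elif z > 1:
            states.append("High Performer")
        else:
            states.append("Average")
    return states


scores = [45, 60, 62, 70, 71, 75, 88, 95]  # hypothetical test scores
score_states = zscore_states(scores)
print(score_states)
```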

Step-by-Step Example: CO₂ Emissions

Suppose we have per capita CO₂ emissions (in metric tons):

Country	Year	CO₂ per capita
A	2000	0.8
B	2000	2.5
C	2000	7.2
D	2000	15.0

Using Quantiles (Deciles)

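Collect all values: [0.8, 2.5, 7.2, 15.0].
Compute decile cutoffs (10 equally sized bins). Four values are too few to estimate real deciles, so for simplicity assume the following illustrative upper bounds for each bin:
- 0–10%: up to 0.8
- 10–20%: up to 1.5
- 20–30%: up to 2.5
- …
- 90–100%: up to 15.0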
Assign states:
- Country A = 1st decile (very low)
- Country B = 3rd decile (low-middle)
- Country C = 7th decile (high)
- Country D = 10th decile (very high)

Now we can represent each country’s emission history as a sequence of categorical states instead of raw numbers.
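
The whole conversion can be sketched in a few lines of standard-library Python. The 2000 values come from the table above; the values for later years are invented for illustration and chosen so that the 2000 states match the assignments listed above. Pooling all country-years before computing the cutoffs is an assumption of this sketch, not the only possible choice.

```python
from bisect import bisect_left
from statistics import quantiles

years = [2000, 2005, 2010, 2015, 2020]
# 2000 values are from the example table; later years are made up.
emissions = {
    "A": [0.8, 0.9, 1.0, 1.1, 1.3],
    "B": [2.5, 3.0, 3.8, 4.6, 5.5],
    "C": [7.2, 6.9, 6.5, 7.4, 7.8],
    "D": [15.0, 14.2, 13.1, 12.0, 11.0],
}

# Pool every country-year so one set of deciles applies to the whole panel.
pooled = [value for series in emissions.values() for value in series]
cutoffs = quantiles(pooled, n=10, method="inclusive")  # 9 cut points


def decile(value):
    """Return the decile (1 = lowest, 10 = highest) of a single value."""
    return bisect_left(cutoffs, value) + 1


# One categorical sequence per country, ordered by year.
sequences = {
    country: [decile(value) for value in series]
    for country, series in emissions.items()
}
for country, seq in sequences.items():
    print(country, dict(zip(years, seq)))
```

Each country is now a short sequence of decile states that can be compared with the others, regardless of the exact emission values behind it.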

With categories observed over many years, we can make statements such as:
- Country A stayed in “very low” for 20 years.
- Country C moved from “middle” to “high” after 1990.
Without categories, comparing 0.71 vs 0.72 would not give meaningful patterns.

To see how this is computed in Python, refer to the complete guide with code in CO₂ Emissions Dataset (1800–2022) 🌍.

Key Takeaways

Sequence analysis requires categorical states → numeric values must be grouped.
You can choose fixed thresholds (expert-defined), quantiles (data-driven), or statistical cutoffs (standard deviations from the mean).
Deciles are often a good balance: enough granularity to see differences, but not too many categories to overwhelm the analysis.

Exercises: Converting Numeric Data to Categorical Data

Exercise 1: Why Categories?

Suppose you measure daily step counts from wearable devices (e.g., 9,950 vs 10,020).
Why might it be unhelpful to treat each exact number as a separate state in sequence analysis?

Exercise 2: Household Electricity Use

Four households report their monthly electricity consumption (kWh):

Household	Consumption
A	120
B	220
C	350
D	800

Convert these values into quartiles (4 equal-sized groups).
Assign each household to a quartile category (Q1 = Very Low, Q2 = Low-Medium, Q3 = Medium-High, Q4 = Very High).

Exercise 3: Blood Pressure Categories

A health survey records systolic blood pressure: 110, 125, 138, 162.
Use the standard clinical thresholds:

Normal: <120
Elevated: 120–129
Hypertension Stage 1: 130–139
Hypertension Stage 2: ≥140

Assign each person to a category.

Exercise 4: Student Attendance (Z-Scores)

Absences from school (days per year) are: 2, 4, 6, 8, 12.
The mean is 6.4 and the (population) standard deviation is approximately 3.4.
Define categories:

Below −1σ → "Very Low Absence"
−1σ to +1σ → "Typical Absence"
Above +1σ → "High Absence"

Classify each student.

Exercise 5: Water Pollution Levels

Average river nitrate concentration (mg/L) is reported for two regions:

Region	2000	2010	2020
A	0.5	1.2	2.8
B	6.0	6.5	7.5

Define three categories using fixed thresholds:
- Low: 0–2
- Medium: above 2, up to 5
- High: >5
Represent each region’s trajectory as a sequence of categorical states.

Answers and Explanations

Exercise 1

Tiny numeric differences (e.g., 9,950 vs 10,020 steps) don’t indicate meaningful differences in behavior.
Sequence analysis is about paths of discrete states (e.g., “Sedentary,” “Moderately Active,” “Highly Active”). Categorization allows us to focus on interpretable patterns.

Exercise 2

Sorted consumption: [120, 220, 350, 800].
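Quartile assignment (four values in four equal-sized groups, so one household per quartile):

Q1: Household A (120) → Very Low
Q2: Household B (220) → Low-Medium
Q3: Household C (350) → Medium-High
Q4: Household D (800) → Very High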

Exercise 3

110 → Normal (<120)
125 → Elevated (120–129)
138 → Hypertension Stage 1 (130–139)
162 → Hypertension Stage 2 (≥140)

Exercise 4

Z-scores = (value − mean)/σ:

2 → (2−6.4)/3.4 ≈ −1.29 → Very Low Absence
4 → (4−6.4)/3.4 ≈ −0.71 → Typical Absence
6 → −0.12 → Typical Absence
8 → 0.47 → Typical Absence
12 → 1.65 → High Absence
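
The arithmetic can be checked with a short script. It uses the rounded mean and standard deviation stated in the exercise for the z-scores, and also computes both the population and the sample standard deviation to show which one the stated 3.4 corresponds to. The helper name `absence_state` is made up for this sketch.

```python
from statistics import mean, pstdev, stdev

absences = [2, 4, 6, 8, 12]

mu = mean(absences)
sigma_population = pstdev(absences)  # divides by n
sigma_sample = stdev(absences)       # divides by n - 1


def absence_state(z):
    """Map a z-score to the categories defined in the exercise."""
    if z < -1:
        return "Very Low Absence"
    if z > 1:
        return "High Absence"
    return "Typical Absence"


# z-scores based on the rounded values given in the exercise (6.4 and 3.4).
z_scores = [round((x - 6.4) / 3.4, 2) for x in absences]
absence_states = [absence_state(z) for z in z_scores]

print(mu, round(sigma_population, 2), round(sigma_sample, 2))
print(list(zip(absences, z_scores, absence_states)))
```

In this dataset the classification comes out the same with the sample standard deviation, although the z-scores themselves are slightly smaller in magnitude.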

Exercise 5

Using bins: Low (0–2), Medium (above 2, up to 5), High (>5).

Region A: 0.5 (Low) → 1.2 (Low) → 2.8 (Medium) → Sequence = [Low, Low, Medium]
Region B: 6.0 (High) → 6.5 (High) → 7.5 (High) → Sequence = [High, High, High]

Interpretation:

Region A shows a shift from Low to Medium pollution.
Region B remains consistently High.
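
A small sketch that reproduces these trajectories. The names `nitrate_state`, `trajectories`, and `changed` are hypothetical, and the handling of a value exactly equal to 2 (placed in Low) is an assumption, since the exercise does not say which bin it belongs to.

```python
from bisect import bisect_left

NITRATE_CUTOFFS = [2.0, 5.0]  # mg/L
NITRATE_LABELS = ["Low", "Medium", "High"]


def nitrate_state(value):
    """Return Low, Medium, or High for a nitrate concentration in mg/L."""
    # bisect_left keeps a value equal to a cutoff in the lower bin:
    # 5.0 is Medium (High is strictly > 5) and, by assumption, 2.0 is Low.
    return NITRATE_LABELS[bisect_left(NITRATE_CUTOFFS, value)]


# Values for 2000, 2010, and 2020 from the exercise table.
nitrate = {"A": [0.5, 1.2, 2.8], "B": [6.0, 6.5, 7.5]}

trajectories = {
    region: [nitrate_state(v) for v in values]
    for region, values in nitrate.items()
}
# A region "changed" if it visited more than one state.
changed = {region: len(set(seq)) > 1 for region, seq in trajectories.items()}

print(trajectories)
print(changed)
```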

Author: Yuqi Liang
