🔄 Chapter 13: Data Transformation and Discretization
This chapter explores data transformation strategies, normalization, discretization by binning, and concept hierarchy generation for nominal data.
⚙️ 1. Overview of Data Transformation
Data transformation converts data into a format or structure suited to analysis.
It improves data quality, puts attributes on comparable scales, and helps many algorithms run more efficiently.
graph TD
A[Raw Data] --> B[Data Transformation]
B --> C[Normalized Data]
B --> D[Discretized Data]
B --> E[Hierarchical Categories]
Objectives
- Scale numeric data into comparable ranges
- Reduce noise and variation
- Improve model convergence
- Simplify representation (e.g., categorical grouping)
📏 2. Data Transformation Strategies
| Transformation Type | Description | Example |
|---|---|---|
| Smoothing | Remove noise from data | Moving averages |
| Aggregation | Summarize values | Weekly → monthly sales |
| Generalization | Replace detailed data with higher-level concepts | “Student Age” → “Age Group” |
| Normalization | Scale data to fixed range | [0,1] scaling |
| Discretization | Convert continuous to categorical | Age → Young, Middle, Old |
| Attribute Construction | Create new derived features | BMI = weight/height² |
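Two rows of the table can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the window size of 3, and the units (kilograms and metres for BMI) are assumptions that do not appear in the table.

```python
def moving_average(values, window=3):
    """Smoothing: mean of each run of `window` consecutive values."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def bmi(weight_kg, height_m):
    """Attribute construction: BMI = weight / height^2 (kg and m assumed)."""
    return weight_kg / height_m ** 2


print(moving_average([2, 4, 6, 8, 10]))  # [4.0, 6.0, 8.0]
print(round(bmi(70, 1.75), 2))           # 22.86
```

The moving average returns fewer points than it receives, because only full windows are averaged.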
graph LR
A[Raw Attributes] --> B[Transformation Methods]
B --> C[Normalized Values]
B --> D[Discretized Intervals]
B --> E[Constructed Attributes]
📉 3. Data Transformation by Normalization
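Definition
Normalization rescales numeric attributes to a common scale while preserving the relative ordering and spacing of values within each attribute, so that attributes measured in large units do not dominate those measured in small ones.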
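3.1 Min–Max Normalization
Maps each value x linearly into the range [0,1], where min and max are the smallest and largest values of the attribute:
\[ x' = \frac{x - \min}{\max - \min} \]
Example:
Given values [10, 15, 20, 25], normalize 20: \( x' = (20 - 10) / (25 - 10) \approx 0.67 \). The full set maps as follows: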
| Original | Normalized |
|---|---|
| 10 | 0.0 |
| 15 | 0.33 |
| 20 | 0.67 |
| 25 | 1.0 |
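The table above can be reproduced with a short Python sketch. The function name and the optional `new_min` / `new_max` parameters are assumptions added for illustration; the handling of a constant attribute (all values equal) is also a choice made here, since the formula is undefined in that case.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: the formula would divide by zero,
        # so this sketch maps every value to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min)
            for v in values]


normalized = min_max_normalize([10, 15, 20, 25])
print([round(v, 2) for v in normalized])  # [0.0, 0.33, 0.67, 1.0]
```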
3.2 Z‑Score Normalization
Standardizes data using the mean (μ) and standard deviation (σ) of the attribute:
\[ x' = \frac{x - \mu}{\sigma} \]
Example:
x = 50, μ = 40, σ = 5
→ \( x' = (50 - 40) / 5 = 2.0 \)
| Original | Z‑score |
|---|---|
| 30 | −2 |
| 40 | 0 |
| 50 | +2 |
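A minimal Python sketch of the same calculation follows. The function name is hypothetical. The table uses a given μ = 40 and σ = 5, so the sketch accepts them as arguments; when they are omitted it falls back to the mean and the population standard deviation of the data, which is an assumption about which estimator is wanted.

```python
from statistics import fmean, pstdev


def z_score(values, mu=None, sigma=None):
    """Return (x - mu) / sigma for every x in values."""
    mu = fmean(values) if mu is None else mu
    sigma = pstdev(values) if sigma is None else sigma
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


print(z_score([30, 40, 50], mu=40, sigma=5))  # [-2.0, 0.0, 2.0]
```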
3.3 Decimal Scaling Normalization
Moves the decimal point of values to bring them within (−1, 1):
\[ x' = \frac{x}{10^{j}} \]
where j is the smallest integer such that max(|x'|) < 1.
Example:
x = 345 → \( x' = 0.345 \) since j = 3
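The search for j can be sketched as a small loop. The function name is hypothetical, j is assumed to be non-negative (values already inside the target range are left unscaled), and the extra values −72 and 8 are made up to show that the whole attribute shares one j.

```python
def decimal_scale(values):
    """Divide by 10**j, where j is the smallest j >= 0 with max(|x'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scale([345, -72, 8])
print(j, scaled)  # 3 [0.345, -0.072, 0.008]
```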
🔢 4. Discretization by Binning
Definition
Discretization converts continuous attributes into categorical intervals (bins).
It reduces the effect of minor observation errors and makes patterns easier to see.
4.1 Types of Binning
| Type | Description | Example |
|---|---|---|
| Equal‑width | Divide range into equal intervals | 0–10, 10–20, 20–30 |
| Equal‑frequency | Each bin holds the same number of samples | 5 per bin |
| Supervised | Bins chosen based on class labels | Decision tree splits |
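Equal‑frequency binning can be sketched as follows. The function name is hypothetical, and the rule for leftover values (the first bins each take one extra value when the data does not divide evenly) is an assumption of this sketch. The data is the ten-value set used in the equal‑width example.

```python
def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins groups whose sizes differ by at most one."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)  # first `extra` bins get one more
        bins.append(ordered[start:start + size])
        start += size
    return bins


data = [4, 8, 15, 21, 24, 25, 28, 34, 35, 40]
print(equal_frequency_bins(data, 2))
# [[4, 8, 15, 21, 24], [25, 28, 34, 35, 40]]
```

With two bins, each bin receives five values, matching the "5 per bin" entry in the table.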
4.2 Example: Equal‑width Binning
Data: [4, 8, 15, 21, 24, 25, 28, 34, 35, 40]
Range = 40 − 4 = 36
If 3 bins → Bin width = 36 / 3 = 12
| Bin | Interval | Values |
|---|---|---|
| 1 | [4, 16) | 4, 8, 15 |
| 2 | [16, 28) | 21, 24, 25 |
| 3 | [28, 40] | 28, 34, 35, 40 |
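The assignment of values to equal‑width bins can be checked with a short Python sketch. The function name is hypothetical; the intervals are treated as left-closed, with the maximum value placed in the last bin, as in the table.

```python
def equal_width_bins(values, n_bins):
    """Return (bin width, list of bins) for left-closed equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the bin width would be zero")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


data = [4, 8, 15, 21, 24, 25, 28, 34, 35, 40]
width, bins = equal_width_bins(data, 3)
print(width)  # 12.0
print(bins)   # [[4, 8, 15], [21, 24, 25], [28, 34, 35, 40]]
```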
graph TD
A[Continuous Data] --> B[Equal-width Intervals]
B --> C[Bin 1: 4-16]
B --> D[Bin 2: 16-28]
B --> E[Bin 3: 28-40]
4.3 Smoothing by Binning
Within each bin, replace values by one of:
- Bin mean (average)
- Bin median
- Bin boundary values (each value is replaced by the closer of the bin's minimum and maximum)
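Example (Bin 1 = [4, 8, 15])
→ Mean smoothing: replace each value with the bin mean, (4 + 8 + 15) / 3 = 9
Applying the same rule to Bin 2 (mean 70 / 3 ≈ 23.33) and Bin 3 (mean 137 / 4 = 34.25) and rounding to the nearest integer gives:
Smoothed data: [9, 9, 9, 23, 23, 23, 34, 34, 34, 34]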
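The three smoothing rules can be compared side by side in Python. The function names are hypothetical, and the tie rule for boundary smoothing (a value equally far from both boundaries goes to the lower one) is an assumption of this sketch. The bins are the equal‑width bins from section 4.2.

```python
from statistics import fmean, median


def smooth_by_mean(bin_values):
    """Replace every value in the bin with the bin mean."""
    return [fmean(bin_values)] * len(bin_values)


def smooth_by_median(bin_values):
    """Replace every value in the bin with the bin median."""
    return [median(bin_values)] * len(bin_values)


def smooth_by_boundaries(bin_values):
    """Replace each value with the closer of the bin minimum and maximum."""
    lo, hi = min(bin_values), max(bin_values)
    # Ties go to the lower boundary (an arbitrary choice for this sketch).
    return [lo if v - lo <= hi - v else hi for v in bin_values]


bins = [[4, 8, 15], [21, 24, 25], [28, 34, 35, 40]]
print([round(v) for b in bins for v in smooth_by_mean(b)])
# [9, 9, 9, 23, 23, 23, 34, 34, 34, 34]
print([smooth_by_boundaries(b) for b in bins])
# [[4, 4, 15], [21, 25, 25], [28, 28, 40, 40]]
```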
🧭 5. Concept Hierarchy Generation for Nominal Data
Definition
For categorical attributes, data can be organized into concept hierarchies that represent levels of abstraction.
graph TD
A[City] --> B[State]
B --> C[Country]
C --> D[Continent]
5.1 Example: Location Hierarchy
| Level | Example |
|---|---|
| City | Hyderabad |
| State | Telangana |
| Country | India |
| Continent | Asia |
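One simple way to represent such a hierarchy in code is a child-to-parent lookup table. This is only a sketch: the dictionary and function names are hypothetical, and the mapping holds just the single path from the table above.

```python
# Child -> parent links for the location hierarchy in the table.
PARENT = {
    "Hyderabad": "Telangana",  # City -> State
    "Telangana": "India",      # State -> Country
    "India": "Asia",           # Country -> Continent
}


def generalize(value, levels_up=1):
    """Climb `levels_up` levels of the hierarchy starting from `value`."""
    for _ in range(levels_up):
        value = PARENT[value]
    return value


def hierarchy_path(value):
    """Return the full path from `value` up to the top of the hierarchy."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


print(generalize("Hyderabad", 2))   # India
print(hierarchy_path("Hyderabad"))  # ['Hyderabad', 'Telangana', 'India', 'Asia']
```

Replacing each city in a data set with `generalize(city, 1)` rolls the attribute up from the City level to the State level.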
5.2 Hierarchy Generation Methods
| Approach | Description | Example |
|---|---|---|
| Explicit specification | Defined by domain expert | City → State → Country |
| Automatic generation | Derived from the data itself, e.g. by ordering attributes by their number of distinct values | ZIP → City → Region |
| Schema-based | Based on database schema | Employee_ID → Department → Division |
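Automatic generation is often based on a distinct-value heuristic: the attribute with the most distinct values is placed at the lowest level, and the one with the fewest at the top. The sketch below assumes that heuristic, which the table does not name explicitly. The function name and the sample records are made up for illustration.

```python
def order_by_distinct_counts(records, attributes):
    """Order attributes from most to fewest distinct values (lowest level first)."""
    counts = {a: len({r[a] for r in records}) for a in attributes}
    ordered = sorted(attributes, key=lambda a: counts[a], reverse=True)
    return ordered, counts


# Illustrative records only; not taken from a real data set.
records = [
    {"zip": "500001", "city": "Hyderabad", "region": "South"},
    {"zip": "500032", "city": "Hyderabad", "region": "South"},
    {"zip": "560001", "city": "Bengaluru", "region": "South"},
    {"zip": "110001", "city": "New Delhi", "region": "North"},
    {"zip": "110002", "city": "New Delhi", "region": "North"},
]

levels, counts = order_by_distinct_counts(records, ["region", "zip", "city"])
print(" -> ".join(levels))  # zip -> city -> region
print(counts)               # {'region': 2, 'zip': 5, 'city': 3}
```

The heuristic can mislead when a higher-level attribute happens to have more distinct values than a lower-level one, so the generated order is usually reviewed by someone who knows the domain.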
🧠 6. Summary Table
| Process | Purpose | Techniques |
|---|---|---|
| Transformation | Adjust data scale or distribution | Normalization, smoothing |
| Normalization | Scale numeric attributes | Min–max, Z‑score, Decimal scaling |
| Discretization | Convert numeric to categorical | Binning, decision tree splits |
| Concept Hierarchy | Abstract categorical data | City→State→Country |
📘 7. Practice Questions
- Explain the importance of data transformation before modeling.
- Derive the formula for Z‑score normalization and compute the Z‑score for x = 70, μ = 50, σ = 10.
- Perform equal‑width binning for the dataset [3,5,7,9,11,13,15] into 2 bins.
- Differentiate between equal‑width and equal‑frequency discretization.
- Draw a concept hierarchy for “Product Category” (e.g., Item → Type → Department → Store).