# 🧹 Chapter 10 — Data Preprocessing
This chapter explains common data quality issues and the essential steps of data preprocessing, including cleaning, handling missing values, and noise reduction.
## 🧠 1. Why Preprocess the Data?
Raw data is rarely perfect. It may contain missing values, inconsistencies, noise, or outliers. Data preprocessing ensures the dataset is accurate, complete, and consistent before modeling.
```mermaid
graph TD
    A[Raw Data] --> B[Data Preprocessing]
    B --> C[Clean Data]
    C --> D[Model Training]
```

### Key Reasons
- Improved Model Accuracy — Clean data leads to better predictions.
- Reduced Complexity — Simplifies data for analysis.
- Error Minimization — Prevents misleading results due to poor data quality.
- Better Interpretability — Makes outputs meaningful to domain experts.
## ⚙️ 2. Data Quality Dimensions
| Dimension | Description | Example Issue |
|---|---|---|
| Accuracy | Correctness of data | Age recorded as 250 |
| Completeness | Missing values or records | Null entries in income |
| Consistency | Uniform format or scale | “USA” vs “United States” |
| Timeliness | Up‑to‑date data | Using 2010 sales for 2025 analysis |
| Validity | Follows rules and constraints | Negative quantity sold |
| Uniqueness | Duplicate entries | Same customer twice |
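Most of these dimensions can be audited mechanically. The sketch below runs one pandas check per dimension (timeliness aside, which needs a reference date); the `customers` table is hypothetical, seeded with the issues from the table above:

```python
import pandas as pd

# Hypothetical customer data illustrating several quality problems.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [25, 250, 250, 30],
    "income": [60_000, 50_000, 50_000, None],
    "country": ["USA", "United States", "United States", "USA"],
    "quantity_sold": [3, -1, -1, 5],
})

print(customers.isna().sum())                              # completeness: nulls per column
print("duplicates:", customers.duplicated().sum())         # uniqueness: repeated rows
print("bad ages:", (customers["age"] > 120).sum())         # accuracy: implausible values
print("bad qty:", (customers["quantity_sold"] < 0).sum())  # validity: rule violation
print("spellings:", customers["country"].unique())         # consistency: mixed labels
```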
```mermaid
graph LR
    A[Data Quality] --> B[Accuracy]
    A --> C[Completeness]
    A --> D[Consistency]
    A --> E[Timeliness]
    A --> F[Validity]
    A --> G[Uniqueness]
```

## 🧩 3. Major Tasks in Data Preprocessing
The main stages include:
```mermaid
graph TD
    A[Data Preprocessing] --> B[Data Cleaning]
    A --> C[Data Integration]
    A --> D[Data Transformation]
    A --> E[Data Reduction]
    A --> F[Data Discretization]
```

| Task | Description | Example |
|---|---|---|
| Data Cleaning | Handle noise, missing, inconsistent data | Replace nulls, smooth noise |
| Data Integration | Combine data from multiple sources | Join customer and sales tables |
| Data Transformation | Normalize or encode | Min‑max scaling, one‑hot encoding |
| Data Reduction | Reduce volume while maintaining integrity | PCA, feature selection |
| Data Discretization | Convert continuous to categorical | Income ranges: low/medium/high |
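The transformation, reduction, and discretization rows can be demonstrated in a few lines. This is a minimal sketch on a made-up table, assuming scikit-learn is available for the scaling and PCA steps:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "income": [30_000, 52_000, 75_000, 110_000],
    "spend": [1_200, 2_100, 2_600, 4_100],
    "segment": ["retail", "online", "online", "retail"],
})

# Transformation: min-max scaling and one-hot encoding.
df[["income", "spend"]] = MinMaxScaler().fit_transform(df[["income", "spend"]])
df = pd.get_dummies(df, columns=["segment"])

# Reduction: project the two numeric features onto one principal component.
component = PCA(n_components=1).fit_transform(df[["income", "spend"]])

# Discretization: bucket scaled income into three labeled ranges.
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])
print(df)
```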
## 🧼 4. Data Cleaning
### 4.1 What is Data Cleaning?
Data Cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset.
```mermaid
graph TD
    A[Raw Data] --> B[Detect Issues]
    B --> C[Handle Missing Values]
    C --> D[Handle Noise]
    D --> E[Remove Duplicates]
    E --> F[Clean Dataset]
```

## 🚧 5. Handling Missing Values
### 5.1 Common Causes
- Human errors during data entry
- Sensor or transmission failures
- Non‑responses in surveys
### 5.2 Strategies
| Method | Description | Example |
|---|---|---|
| Ignore tuple | Remove records with missing values | Drop rows with NaN |
| Global constant | Fill with a fixed value | Replace null with “Unknown” |
| Mean/Median/Mode | Use statistical measures | Fill income with mean value |
| Predictive | Use model‑based imputation | Regression or k‑NN |
| Domain‑based | Use rules from domain | Age = avg for that region |
### Manual Example

Before imputation:
| Record | Age | Income |
|---|---|---|
| A | 25 | 60K |
| B | NaN | 50K |
| C | 30 | NaN |
Mean Age = (25 + 30) / 2 = 27.5
Mean Income = (60K + 50K) / 2 = 55K

After mean imputation:
| Record | Age | Income |
|---|---|---|
| A | 25 | 60K |
| B | 27.5 | 50K |
| C | 30 | 55K |
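The same result can be reproduced in pandas, together with the drop and constant-fill strategies from the table in 5.2. A minimal sketch on the three records above:

```python
import pandas as pd

records = pd.DataFrame(
    {"Age": [25, None, 30], "Income": [60, 50, None]},  # Income in thousands
    index=["A", "B", "C"],
)

dropped = records.dropna()                           # ignore tuple: only A survives
constant = records.fillna({"Age": -1, "Income": 0})  # global constant fill
imputed = records.fillna(records.mean())             # mean imputation

print(imputed)  # Age of B -> 27.5, Income of C -> 55.0, matching the table above
```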
## 🔊 6. Handling Noisy Data
### 6.1 Definition
Noisy data contains random errors, deviations, or outliers that distort patterns.
### 6.2 Noise Reduction Techniques
| Technique | Description | Example |
|---|---|---|
| Binning | Sort data and smooth using bins | Group ages 20‑29, 30‑39 |
| Regression | Fit model to remove deviations | Linear regression smoothing |
| Clustering | Group similar data | Remove far‑away outliers |
| Moving Average | Smooth temporal fluctuations | Stock price smoothing |
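As a quick illustration of the last row, a centered moving average smooths a hypothetical price series in one line with pandas:

```python
import pandas as pd

prices = pd.Series([100, 103, 98, 120, 101, 104, 99])  # made-up noisy prices

# 3-point centered moving average; the first and last points stay NaN.
smoothed = prices.rolling(window=3, center=True).mean()
print(smoothed)  # the 120 spike is pulled toward its neighbours
```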
```mermaid
graph TD
    A[Noisy Data] --> B[Binning]
    A --> C[Regression Smoothing]
    A --> D[Clustering]
    A --> E[Moving Average]
    B & C & D & E --> F[Smoother Dataset]
```

### Manual Example — Binning
Data: [5, 6, 7, 10, 12, 13, 20]
Sort the data and divide it into 3 equal-width bins, then replace each value with the mean of its bin:
- Bin1: [5,6,7] → mean=6
- Bin2: [10,12,13] → mean=11.67
- Bin3: [20] → mean=20
Smoothed data: [6,6,6,11.67,11.67,11.67,20]
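The same equal-width, bin-mean smoothing can be verified in code. A minimal pandas sketch (note `right=False`, so that 10 falls into the second bin, matching the manual split):

```python
import pandas as pd

data = pd.Series([5, 6, 7, 10, 12, 13, 20])

# Equal-width binning: split the value range into 3 bins,
# then replace each value with the mean of its bin.
bins = pd.cut(data, bins=3, right=False)
smoothed = data.groupby(bins).transform("mean")
print(smoothed.round(2).tolist())  # [6.0, 6.0, 6.0, 11.67, 11.67, 11.67, 20.0]
```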
## 🧮 7. Data Cleaning as a Process
The cleaning process involves multiple iterative passes:
```mermaid
graph TD
    A[Identify Data Issues] --> B[Select Cleaning Strategy]
    B --> C[Apply Transformations]
    C --> D[Verify Results]
    D --> E[Repeat if Necessary]
    E --> F[Store Clean Data]
```

| Step | Action | Output |
|---|---|---|
| Identify | Detect missing/outlier values | Report summary |
| Strategy | Choose imputation or removal | Decision plan |
| Apply | Execute cleaning rules | Updated dataset |
| Verify | Compare stats pre/post cleaning | Quality metrics |
| Store | Save clean data for analysis | Ready data file |
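The Verify step can be automated by diffing summary statistics before and after cleaning. A minimal sketch, assuming a made-up `income` column with one missing value and one outlier:

```python
import pandas as pd

raw = pd.DataFrame({"income": [60, 50, None, 50, 900]})
before = raw["income"].describe()

clean = raw.dropna()
clean = clean[clean["income"] < clean["income"].quantile(0.95)]  # trim the extreme tail
after = clean["income"].describe()

# The side-by-side comparison serves as the quality metric from the table above.
print(pd.concat([before, after], axis=1, keys=["before", "after"]))
```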
## 📈 8. Summary Diagram
```mermaid
graph TD
    A[Raw Data] --> B[Detect Issues]
    B --> C[Handle Missing Values]
    B --> D[Handle Noisy Data]
    C --> E[Verify and Validate]
    D --> E
    E --> F[Clean Dataset]
    F --> G[Integration & Transformation]
```

## 🧠 9. Key Takeaways
- Data preprocessing is vital for reliable modeling.
- Major steps include cleaning, integration, transformation, reduction, and discretization.
- Handling missing and noisy data is critical for data quality.
- Cleaning is an iterative and domain‑driven process.
## 📘 10. Exam Practice Questions
- What are the main reasons for preprocessing data?
- Describe at least three techniques for handling missing values.
- Explain the concept of noisy data with examples.
- Outline the stages of the data cleaning process.
- Why is iterative cleaning important in real‑world data science projects?