📊 Chapter 9 — Data Science Process¶

A comprehensive overview of the end‑to‑end Data Science Process, from prior knowledge to deployment.

🔁 1. Overview¶

The Data Science Process transforms raw data into actionable insights through systematic stages:

graph TD
    A[Prior Knowledge] --> B[Data Preparation]
    B --> C[Modeling]
    C --> D[Evaluation]
    D --> E[Deployment]

Each stage iteratively refines understanding and improves decision quality.

🧠 2. Prior Knowledge¶

Definition¶

Domain expertise, business context, and problem framing.
Hypotheses and expectations derived from experience or literature.

Tasks¶

Identify the objective (classification, regression, clustering, etc.).
Define target variable(s).
Select metrics relevant to success.

Example¶

Suppose you’re building a loan default prediction model. Prior knowledge includes: - Customer demographics, credit history, and income behavior. - Bank risk policies. - Economic indicators.

graph LR
    A[Domain Knowledge] --> B[Feature Selection]
    A --> C[Metric Design]
    A --> D[Assumption Setting]

🧹 3. Data Preparation¶

Definition¶

Converting raw, messy data into a structured and analyzable form.

Sub‑steps¶

Data Collection: from databases, APIs, logs, or files.
Data Cleaning: handle missing values, outliers, and duplicates.
Feature Engineering: derive new features, normalize, encode categorical variables.
Splitting: train, validation, and test sets.

graph TD
    A[Raw Data] --> B[Cleaning]
    B --> C[Feature Engineering]
    C --> D[Splitting]
    D --> E[Modeling Ready Dataset]

Example (by hand)¶

Step	Action	Example
Missing	Replace null income	`income = mean(income)`
Outliers	Cap extreme age values	`age = min(age, 80)`
Encoding	Convert gender	`male→1, female→0`

🧮 4. Modeling¶

Definition¶

Selecting and training algorithms to represent data patterns.

Types of Models¶

Problem	Model Family	Example
Classification	Logistic Regression, Decision Tree, SVM	Predict customer churn
Regression	Linear, Ridge, Lasso, Random Forest	Predict house price
Clustering	K‑Means, DBSCAN	Group customers
Dimensionality Reduction	PCA, t‑SNE	Visualization

graph LR
    A[Features] --> B[Train Model]
    B --> C[Evaluate on Validation Set]
    C --> D[Parameter Tuning]

Mathematical Foundation (example)¶

For linear regression:

\[ \hat{y} = \beta_0 + \sum_{j=1}^p \beta_j x_j \]

where \(\boldsymbol{\beta} = (X^TX)^{-1}X^Ty\)

📈 5. Evaluation¶

Purpose¶

Assess model performance and generalization.

Metrics¶

Type	Metrics
Classification	Accuracy, Precision, Recall, F1, ROC‑AUC
Regression	RMSE, MAE, \(R^2\)
Clustering	Silhouette Score, Davies‑Bouldin Index

Example Calculation¶

For predictions: \([0.9, 0.3, 0.8, 0.1]\), actual: \([1,0,1,0]\)
Threshold = 0.5 → Predictions = [1,0,1,0] → Accuracy = 100%.

graph TD
    A[Trained Model] --> B[Evaluation Metrics]
    B --> C[Compare Models]
    C --> D[Select Best Model]

🚀 6. Deployment¶

Definition¶

Integrating the trained model into a production environment.

Methods¶

Batch inference: periodic predictions (e.g., nightly).
Online inference: real‑time API endpoint.
Edge deployment: mobile or IoT model execution.

graph TD
    A[Best Model] --> B[API Service]
    B --> C[Application Integration]
    C --> D[User Feedback Loop]

Example Architecture¶

graph LR
    A[Model File #40; .pkl/.onnx #41;] --> B[Flask/FastAPI Service]
    B --> C[Docker Container]
    C --> D[Cloud Deployment <br> #40; AWS, Azure, GCP #41;]
    D --> E[Monitoring + Retraining]

🔄 7. Iteration and Continuous Learning¶

The process is cyclic — feedback from deployment improves earlier stages.

graph TD
    E[Deployment Feedback] --> D[Evaluation]
    D --> C[Modeling]
    C --> B[Data Preparation]
    B --> A[Prior Knowledge]

🧩 8. Summary Table¶

Stage	Key Actions	Tools
Prior Knowledge	Define objective, assumptions	Domain expertise, documentation
Data Preparation	Cleaning, transformation	Pandas, SQL, Excel
Modeling	Algorithm training	scikit‑learn, TensorFlow, PyTorch
Evaluation	Performance metrics	sklearn.metrics, visualization
Deployment	Serve models	Flask, FastAPI, Docker, AWS Sagemaker

🧠 9. Quick Exam Questions¶

Explain how domain knowledge impacts feature engineering.
What are the differences between batch and online deployment?
Which metric is suitable for imbalanced classification?
Describe a feedback loop in deployed systems.
Write the formula for a linear regression model.