🧩 Chapter 11 — Data Integration¶
This chapter explains Data Integration, focusing on entity identification, redundancy and correlation analysis, tuple duplication, and data value conflict resolution.
🔗 1. What is Data Integration?¶
Data Integration is the process of combining data from multiple sources into a unified and consistent view.
It’s crucial for analytical systems that rely on data from heterogeneous sources (e.g., databases, files, APIs).
```mermaid
graph TD
A[Source 1] --> E[Integrated Data]
B[Source 2] --> E
C[Source 3] --> E
```

Example¶
- Integrating customer data from CRM, ERP, and web analytics systems.
- Ensures consistent customer IDs, names, and contact details across all sources.
🧠 2. The Entity Identification Problem¶
Definition¶
The Entity Identification Problem (also called Record Linkage or Entity Resolution) occurs when the same real-world entity appears under different names or identifiers across datasets.
| Attribute | System A | System B |
|---|---|---|
| Customer Name | “John Smith” | “J. A. Smith” |
| Customer ID | 1452 | CUST‑001452 |
| Address | 12 Oak Rd, NY | 12 Oak Road, New York |
The challenge is to determine whether these records refer to the same entity.
```mermaid
graph LR
A[Record in Source 1] --> C[Entity Matching]
B[Record in Source 2] --> C
C --> D[Unified Entity Record]
```

Solutions¶
- Rule‑based matching — exact or fuzzy string matching.
- Machine Learning approaches — classify record pairs as “same/different.”
- Unique key assignment — generate global identifiers.
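The rule-based and fuzzy-matching ideas above can be sketched with Python's standard `difflib`; the names and the 0.8 similarity threshold are illustrative assumptions, not a production-ready matcher:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation so 'J. Smith' and 'j smith' compare cleanly."""
    return " ".join(name.lower().replace(".", " ").split())

def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy string match: ratio of matching characters after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_entity("John Smith", "J. Smith"))     # True: high character overlap
print(same_entity("John Smith", "Alice Brown"))  # False: names barely overlap
```

In practice the similarity score would feed a classifier or a set of business rules rather than a single hard threshold.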
🔁 3. Redundancy and Correlation Analysis¶
3.1 Redundancy¶
Redundancy refers to duplicate or repetitive data that doesn’t add new information.
It wastes storage and may bias statistical analysis.
| Scenario | Redundant Data |
|---|---|
| Customer table | Repeated address field |
| Joined tables | Duplicate keys after merge |
```mermaid
graph TD
A[Data Source 1] --> B[Integration Process]
C[Data Source 2] --> B
B --> D[Redundant Attributes Found]
D --> E[Remove or Merge Redundant Data]
```

3.2 Correlation Analysis¶
Correlation analysis helps identify relationships between variables to remove redundant attributes.
Mathematically, for numeric attributes \(X\) and \(Y\) with \(n\) observations, the Pearson correlation coefficient is

$$ r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n\,\sigma_X\,\sigma_Y} $$

where \(\bar{X}, \bar{Y}\) are the means and \(\sigma_X, \sigma_Y\) the standard deviations.
If \(|r_{XY}| \approx 1\) → strong correlation → one attribute may be redundant.
Worked example with \(X = (1, 2, 3)\) and \(Y = (2, 3, 4)\):

| X | Y | X − X̄ | Y − Ȳ | Product |
|---|---|---|---|---|
| 1 | 2 | −1 | −1 | 1 |
| 2 | 3 | 0 | 0 | 0 |
| 3 | 4 | +1 | +1 | 1 |

Here \(\bar{X} = 2\) and \(\bar{Y} = 3\), so \(\sum (x_i - \bar{X})(y_i - \bar{Y}) = 2\) and \(\sigma_X = \sigma_Y = \sqrt{2/3}\), giving

$$ r_{XY} = \frac{2}{3 \cdot (2/3)} = 1 $$

Since \(Y = X + 1\), the attributes are perfectly correlated and one of them is redundant.
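The hand computation for \(X = (1, 2, 3)\), \(Y = (2, 3, 4)\) can be checked with a short Python sketch (using population standard deviations, to match the formula):

```python
import math

def pearson(xs, ys):
    """Pearson r_XY = sum((x - mx)(y - my)) / (n * sigma_x * sigma_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 3, 4]))  # ≈ 1.0: Y = X + 1, perfectly correlated
```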
🧬 4. Tuple Duplication¶
Definition¶
Tuple duplication occurs when the same data record appears multiple times in a dataset, either within or across sources.
```mermaid
graph TD
A[Source 1] --> D[Duplicate Detector]
B[Source 2] --> D
D --> E[Remove or Merge Duplicates]
```

Causes¶
- Repeated data imports
- Manual data entry
- Join operations without distinct filters
Detection & Removal¶
| Method | Description |
|---|---|
| Exact matching | Compare full tuples |
| Key-based | Check duplicate primary keys |
| Hash-based | Use hash functions for large datasets |
| Clustering | Group similar tuples |
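The hash-based method can be sketched in a few lines of Python: each tuple is hashed and only the first occurrence is kept, so full tuples need not be held in memory for wide records (the sample rows are illustrative):

```python
import hashlib

def dedupe(rows):
    """Keep the first occurrence of each tuple, using a SHA-256 digest as the key."""
    seen, unique = set(), []
    for row in rows:
        digest = hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique

rows = [(1, "John", "NY"), (2, "Alice", "TX"), (1, "John", "NY")]
print(dedupe(rows))  # the repeated (1, 'John', 'NY') tuple is removed
```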
Example¶
| ID | Name | City |
|---|---|---|
| 1 | John | NY |
| 2 | Alice | TX |
| 1 | John | NY |
→ Duplicate tuple (ID=1).
SQL Example:
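A sketch of a duplicate-detection query, run here against an in-memory SQLite table via Python's built-in `sqlite3` (the table and column names follow the example above and are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "John", "NY"), (2, "Alice", "TX"), (1, "John", "NY")],
)

# Detect duplicated tuples: group identical rows, keep groups with count > 1
dups = conn.execute("""
    SELECT id, name, city, COUNT(*) AS n
    FROM customers
    GROUP BY id, name, city
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)  # the (1, 'John', 'NY') tuple appears twice

# Remove duplicates by selecting only distinct tuples
unique = conn.execute("SELECT DISTINCT id, name, city FROM customers").fetchall()
print(unique)
```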
⚖️ 5. Data Value Conflict Detection and Resolution¶
Definition¶
A data value conflict occurs when the same entity has different attribute values in different data sources.
| Source | Customer | City |
|---|---|---|
| CRM | John Doe | “New York” |
| ERP | John Doe | “NYC” |
Detection¶
- Compare same‑entity attributes across sources.
- Use string similarity (e.g., Levenshtein distance) or standardized dictionaries.
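Both detection techniques can be combined in a small sketch: a textbook dynamic-programming Levenshtein distance plus a standardized alias dictionary (the dictionary entries and the distance threshold are illustrative assumptions):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Standardized dictionary mapping known aliases to a canonical value
CITY_ALIASES = {"nyc": "New York", "new york": "New York"}

def detect_conflict(v1: str, v2: str, max_dist: int = 2) -> bool:
    """Values conflict if, after alias lookup, they still differ beyond a small edit distance."""
    c1 = CITY_ALIASES.get(v1.lower(), v1)
    c2 = CITY_ALIASES.get(v2.lower(), v2)
    return levenshtein(c1.lower(), c2.lower()) > max_dist

print(detect_conflict("New York", "NYC"))     # False: both resolve to 'New York'
print(detect_conflict("New York", "Boston"))  # True: genuinely different values
```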
Resolution Strategies¶
| Strategy | Description | Example |
|---|---|---|
| Rule-based | Apply business rules | Prefer ERP data for “City” |
| Confidence score | Assign trust levels to sources | Weight by reliability |
| Most recent | Use timestamp to pick latest | Choose data from latest update |
| Merge | Combine differing values | City = “New York (NYC)” |
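Two of these strategies sketched in Python; the source names, timestamps, and trust weights are illustrative assumptions:

```python
from datetime import date

# Conflicting values for the same attribute, tagged with source metadata
candidates = [
    {"source": "CRM", "value": "New York", "updated": date(2024, 1, 10), "trust": 0.6},
    {"source": "ERP", "value": "NYC",      "updated": date(2024, 3, 5),  "trust": 0.9},
]

def resolve_most_recent(cands):
    """'Most recent' strategy: pick the value with the latest timestamp."""
    return max(cands, key=lambda c: c["updated"])["value"]

def resolve_by_confidence(cands):
    """'Confidence score' strategy: pick the value from the most trusted source."""
    return max(cands, key=lambda c: c["trust"])["value"]

print(resolve_most_recent(candidates))    # 'NYC': ERP was updated later
print(resolve_by_confidence(candidates))  # 'NYC': ERP is weighted as more reliable
```

A real resolver would typically layer these: apply business rules first, then fall back to confidence or recency when the rules are silent.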
```mermaid
graph TD
A[Conflicting Values] --> B[Detection Algorithm]
B --> C[Apply Rules / Scoring]
C --> D[Resolved Value]
```

🧩 6. Example Integration Scenario¶
```mermaid
graph LR
A[CRM System] --> D[Entity Matching]
B[ERP System] --> D
C[Web Logs] --> D
D --> E[Unified Customer View]
E --> F[Clean and Consistent Data Warehouse]
```

Step‑wise Illustration¶
| Step | Operation | Result |
|---|---|---|
| 1 | Match “John Smith” across systems | Entity identified |
| 2 | Remove duplicates | Unique records |
| 3 | Resolve city conflicts | “New York” unified |
| 4 | Remove redundant fields | Reduced schema |
| 5 | Store unified customer data | Ready for analytics |
🧠 7. Summary Table¶
| Task | Description | Techniques |
|---|---|---|
| Entity Identification | Match entities across sources | String similarity, ML classification |
| Redundancy Analysis | Detect duplicated data or attributes | Correlation, schema mapping |
| Tuple Duplication | Detect repeated records | Hashing, grouping, clustering |
| Conflict Resolution | Resolve conflicting values | Rule-based, timestamp, confidence weight |
📘 8. Practice Questions¶
- Define the Entity Identification Problem and explain its challenges.
- How does correlation analysis help in reducing redundancy?
- Explain tuple duplication with SQL and manual examples.
- What are common strategies for data conflict resolution?
- Why is integration critical in enterprise data warehouses?