🧩 Chapter 11 — Data Integration¶
This chapter explains Data Integration, focusing on entity identification, redundancy and correlation analysis, tuple duplication, and data value conflict resolution.
🔗 1. What is Data Integration?¶
Data Integration is the process of combining data from multiple sources into a unified and consistent view.
It’s crucial for analytical systems that rely on data from heterogeneous sources (e.g., databases, files, APIs).
```mermaid
graph TD
A[Source 1] --> E[Integrated Data]
B[Source 2] --> E
C[Source 3] --> E
```

Example¶
- Integrating customer data from CRM, ERP, and web analytics systems.
- Ensures consistent customer IDs, names, and contact details across all sources.
🧠 2. The Entity Identification Problem¶
Definition¶
The Entity Identification Problem (also called Record Linkage or Entity Resolution) occurs when the same real-world entity appears under different names or identifiers across datasets.
| Attribute | System A | System B |
|---|---|---|
| Customer Name | “John Smith” | “J. A. Smith” |
| Customer ID | 1452 | CUST‑001452 |
| Address | 12 Oak Rd, NY | 12 Oak Road, New York |
The challenge is to determine whether these records refer to the same entity.
```mermaid
graph LR
A[Record in Source 1] --> C[Entity Matching]
B[Record in Source 2] --> C
C --> D[Unified Entity Record]
```

Solutions¶
- Rule‑based matching — exact or fuzzy string matching.
- Machine Learning approaches — classify record pairs as “same/different.”
- Unique key assignment — generate global identifiers.
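The rule-based and fuzzy-matching ideas above can be sketched with Python's standard `difflib`; the names and the 0.8 similarity threshold are illustrative assumptions, not a production-ready matcher:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation so 'J. Smith' and 'j smith' compare cleanly."""
    return " ".join(name.lower().replace(".", " ").split())

def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy string match: ratio of matching characters after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_entity("John Smith", "J. Smith"))     # True: high character overlap
print(same_entity("John Smith", "Alice Brown"))  # False: names barely overlap
```

In practice the similarity score would feed a classifier or a set of business rules rather than a single hard threshold.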
🔁 3. Redundancy and Correlation Analysis¶
3.1 Redundancy¶
Redundancy refers to duplicate or repetitive data that doesn’t add new information.
It wastes storage and may bias statistical analysis.
| Scenario | Redundant Data |
|---|---|
| Customer table | Repeated address field |
| Joined tables | Duplicate keys after merge |
```mermaid
graph TD
A[Data Source 1] --> B[Integration Process]
C[Data Source 2] --> B
B --> D[Redundant Attributes Found]
D --> E[Remove or Merge Redundant Data]
```

3.2 Correlation Analysis¶
Correlation analysis helps identify relationships between variables to remove redundant attributes.
Mathematically, for numeric attributes \(X\) and \(Y\) with \(n\) observations, the Pearson correlation coefficient is

$$ r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n\,\sigma_X\,\sigma_Y} $$

where \(\bar{X}, \bar{Y}\) are the means and \(\sigma_X, \sigma_Y\) the standard deviations.
If \(|r_{XY}| \approx 1\) → strong correlation → one attribute may be redundant.
Worked example with \(X = (1, 2, 3)\) and \(Y = (2, 3, 4)\):

| X | Y | X − X̄ | Y − Ȳ | Product |
|---|---|---|---|---|
| 1 | 2 | −1 | −1 | 1 |
| 2 | 3 | 0 | 0 | 0 |
| 3 | 4 | +1 | +1 | 1 |

Here \(\bar{X} = 2\) and \(\bar{Y} = 3\), so \(\sum (x_i - \bar{X})(y_i - \bar{Y}) = 2\) and \(\sigma_X = \sigma_Y = \sqrt{2/3}\), giving

$$ r_{XY} = \frac{2}{3 \cdot (2/3)} = 1 $$

Since \(Y = X + 1\), the attributes are perfectly correlated and one of them is redundant.
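The hand computation for \(X = (1, 2, 3)\), \(Y = (2, 3, 4)\) can be checked with a short Python sketch (using population standard deviations, to match the formula):

```python
import math

def pearson(xs, ys):
    """Pearson r_XY = sum((x - mx)(y - my)) / (n * sigma_x * sigma_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 3, 4]))  # ≈ 1.0: Y = X + 1, perfectly correlated
```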
🧬 4. Tuple Duplication¶
Definition¶
Tuple duplication occurs when the same data record appears multiple times in a dataset, either within or across sources.
```mermaid
graph TD
A[Source 1] --> D[Duplicate Detector]
B[Source 2] --> D
D --> E[Remove or Merge Duplicates]
```

Causes¶
- Repeated data imports
- Manual data entry
- Join operations without distinct filters
Detection & Removal¶
| Method | Description |
|---|---|
| Exact matching | Compare full tuples |
| Key-based | Check duplicate primary keys |
| Hash-based | Use hash functions for large datasets |
| Clustering | Group similar tuples |
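The hash-based method can be sketched in a few lines of Python: each tuple is hashed and only the first occurrence is kept, so full tuples need not be held in memory for wide records (the sample rows are illustrative):

```python
import hashlib

def dedupe(rows):
    """Keep the first occurrence of each tuple, using a SHA-256 digest as the key."""
    seen, unique = set(), []
    for row in rows:
        digest = hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique

rows = [(1, "John", "NY"), (2, "Alice", "TX"), (1, "John", "NY")]
print(dedupe(rows))  # the repeated (1, 'John', 'NY') tuple is removed
```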
Example¶
| ID | Name | City |
|---|---|---|
| 1 | John | NY |
| 2 | Alice | TX |
| 1 | John | NY |
→ Duplicate tuple (ID=1).
SQL Example:
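A sketch of a duplicate-detection query, run here against an in-memory SQLite table via Python's built-in `sqlite3` (the table and column names follow the example above and are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "John", "NY"), (2, "Alice", "TX"), (1, "John", "NY")],
)

# Detect duplicated tuples: group identical rows, keep groups with count > 1
dups = conn.execute("""
    SELECT id, name, city, COUNT(*) AS n
    FROM customers
    GROUP BY id, name, city
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)  # the (1, 'John', 'NY') tuple appears twice

# Remove duplicates by selecting only distinct tuples
unique = conn.execute("SELECT DISTINCT id, name, city FROM customers").fetchall()
print(unique)
```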
⚖️ 5. Data Value Conflict Detection and Resolution¶
Definition¶
A data value conflict occurs when the same entity has different attribute values in different data sources.
| Source | Customer | City |
|---|---|---|
| CRM | John Doe | “New York” |
| ERP | John Doe | “NYC” |
Detection¶
- Compare same‑entity attributes across sources.
- Use string similarity (e.g., Levenshtein distance) or standardized dictionaries.
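Both detection techniques can be combined in a small sketch: a textbook dynamic-programming Levenshtein distance plus a standardized alias dictionary (the dictionary entries and the distance threshold are illustrative assumptions):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Standardized dictionary mapping known aliases to a canonical value
CITY_ALIASES = {"nyc": "New York", "new york": "New York"}

def detect_conflict(v1: str, v2: str, max_dist: int = 2) -> bool:
    """Values conflict if, after alias lookup, they still differ beyond a small edit distance."""
    c1 = CITY_ALIASES.get(v1.lower(), v1)
    c2 = CITY_ALIASES.get(v2.lower(), v2)
    return levenshtein(c1.lower(), c2.lower()) > max_dist

print(detect_conflict("New York", "NYC"))     # False: both resolve to 'New York'
print(detect_conflict("New York", "Boston"))  # True: genuinely different values
```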
Resolution Strategies¶
| Strategy | Description | Example |
|---|---|---|
| Rule-based | Apply business rules | Prefer ERP data for “City” |
| Confidence score | Assign trust levels to sources | Weight by reliability |
| Most recent | Use timestamp to pick latest | Choose data from latest update |
| Merge | Combine differing values | City = “New York (NYC)” |
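Two of these strategies sketched in Python; the source names, timestamps, and trust weights are illustrative assumptions:

```python
from datetime import date

# Conflicting values for the same attribute, tagged with source metadata
candidates = [
    {"source": "CRM", "value": "New York", "updated": date(2024, 1, 10), "trust": 0.6},
    {"source": "ERP", "value": "NYC",      "updated": date(2024, 3, 5),  "trust": 0.9},
]

def resolve_most_recent(cands):
    """'Most recent' strategy: pick the value with the latest timestamp."""
    return max(cands, key=lambda c: c["updated"])["value"]

def resolve_by_confidence(cands):
    """'Confidence score' strategy: pick the value from the most trusted source."""
    return max(cands, key=lambda c: c["trust"])["value"]

print(resolve_most_recent(candidates))    # 'NYC': ERP was updated later
print(resolve_by_confidence(candidates))  # 'NYC': ERP is weighted as more reliable
```

A real resolver would typically layer these: apply business rules first, then fall back to confidence or recency when the rules are silent.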
```mermaid
graph TD
A[Conflicting Values] --> B[Detection Algorithm]
B --> C[Apply Rules / Scoring]
C --> D[Resolved Value]
```

🧩 6. Example Integration Scenario¶
```mermaid
graph LR
A[CRM System] --> D[Entity Matching]
B[ERP System] --> D
C[Web Logs] --> D
D --> E[Unified Customer View]
E --> F[Clean and Consistent Data Warehouse]
```

Step‑wise Illustration¶
| Step | Operation | Result |
|---|---|---|
| 1 | Match “John Smith” across systems | Entity identified |
| 2 | Remove duplicates | Unique records |
| 3 | Resolve city conflicts | “New York” unified |
| 4 | Remove redundant fields | Reduced schema |
| 5 | Store unified customer data | Ready for analytics |
🧠 7. Summary Table¶
| Task | Description | Techniques |
|---|---|---|
| Entity Identification | Match entities across sources | String similarity, ML classification |
| Redundancy Analysis | Detect duplicated data or attributes | Correlation, schema mapping |
| Tuple Duplication | Detect repeated records | Hashing, grouping, clustering |
| Conflict Resolution | Resolve conflicting values | Rule-based, timestamp, confidence weight |
📘 8. Practice Questions¶
- Define the Entity Identification Problem and explain its challenges.
- How does correlation analysis help in reducing redundancy?
- Explain tuple duplication with SQL and manual examples.
- What are common strategies for data conflict resolution?
- Why is integration critical in enterprise data warehouses?