๐ŸŽ“ Data Mining โ€” Midterm

5 classic Qs ยท 50 min ยท case-study
๐Ÿšจ ehaa Mode โ€” 2 Hour Power Study ๐Ÿšจ Everything you need to pass with 70+ โ€” simple language, more practice, less reading.

๐ŸŽฏ Read This First (2 minutes)

Your exam in 1 paragraph

5 classic questions (no multiple choice), 50 minutes total, papers collected at 40 min. Case-study style — she gives you a situation, you explain what to do. The topics she said are important: KDD, data architecture (warehouse/mart/lake/mesh), algorithms, correlation & association, and the finance color-panels example.

๐Ÿ”ฅ If you only remember 6 things, remember these:

  1. KDD = 5 steps: Select โ†’ Clean โ†’ Transform โ†’ Mine โ†’ Interpret
  2. Data Warehouse = SIVT: Subject-oriented, Integrated, non-Volatile, Time-variant
  3. 4 architectures: Warehouse (clean), Mart (small focused), Lake (raw everything), Mesh (every team owns their own)
  4. Data Mesh 4 principles: Domain ownership, Data-as-product, Self-serve, Federated governance
  5. Predictive (classify/regress โ€” known answer) vs Descriptive (cluster/associate โ€” find unknown)
  6. Finance colors (pixel panels) beat line charts because they show 16,000+ points at once and let you see patterns across 4 markets

โฐ 2-Hour Study Plan

| Time | What to do |
| --- | --- |
| 0:00–0:20 | Read section "Like I'm 3" below — get the big picture |
| 0:20–0:40 | Memorize section "The Big 6" — these are almost guaranteed exam topics |
| 0:40–1:20 | Do ALL Flash Q&A — cover the answer, guess, then check |
| 1:20–1:50 | Read Mini Case Studies — this is the EXACT exam format |
| 1:50–2:00 | Read The Formula + Final Recap → walk in confident |

๐Ÿ‘ถ Every Topic Explained Like You're 3

๐Ÿ“š Data Mining
Imagine a HUGE box full of toys mixed together. Data mining = digging through the box to find patterns. Like "every time I see a red car, there's a yellow truck next to it!" The computer finds patterns humans never noticed.
Real definition: The process of finding unknown patterns and relationships in large data to help a business make decisions.
๐Ÿง‘โ€๐Ÿณ KDD (Making a Sandwich)
KDD is like making a sandwich. 5 steps:
  1. Selection โ€” Pick what ingredients you want (data you need)
  2. Preprocessing โ€” Wash the lettuce, throw away bad tomatoes (clean bad data)
  3. Transformation โ€” Cut everything into pieces that fit the bread (format data)
  4. Data Mining โ€” Put it all together! (run algorithms)
  5. Interpretation โ€” Taste it, tell your friends how it is (show results)
Why it matters: Professor said "KDD is important." You will likely be asked to name and describe these 5 steps.
๐Ÿข Data Warehouse = Big Organized Storage Room
Imagine a BIG tidy room in a company. Everything is organized by topic (all customer stuff in one shelf, all products in another). You don't throw old stuff away โ€” you keep everything for 5โ€“10 years. Everyone in the company uses the SAME names for things.
SIVT = the 4 rules: Subject (by topic) · Integrated (same names) · non-Volatile (don't delete) · Time-variant (keep history).
๐Ÿ—ƒ๏ธ Data Mart = Small Room for One Team
The big warehouse is TOO BIG for one team to search in. So we make a small room that only has the stuff THEY need. The marketing team has their own small room. The finance team has their own small room. They all copy from the big warehouse.
3 types: Dependent (copies from warehouse โ€” BEST), Independent (makes its own โ€” RISKY, everyone gets different numbers), Hybrid (both).
๐Ÿž๏ธ Data Lake = Big Messy Pool
Imagine a huge lake where you throw EVERYTHING in โ€” toys, books, photos, videos, messy notes. No organization. You fish out what you need later and THEN organize it. Good because: you save everything. Bad because: if you're lazy it becomes a swamp โ€” dirty, smelly, can't find anything.
Key word: Schema-on-READ. Structure is decided when you READ the data, not when you put it in. Perfect for data scientists who want raw data.
๐Ÿ•ธ๏ธ Data Mesh = Every Team Has Their Own Toy Box
Big companies got tired of ONE team controlling ALL data. So now: payments team has their own toy box, customers team has their own, logistics team has their own. They all share with each other, but each team is boss of their own box. There are some family rules (don't break other kids' toys = privacy, security).
4 principles: (1) Domain ownership โ€” your team owns your data. (2) Data as a product โ€” treat your data like a real product with docs and quality. (3) Self-serve platform โ€” shared tools so anyone can build. (4) Federated governance โ€” shared rules.
๐ŸŽจ Pixel Visualization = Wall of Colored Squares
You have 16,000 stock prices. That's way too many to draw on a line chart (it becomes a mess). So instead, make each price a colored square. Dark/purple = price was LOW. Bright/green = price was HIGH. Now you have a wall of squares. When you see a big purple area โ†’ stocks were low for a long time. When you see a big green area โ†’ stocks were high. When all 4 walls (IBM, Dollar, Dow, Gold) change color at the same spot โ†’ big world event happened.
The trick: The Peano-Hilbert curve โ€” a snake path that keeps squares that are close in TIME also close in SPACE. That's why colors form blocks/patches.
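The slides describe the curve but give no code; below is a minimal sketch of the standard iterative Hilbert mapping in pure Python (the function name `d2xy` and the grid size are my choices, and the grid side `n` must be a power of two):

```python
def d2xy(n, d):
    """Map a 1-D index d (a point in time) to (x, y) on an n×n Hilbert grid.

    n must be a power of two. Consecutive d values land on adjacent cells,
    which is exactly why time-close prices form contiguous color blocks.
    """
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:           # rotate the quadrant so the curve stays connected
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# snake order for a 4×4 grid: 16 consecutive days, each adjacent to the next
path = [d2xy(4, d) for d in range(16)]
```

Coloring pixel `d2xy(n, d)` by the price on day `d` reproduces the block structure described above.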
๐ŸŽฏ Predictive vs Descriptive
Predictive = I ALREADY know the question. "Is this email spam? Will it rain tomorrow? What's the house price?" Computer learns from examples with answers.

Descriptive = I DON'T know what I'm looking for. Just group/explore/find patterns. "Put similar customers together. Find things bought together."
Examples: Predictive โ†’ Classification (spam), Regression (price), Time-series (forecast). Descriptive โ†’ Clustering (K-means), Association rules (beer+nappies), Anomaly (fraud).
๐Ÿบ The Beer & Nappies Story
Walmart looked at billions of receipts. The computer found: on Friday evenings, lots of people buy BEER and BABY DIAPERS (nappies) together. Why? New dads were sent to get nappies after work โ€” they grabbed beer too! Walmart put beer next to nappies in stores โ€” both sales went UP. This is association rules. Nobody asked the computer to check this โ€” it FOUND it.
Exam point: This example proves data mining finds things you weren't looking for. That's the whole point.
๐Ÿ“Š Statistics vs ML vs Data Mining
Statistics = "I THINK X is true. Let me check." (Top-down, starts with a guess)
Machine Learning = "Learn from examples so you can predict new stuff." (Algorithms)
Data Mining = "Look at all this data, find something useful for the business!" (Bottom-up, no starting guess โ€” and business value is the goal)
Key phrase: Data Mining is "bottom-up explorative". Statistics is "top-down confirmative". They help each other.
๐Ÿ“ Measuring Similarity (Distance)
How similar are 2 customers? Measure the "distance" between them.
  • Manhattan โ€” walk along streets, count blocks (sum of differences)
  • Euclidean โ€” fly in a straight line (Pythagoras formula) โ€” MOST COMMON
  • Cosine โ€” compare DIRECTION, not size (good for text documents)
Small distance = very similar. Big distance = different.
Minkowski is the general formula. When h=1 โ†’ Manhattan. When h=2 โ†’ Euclidean. When h=โˆž โ†’ Supremum.
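A sketch of these measures in pure Python (the helper names and the test vectors are mine, not from the slides):

```python
import math

def minkowski(a, b, h):
    """Minkowski distance: h=1 gives Manhattan, h=2 gives Euclidean."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    """The h -> infinity limit: the single largest coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Compares DIRECTION only: 1.0 = same direction, regardless of length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

p, q = [1, 2], [4, 6]
manhattan = minkowski(p, q, 1)  # |1-4| + |2-6| = 7 blocks walked
euclidean = minkowski(p, q, 2)  # sqrt(3² + 4²) = 5, the straight line
```

Note the cosine behaviour the text mentions: `cosine_similarity([1, 2], [2, 4])` is 1.0 — the second vector is just a "longer document" pointing the same way.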
๐ŸŽš๏ธ OLAP vs Data Mining
OLAP = I have a question, let me check. "Did sales drop last Monday?" I ask, computer checks.
Data Mining = Computer finds things I didn't ask. "Sales drop on rainy Mondays when there's football."
They are FRIENDS, not enemies. OLAP is used BEFORE data mining to explore.
Intelligence hierarchy: Query & Reporting (basic) โ†’ Data Retrieval โ†’ OLAP โ†’ Data Mining (smartest, hardest).

๐ŸŽฏ The Big 6 โ€” Memorize These Exactly

These 6 are the MOST likely to appear in the exam. If you memorize them word-for-word, you'll have material for at least 3 of the 5 questions.

1๏ธโƒฃ Definition of Data Mining (Giudici, 2003)

"The process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results for the owner of the database."

2๏ธโƒฃ The 5 KDD Steps

Selection โ†’ Preprocessing โ†’ Transformation โ†’ Data Mining โ†’ Interpretation

Memory trick: "Silly People Throw Dirty Ice" (bad but memorable!)

3๏ธโƒฃ Data Warehouse SIVT (Inmon, 1996)

  • Subject-Oriented โ€” organized by Customer/Product, not by billing app
  • Integrated โ€” same encoding (M/F everywhere, not 1/0 here and Male/Female there)
  • Non-Volatile — never changed, just append new rows with timestamps
  • Time-Variant โ€” 5โ€“10 years of history with time dimension on every record

4๏ธโƒฃ Data Mesh 4 Principles (Dehghani, 2019)

  1. Domain Ownership โ€” teams own their own data
  2. Data as a Product โ€” SLAs, docs, quality, versioning
  3. Self-Serve Infrastructure โ€” shared platform for all teams
  4. Federated Governance โ€” global rules, local implementation

5๏ธโƒฃ 4 Architectures โ€” One-Line Each

  • Warehouse โ€” structured, clean, governed โ†’ use for reports & finance
  • Mart โ€” subset of warehouse โ†’ use for specific team's fast access
  • Lake โ€” raw everything, schema-on-read โ†’ use for ML & data science
  • Mesh โ€” decentralized, domain-owned โ†’ use for huge agile organizations

6๏ธโƒฃ Why Colored Pixels Beat Line Charts (Finance Example)

  • 16,350 data points โ€” line chart can only show ~1,000
  • Peano-Hilbert curve keeps time-close points space-close โ†’ stable periods become visible blocks
  • Green = high, Purple = low โ€” eye reads patterns instantly
  • 4 panels side-by-side (IBM, Dollar, Dow, Gold) reveal GLOBAL events when all 4 change color at the same region
  • A line chart cannot compare 4 markets visually at once

โšก Flash Q&A โ€” 30 Rapid Questions

Cover each answer with your hand. Try to answer aloud. Then check. If you get it wrong โ†’ read once and move on. Come back to it later.

Q1. What does KDD stand for?
A: Knowledge Discovery in Databases.
Q2. List the 5 steps of KDD in order.
A: Selection → Preprocessing → Transformation → Data Mining → Interpretation.
Q3. What does SIVT stand for?
A: Subject-oriented, Integrated, non-Volatile, Time-variant. The 4 characteristics of a data warehouse.
Q4. Who defined the data warehouse in 1996?
A: William Inmon, called the "father of the data warehouse".
Q5. Name the 3 types of data mart.
A: Dependent (from a warehouse — best), Independent (from source systems — risky), Hybrid (both).
Q6. What's the difference between schema-on-write and schema-on-read?
A: Schema-on-write = structure decided BEFORE loading (warehouse — strict). Schema-on-read = structure decided when READING (lake — flexible).
Q7. What is a "data swamp"?
A: A data lake with no metadata, no governance, no cataloguing. Data is there but nobody can find or trust it.
Q8. List the 4 Data Mesh principles.
A: Domain ownership, Data as a product, Self-serve infrastructure, Federated governance.
Q9. Who coined "Data Mesh" and when?
A: Zhamak Dehghani, 2019.
Q10. What is OLAP?
A: Online Analytical Processing. User-driven hypothesis testing using multidimensional cubes. Fails with hundreds of variables.
Q11. What's the intelligence hierarchy (lowest to highest)?
A: Query & Reporting → Data Retrieval → OLAP → Data Mining.
Q12. Is data mining top-down or bottom-up?
A: Bottom-up (explorative). Statistics is top-down (confirmative).
Q13. Name 3 predictive mining tasks.
A: Classification, Regression, Time-series forecasting.
Q14. Name 3 descriptive mining tasks.
A: Clustering, Association rules, Anomaly detection.
Q15. What is an association rule? Give the most famous example.
A: An "if A then B" pattern in transactions. Famous example: Walmart found {diapers} → {beer} — new fathers bought both on Friday evenings.
Q16. Name the 4 attribute types.
A: Nominal (no order), Binary (2 values), Ordinal (has order), Numeric (measurable).
Q17. What is an outlier? Which measure of central tendency is robust to them?
A: An outlier is an unusual/extreme value. The MEDIAN is robust to outliers (the mean is NOT — one billionaire pulls the mean up).
Q18. What is IQR and how is it calculated?
A: Interquartile Range = Q3 − Q1. Measures the spread of the middle 50% of data.
Q19. Name the 4 visualization categories.
A: Pixel-oriented, Geometric projection, Icon-based, Hierarchical.
Q20. What is the Peano-Hilbert curve? Why use it?
A: A space-filling curve that visits every cell in a grid. Used in pixel-oriented viz because it keeps time-close points spatially close — so stable periods become visible blocks.
Q21. In the finance example, what do green and purple colors mean?
A: Green = HIGH values. Purple = LOW values.
Q22. What do the 4 finance panels show?
A: IBM (stock), Dollar (currency), Dow Jones (index), Gold (commodity). Daily values from Jan 1987 to Mar 1993. 16,350 data points.
Q23. When all 4 panels change color at the same place, what does it mean?
A: A GLOBAL EVENT — a shock affecting all markets together. Could be a financial crisis, oil shock, war, or major policy change.
Q24. What's the formula for Minkowski distance with h=2? What's that called?
A: √((x₁−y₁)² + (x₂−y₂)² + ... + (xₚ−yₚ)²) — this is EUCLIDEAN distance (straight-line).
Q25. When is Cosine similarity preferred over Euclidean?
A: For sparse, high-dimensional data like text documents — where DIRECTION matters more than magnitude. A 10-word doc and a 1000-word doc about the same topic can still have cosine similarity = 1.
Q26. What's the difference between classification and clustering?
A: Classification = SUPERVISED, has known labels (spam/not-spam). Clustering = UNSUPERVISED, groups are discovered from data without labels.
Q27. Name the 7 phases of the data mining process (Giudici).
A: A. Define Objectives → B. Select & Organize Data → C. Exploratory Analysis → D. Specify Methods → E. Data Analysis → F. Evaluate & Compare → G. Implement Decisions.
Q28. What is metadata and why does it matter?
A: "Data about data" — describes how fields are defined, when they changed, where data came from. Without it, a warehouse is "a library without a catalogue" — nobody can find or trust anything.
Q29. Name 3 real companies that use ALL 4 architectures simultaneously.
A: Netflix, Spotify, Walmart. Each uses warehouse + mart + lake + mesh for different data types and use cases.
Q30. What's the "Virtuous Circle of Knowledge"?
A: Berry & Linoff (1997): Data mining → strategic decision → creates new measurement needs → new business needs → new data → more mining. It's a self-reinforcing feedback loop.
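Q17 and Q18 above are easy to verify numerically; a sketch with Python's statistics module (the income figures are invented):

```python
import statistics

incomes = [30, 32, 35, 38, 40]       # ordinary values (in k€)
with_outlier = incomes + [10_000]    # one billionaire-style outlier

# the mean is dragged way up by the outlier; the median barely moves
mean_after = statistics.mean(with_outlier)      # huge
median_after = statistics.median(with_outlier)  # still near the middle

# IQR = Q3 - Q1, the spread of the middle 50% of the data
q1, q2, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
```

This is the "one billionaire pulls the mean up" effect in two lines: the median only shifts from 35 to 36.5, while the mean jumps past 1,000.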

๐Ÿ“ Mini Case Studies โ€” The Exam Format

This is the EXACT format your exam will use. She gives you a situation → you apply concepts → you recommend/explain. Practice these!

Case 1: A hospital has 10 years of patient records. Names are spelled differently ("John Smith", "J. Smith", "Smith, John"). Dates are in different formats. How do you handle this before data mining?
Your answer:

This is a Preprocessing problem โ€” step 2 of KDD. Steps to take:

  1. Data cleansing โ€” standardize names using fuzzy matching (identify "John Smith" = "J. Smith")
  2. Format normalization โ€” convert all dates to one format (ISO: YYYY-MM-DD)
  3. Handle missing values โ€” either fill with defaults, or drop incomplete records
  4. De-duplicate โ€” remove or merge duplicate records
  5. Apply integration rules โ€” like a data warehouse would (subject-oriented view of "Patient")
Verdict: Preprocessing (KDD step 2). Without this, any mining algorithm will give confidently wrong answers.
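Steps 1, 2 and 4 of that answer can be sketched with the standard library (the date formats tried, the 0.8 threshold, and all names are illustrative assumptions, not from the case):

```python
from datetime import datetime
from difflib import SequenceMatcher

def normalize_name(name):
    """Crude canonical form: 'Smith, John' -> 'john smith'."""
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    return name.lower().strip()

def same_patient(a, b, threshold=0.8):
    """Fuzzy match after normalization; the threshold is a tunable guess."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio() >= threshold

def to_iso(date_str):
    """Try a list of assumed input formats; normalize to ISO YYYY-MM-DD."""
    for fmt in ("%d/%m/%Y", "%m-%d-%Y", "%Y-%m-%d", "%d %b %Y"):
        try:
            return datetime.strptime(date_str, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing
```

Returning `None` instead of a guess is the safe choice in healthcare data: a wrongly "repaired" record is worse than a flagged one.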
Case 2: An online retailer has: (a) 5 years of clean sales transactions, (b) 50 million unstructured product reviews, (c) real-time clickstream data from their website. Design an architecture strategy.
Your answer:

Use ALL the architectures together:

  • Data Warehouse for clean sales transactions. Structured, needed for financial reporting, governance-critical. Apply SIVT characteristics.
  • Data Lake for the 50 million unstructured product reviews (text data) AND the real-time clickstream. Both are unstructured/semi-structured, need flexibility, for data scientists to mine (sentiment analysis, recommendation).
  • Data Marts built from the warehouse for specific teams: Marketing mart, Finance mart, Inventory mart. Dependent marts ensure consistency.
  • Consider Data Mesh principles if the company has many autonomous product teams that need to publish their own data products โ€” but apply federated governance.
Verdict: Hybrid architecture โ€” warehouse + lake + marts, with mesh principles if the org is large enough.
Case 3: A bank CEO asks: "We have our data all in one warehouse. The compliance team, fraud team, marketing team, and risk team all complain it's too slow and complex. What do you recommend?"
Your answer:

The problem isn't the warehouse โ€” it's that 4 teams are all fighting over the same complex resource. Recommend creating 4 dependent data marts (one per team):

  • Compliance Mart โ€” regulatory data, audit trails
  • Fraud Mart โ€” transaction patterns, risk scores
  • Marketing Mart โ€” customer demographics, campaign data
  • Risk Mart โ€” exposure, credit scores

Each team gets FAST, FOCUSED access to only the data they need. Because marts are dependent (derived from the central warehouse), all teams get consistent numbers. If the warehouse says "revenue = X", all marts inherit this โ€” no "multiple versions of truth" problem.

Verdict: Keep the warehouse + build 4 dependent data marts.
Case 4: A finance professor shows you colored pixel panels of IBM, Dollar, Dow Jones, and Gold from 1987-1993. She asks: "Why not just use a simple line chart?"
Your answer (THE PROFESSOR'S EXACT QUESTION):
  1. Data density: 16,350 daily values โ€” a line chart can only handle ~1,000 points before becoming unreadable. The pixel viz shows all 16,350 at once.
  2. Temporal locality preserved: The Peano-Hilbert curve puts time-close points spatially close. Stable periods appear as colored blocks the eye instantly sees.
  3. Pattern visibility: Green clusters = bull markets (high prices). Purple clusters = bear markets (low prices). Hard to see on a crowded line chart.
  4. Cross-variable visibility: 4 panels side-by-side. A line chart with 4 overlapping lines becomes noise. With panels, you compare at a glance.
  5. Global event detection: If all 4 markets (stock, currency, index, commodity) change color in the same region โ†’ SYSTEMIC SHOCK (e.g. 1987 Black Monday). A line chart cannot reveal this.
Verdict: Colored pixel panels compress massive data into visible structure. They reveal stable periods, volatility, and global events that a simple chart cannot show.
Case 5: A supermarket wants to know what products are frequently bought together. What type of mining is this? Which algorithm? Give a famous example.
Your answer:

This is Descriptive mining, specifically Association Rule mining (a local method in Giudici's classification). Algorithm: Apriori (classic) or FP-Growth (faster for big data).

It finds rules of the form {A, B} โ†’ {C} with support and confidence measures. "Support" = how often the items appear together. "Confidence" = if A is bought, how likely is C also bought?

Famous example: Walmart discovered {diapers} โ†’ {beer} โ€” on Friday evenings, new fathers bought both. Walmart moved beer displays next to diapers; sales of both rose. This is the canonical example of value from association rule mining.

Verdict: Descriptive + Association Rules + Apriori algorithm + beer-nappies example.
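The support and confidence measures from the answer above can be computed directly; a minimal sketch over invented toy baskets:

```python
def support(baskets, items):
    """Fraction of baskets that contain ALL of the given items."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """P(consequent in basket | antecedent in basket)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"bread", "milk"},
]
s = support(baskets, {"nappies", "beer"})       # 2 of 4 baskets = 0.5
c = confidence(baskets, {"nappies"}, {"beer"})  # 0.5 / 0.75 = 2/3
```

Apriori makes this scale: it prunes the search because any superset of an infrequent itemset must also be infrequent.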
Case 6: A startup wants to predict which customers will stop using their app (churn). They have 2 years of user activity logs. What's your plan?
Your answer: Walk through KDD:

  1. Selection: Extract relevant data โ€” login frequency, last activity date, features used, support tickets, in-app purchases, demographics.
  2. Preprocessing: Clean missing values, standardize formats, remove test accounts.
  3. Transformation: Engineer features like "days since last login", "activity trend last 30 days", normalize numeric features.
  4. Data Mining: This is Predictive Classification (binary: will churn / won't churn). Try Decision Tree (interpretable), Random Forest (accurate), Logistic Regression (baseline).
  5. Interpretation: Compare models on held-out data. Deploy best model to CRM. High-risk customers get a retention email.

Monitor after deployment โ€” the virtuous circle of knowledge kicks in as new churn outcomes feed back to retrain the model.

Verdict: Supervised classification, KDD-based plan, Decision Tree or Random Forest algorithm.
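To make the mining step concrete, here is the simplest possible classifier — a one-feature decision stump on "days since last login". The data and the learned threshold are invented for illustration; a real plan would use a proper tree or forest implementation:

```python
def fit_stump(rows, labels):
    """Pick the threshold on one numeric feature that minimizes errors.

    rows: feature values; labels: 1 = churned, 0 = stayed.
    Predict 1 when the feature value is >= the threshold.
    """
    best = None
    for t in sorted(set(rows)):
        preds = [1 if x >= t else 0 for x in rows]
        errors = sum(p != y for p, y in zip(preds, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

# invented training data: days since last login -> did the user churn?
days    = [1, 2, 3, 5, 40, 45, 60, 90]
churned = [0, 0, 0, 0, 1,  1,  1,  1]
threshold = fit_stump(days, churned)

def predict(x):
    return 1 if x >= threshold else 0
```

A full decision tree is just this idea applied recursively, splitting on the best feature at every node — which is why trees stay interpretable.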

๐Ÿงช The Universal Answer Formula

For ANY case-study question on the exam, use this 4-step formula. It works every time.

  1. DEFINE the key term from the question. (1–2 sentences from the glossary.) Example: "A Data Warehouse is an integrated, subject-oriented, non-volatile, time-variant collection of data (Inmon, 1996)."
  2. APPLY the concept to the specific situation she gives. "In this bank's case, the clean transaction data clearly belongs in a warehouse because..."
  3. GIVE an example or use a real-world case. "This is like Walmart's warehouse — they integrated POS data from every store..."
  4. STATE trade-offs, risks, or alternatives. "However, a risk is the data swamp problem if governance is weak. An alternative would be..."
๐ŸŽฏ Bonus structure tips
  • Use bullet points when listing things โ€” easier to mark, clearer to read.
  • Name-drop the authors: Inmon (warehouse), Dehghani (mesh), Berry & Linoff (virtuous circle), Giudici (definition) โ€” shows you read the material.
  • Use the exam vocabulary: SIVT, KDD, OLAP, schema-on-read, federated governance, etc. โ€” even if a simple word would work.
  • If stuck, list pros/cons: You'll get partial marks for showing you understand the trade-offs.
  • Don't leave blanks: A partial answer gets partial marks. An empty answer gets zero.

๐Ÿ“ Sample: Applying the Formula to a Real Question

Question: "An insurance company has 4 different teams, each building their own data mart directly from source systems. Board meetings keep ending in arguments about the numbers. What's happening and what should they do?"

Step 1 โ€” DEFINE: Data marts are thematic, subject-specific subsets of data used by specific business teams. There are 3 types: dependent (from a warehouse), independent (from source systems), and hybrid. This company is using independent marts.

Step 2 โ€” APPLY: Each of the 4 independent marts has its own ETL pipeline and its own interpretation of business rules. So when the board asks "what were our claims last quarter?", each team gives a different answer based on their mart's definitions. This creates the classic "multiple versions of truth" problem.

Step 3 โ€” EXAMPLE: Imagine one team calculates "revenue" as gross sales, another uses net of refunds, a third uses post-commission. All call their number "revenue". Like building 4 different rulers and arguing about who's taller.

Step 4 โ€” FIX + TRADE-OFFS: Convert to dependent marts derived from a central data warehouse. Define "revenue" once in the warehouse; all marts inherit. Trade-off: slower to build initially, requires warehouse investment. But long-term: single source of truth, consistent board numbers, regulatory compliance. This is the Inmon top-down approach.

โœ… Full 4-step answer โ€” probably 90%+ marks.

๐Ÿ”ฅ Final 15-Minute Recap

Read this as the LAST thing before walking into the exam.

๐ŸŽ“ Top-of-mind words to drop in any answer

KDD ยท SIVT ยท Subject-oriented ยท Integrated ยท Non-Volatile ยท Time-Variant ยท Schema-on-read ยท Schema-on-write ยท Federated governance ยท Domain ownership ยท Data-as-a-product ยท Dependent mart ยท Data swamp ยท OLAP ยท Peano-Hilbert curve ยท Virtuous circle of knowledge ยท Bottom-up explorative ยท Top-down confirmative ยท Association rule ยท Supervised ยท Unsupervised ยท Inmon ยท Dehghani ยท Giudici ยท Berry & Linoff

๐Ÿšจ The 5 things most likely to be asked
  1. Explain KDD / the 5 steps of knowledge discovery
  2. Compare/recommend among Warehouse, Mart, Lake, Mesh (case study)
  3. Explain SIVT characteristics of a warehouse
  4. Why colored pixel panels beat a simple line chart (finance example)
  5. Predictive vs Descriptive mining + give examples (beer-nappies)
โš ๏ธ Tricky vocabulary that might confuse you
  • Covariate = feature = variable = attribute โ€” all the same thing (a column)
  • Supervised = predictive = asymmetrical = direct
  • Unsupervised = descriptive = symmetrical = indirect
  • Non-volatile = doesn't change. Volatile = changes.
  • Exogenous = from outside. Data retrieval uses exogenous criteria (user decides).
  • Explorative = bottom-up (data mining). Confirmative = top-down (statistics).
โœ… Exam time strategy (40 min of writing)
  1. Minute 0โ€“3: Read ALL 5 questions. Start with the easiest to build confidence.
  2. Minute 3โ€“35: ~6โ€“7 minutes per question. Use the 4-step formula.
  3. Minute 35โ€“40: Review. Fix typos. Add one more example where you have space.
  4. If you blank: start defining the key term in the question. Writing unlocks memory.
  5. If running out of time: switch to bullet points. Partial credit > nothing.
๐Ÿ€ You've got this! 2 hours of focused study here beats 8 hours of panicking. Walk in confident.

โšก Quick Exam Cheat Sheet READ FIRST

๐ŸŽฏ What your professor said (from classmates' notes)
  • 5 classic questions (not MCQ) ยท 50 minutes ยท papers collected at 40 min
  • Case-study format โ€” she'll give a situation, you explain / decide / apply
  • Focus topics she named:
    • Database Management System of Data Mining (Warehouse / Mart / Lake / Mesh)
    • Algorithms (Decision Tree, Neural Net, K-Means, Apriori, etc.)
    • KDD โ€” "is important"
    • Correlation and Association
    • Finance example โ€” "why would I want to use a picture with colors instead of a simpler one?"
  • Slides 9+ of the Week 4/5 deck โ€” she said she won't print the colorful panels (no color printer), but she may still ask about them
  • So: understand the colored finance visualization even though you won't see it on paper

๐Ÿ“Œ The 10 things you must know cold

  1. KDD = Knowledge Discovery in Databases. 5 steps: Selection โ†’ Preprocessing โ†’ Transformation โ†’ Data Mining โ†’ Interpretation.
  2. Data Mining definition: process of selection, exploration, and modelling of large data to discover unknown regularities, with the aim of producing results useful for the database owner (business advantage).
  3. Statistics vs Data Mining: Statistics is top-down (confirm hypothesis). Data Mining is bottom-up (discover unknown patterns).
  4. Data Warehouse SIVT: Subject-oriented, Integrated, Non-Volatile, Time-Variant (Inmon, 1996).
  5. Data Mart types: Dependent (from warehouse โ€” best), Independent (from sources โ€” risky silos), Hybrid (both).
  6. Data Lake: raw data, any format, schema-on-READ. Risk: becomes a "data swamp" without governance.
  7. Data Mesh 4 principles: Domain ownership, Data-as-a-product, Self-serve platform, Federated governance.
  8. Intelligence hierarchy: Query & Reporting โ†’ Data Retrieval โ†’ OLAP โ†’ Data Mining (capacity grows, difficulty grows).
  9. Core mining tasks: Predictive (classification, regression, time-series) and Descriptive (clustering, association rules, anomaly detection).
  10. Finance pixel viz: 16,350 data points shown via Peano-Hilbert curve. Green = high, purple = low. Cross-panel color sync = global market event.
โœ… Strategy for the 40-minute writing window

Spend the first 3 minutes reading ALL 5 questions. Then give each question ~7 minutes. Write in bullet form if time is short โ€” examiners still give marks. Always DEFINE the key term first, then APPLY it to the case.

๐Ÿ“˜ Week 1 โ€” Introduction to Data Mining W1

1.1 What is Data Mining?

Data Mining is the computational process of discovering patterns, correlations, anomalies, and insights in large datasets using statistical, mathematical, and machine learning techniques. It is also called Knowledge Discovery in Databases (KDD).

Automated โ€” finds patterns without manual hypothesis Scalable โ€” works on massive, complex data Actionable โ€” turns patterns into real decisions
computational process
A series of steps done by a computer.
pattern
A repeating structure or trend in the data (e.g. "people who buy bread also buy milk").
anomaly
Something unusual or different from the normal pattern. Also called outlier.
insight
A useful understanding or realization hidden in the data.

1.2 The KDD Process (5 steps)

Each step feeds into the next. If your data quality is bad at step 2, your final insights at step 5 will also be bad. ("Garbage in, garbage out.")

  1. Data Selection — Identify relevant data sources and extract the target dataset (which tables, which columns, which time period).
  2. Preprocessing — Handle missing values, noise, and inconsistencies (e.g. "M"/"F" vs "Male"/"Female" → standardize).
  3. Transformation — Feature engineering, normalization, dimensionality reduction (convert data into shapes algorithms can use).
  4. Data Mining — Apply algorithms (Decision Tree, Neural Net, K-Means, etc.) to discover patterns and build models.
  5. Interpretation / Evaluation — Evaluate results, visualize, communicate findings to decision-makers.
๐ŸŽฏ Exam tip โ€” KDD is listed as "important" by the professor Memorize the 5 step names AND the order. For a case-study question, be ready to place the situation into the right KDD step. E.g. "a hospital has dirty patient records with typos" โ†’ that's a Preprocessing problem (step 2).

1.3 Core Tasks in Data Mining

๐Ÿ”ฎ Predictive Mining

Predicts a known target variable. Also called supervised, asymmetrical, direct.

  • Classification โ€” assign to categories (spam vs not-spam)
  • Regression โ€” predict numeric value (house price)
  • Time Series โ€” forecast future from past (stock price)

๐Ÿ” Descriptive Mining

Finds unknown structure in data. Also called unsupervised, symmetrical, indirect.

  • Clustering โ€” group similar items (no labels)
  • Association Rules โ€” co-occurrence (beer + nappies)
  • Anomaly Detection โ€” spot outliers (fraud)
Key vocabulary the professor mentioned: "correlation and association"

Correlation = statistical relationship between two variables (e.g. temperature โ†‘ โ†’ ice cream sales โ†‘).

Association rule = a pattern "if A, then B often happens too" found in baskets/transactions. Famous example: {diapers} โ†’ {beer} (Walmart discovered young fathers bought both on Friday evenings).
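The difference between the two terms is visible in code: correlation is a single number over paired numeric variables (Pearson's r, written out below in pure Python), while an association rule counts co-occurrence in baskets. The temperature and sales figures are invented:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: +1 = perfect positive, -1 = negative, 0 = none."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

temperature = [18, 22, 26, 30, 34]
ice_cream_sales = [120, 150, 200, 260, 310]
r = pearson_r(temperature, ice_cream_sales)  # close to +1: strong positive correlation
```

Remember the exam caveat: a high r shows the variables move together, not that one causes the other.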

1.4 Key Algorithms โ€” when to use each

| Algorithm | Type | When to use |
| --- | --- | --- |
| Decision Trees | Predictive (classif./regr.) | Tabular data, need interpretable rules, easy to explain to managers |
| Neural Networks | Predictive | Complex patterns (images, text), when you have lots of data |
| SVM (Support Vector Machine) | Predictive (classification) | Small/medium datasets, high-dimensional features |
| Random Forest | Predictive (ensemble) | Robust, high accuracy, general-purpose workhorse |
| K-Means | Descriptive (clustering) | Group customers/products, need fast unsupervised grouping (sensitive to choice of k) |
| Apriori | Descriptive (association) | Market basket analysis, find "A → B" rules |
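A minimal K-Means sketch on 1-D data shows the two-step loop behind the table's clustering entry (the spending data and starting centers are invented; as the table notes, the result is sensitive to k and to the initial centers):

```python
def kmeans_1d(points, centers, iters=10):
    """Lloyd's algorithm on 1-D data: assign each point to its nearest
    center, then move each center to the mean of its cluster."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = sorted(sum(v) / len(v) for v in clusters.values() if v)
    return centers

# two obvious customer-spending groups; deliberately bad starting centers
spend = [10, 12, 11, 90, 95, 92]
final = kmeans_1d(spend, [0, 50])  # centers migrate toward the two groups
```

Even from the poor starting guess [0, 50], the centers converge onto the low-spend and high-spend groups in one pass — the same mechanics scale to many dimensions by swapping in Euclidean distance.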

1.5 Real-World Applications

Healthcare โ€” predicting patient readmission, drug discovery, disease outbreak detection.
Finance โ€” fraud detection, credit scoring, algorithmic trading, risk assessment.
Retail โ€” recommendation engines, customer segmentation, demand forecasting.
Education โ€” learning analytics, early warning systems, adaptive paths.
Manufacturing โ€” predictive maintenance, quality control, supply-chain optimization.
Social Media โ€” sentiment analysis, trend detection, user behavior modeling.

1.6 Challenges & Ethics

Technical challenges

Ethical considerations

๐Ÿ“— Week 2 โ€” Foundations & Process W2

2.1 Formal Definition (Giudici, 2003)

Memorize this definition โ€” it often appears in exams

"Data mining is the process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results for the owner of the database."

Three key ideas:

๐Ÿ’ก The "Virtuous Circle of Knowledge" (Berry & Linoff, 1997)

The strategic decision from data mining creates new measurement needs โ†’ new business needs โ†’ new data โ†’ new analysis. It's a self-reinforcing feedback loop.

2.2 Statistics vs Machine Learning vs Data Mining

| Dimension | Statistics | Machine Learning | Data Mining |
| --- | --- | --- | --- |
| Primary Goal | Test hypotheses | Reproduce data-generating process | Extract business value |
| Data Type | Primary, experimental | Any, often large | Secondary, from warehouses |
| Approach | Top-down (confirmative) | Generalization from examples | Bottom-up (explorative) |
| Models | Single reference model | Multiple competing models | Multiple models, chosen by data fit |
| Role | Academic, testing | Prediction & classification | Decision support & strategy |
top-down / confirmative
Start with a hypothesis, then test if data supports it.
bottom-up / explorative
Look at the data first, then see what patterns emerge.
primary data vs secondary data
Primary = collected by you for your purpose (experiment). Secondary = already exists in company databases for other reasons (transaction logs).

2.3 Knowledge Discovery in Databases (KDD) โ€” History

1980s โ€” Database Marketing
Machine learning methods first used beyond computer science, for targeted marketing campaigns on large customer databases.
Early 1990s โ€” KDD Coined
Term "Knowledge Discovery in Databases" invented to describe all methods that find relations/regularities in observed data.
1995 โ€” First KDD Conference (Montreal)
Usama Fayyad formalized the term at the first international conference on KDD.
Post-1995
"Data Mining" used for the component of KDD where learning algorithms are applied. Later becomes a synonym for the whole process.

2.4 Data Mining & Computing โ€” The Intelligence Hierarchy

There is a trade-off: the more information a tool gives you, the harder it is to implement.

1. Query & Reporting — Retrieve and display data. Lowest information capacity. Easiest to implement. Answers "what happened?"
2. Data Retrieval — Extract by pre-specified criteria (e.g. "all customers who bought A AND B"). Criteria are exogenous — decided outside the data.
3. OLAP (Online Analytical Processing) — User tests hypotheses graphically using multidimensional hypercubes. Works with small variable sets. Fails with 100+ variables.
4. Data Mining — Discovers unknown relations automatically. Highest information capacity. Hardest to implement.
โš ๏ธ OLAP vs Data Mining โ€” a common exam question

OLAP tests user hypotheses (user says "let me check if sales drop on Mondays"). Data Mining uncovers unknown relations ("the algorithm found that sales drop on rainy Mondays with football matches"). They are complementary, not competitors โ€” OLAP is often used in the preprocessing stages of data mining.

2.5 Data Mining vs Statistics โ€” Key Differences

Berry and Linoff (1997) distinguish two approaches:

In practice the two are complementary. Confirmative tools can check the discoveries made by exploratory mining.

Statistical criticisms of data mining (historical)

Statisticians once called it "data fishing", "data dredging", "data snooping":

  1. No single theoretical reference model โ€” many models compete. Criticism: always possible to find some model that fits.
  2. The great amount of data may lead to non-existent relations being found.

Modern data mining addresses this by focusing on generalization: when choosing a model, predictive performance is prioritized and complex models are penalized.

2.6 The 7-Phase Data Mining Process

A. Define Objectives — Set clear, measurable business aims. Most critical phase. Determines everything downstream.
B. Select & Organise Data — Identify sources, create data marts, perform data cleansing.
C. Exploratory Analysis — Preliminary analysis similar to OLAP. Detect anomalies, transform variables.
D. Specify Methods — Choose descriptive, predictive, or local methods based on objectives.
E. Data Analysis — Apply algorithms. Use multiple methods to highlight different data aspects.
F. Evaluate & Compare — Compare methods. Choose final model considering time, resources, data quality.
G. Implement Decisions — Integrate model into business process. Generate a decision engine.

Four Implementation Phases (for integrating mining into the company)

  1. Strategic โ€” study business procedure, identify where mining adds value.
  2. Training โ€” run a pilot project, assess results.
  3. Creation โ€” if pilot works, plan full reorganization.
  4. Migration โ€” train users, integrate, evaluate continuously.

2.7 Three Classes of Data Mining Methods

| Class | Also called | Goal | Examples |
| --- | --- | --- | --- |
| Descriptive | Symmetrical / Unsupervised / Indirect | Describe groups of data briefly. Classify into groups not known beforehand. | Association methods, log-linear, graphical models, clustering |
| Predictive | Asymmetrical / Supervised / Direct | Describe one variable in relation to others. Classification or prediction rules. | Neural networks, decision trees, linear regression, logistic regression |
| Local | — | Identify characteristics in subsets of the database. | Association rules, anomaly/outlier detection |

2.8 Organisation of the Data โ€” Pipeline

Why it matters: the way you analyze data depends on how the data is organised. Efficient analysis requires valid organisation. The analyst must be involved when setting up the database.

๐Ÿ“Š Standard analytical pipeline

Data Warehouse โ†’ Data Webhouse (web-adapted) โ†’ Data Mart (thematic) โ†’ Data Matrix (ready for analysis).

data matrix
A tabular structure (rows = objects, columns = attributes) ready for statistical analysis and modelling.
metadata
"Data about data" โ€” information about how the data is organised (what each field means, when it changed, who owns it). Increases reliability and security.

๐Ÿ“™ Week 3 โ€” Data Architecture Deep Dive W3

๐ŸŽฏ Professor said "Database Management System of Data Mining" is an exam topic This entire Week 3 (Warehouse, Mart, Lake, Mesh) is HIGHLY LIKELY to appear. Case-study: "A company has situation X โ€” what architecture do you recommend?"

3.1 Why Data Organisation Matters

Before any mining algorithm runs, data must be accessible, consistent, and complete. Poor organisation creates three failures:

  1. Missing Information โ€” Operational databases (billing, orders) are built for transactions, not analysis. Fields analysts need may simply not exist.
  2. Noise & Inconsistency โ€” Duplicates, spelling variations ("Ltd" vs "Limited"), inconsistent date formats (DD/MM/YYYY vs MM/DD/YYYY). Algorithms treat noise as signal and produce confidently wrong answers.
  3. Inaccessible Silos โ€” Data scattered across CRM, ERP, marketing platform, ticketing system. Analysts spend time finding data, not mining it.
silo (data silo)
A dataset trapped inside one department/system, not shared with others. Eliminating silos means breaking barriers so departments share data as a unified resource.
noise
Random errors or meaningless variations in data. Machine learning algorithms cannot tell noise from signal without help.

3.2 Deep Dive โ€” The Data Warehouse

"A data warehouse is an integrated collection of data about a collection of subjects (units), which is not volatile in time and can support decisions taken by management." โ€” Inmon (1996)

The 4 SIVT Characteristics โ€” MEMORIZE

S — Subject-Oriented: Organised around Customers, Products, Sales — NOT around the billing/invoice applications that created the data.
I — Integrated: Unifies many sources with consistent naming, encoding, units. E.g. "gender": M/F, 1/0, Male/Female → all unified.
V — Non-Volatile: Data is loaded and never modified. When a customer's address changes, a new row is added with a timestamp. Past is preserved.
T — Time-Variant: Typically holds 5–10 years of history (vs 60–90 days in operational systems). Every record has a time dimension.

Data Warehouse Architecture (layers)

1. Source Systems — ERP, CRM, POS, web analytics, IoT sensors.
2. ETL Layer — Extract, Transform, Load. Most time-consuming part.
3. Data Warehouse — Central integrated store.
4. Data Marts — Subject-specific subsets.
5. Analytics Layer — BI tools, OLAP, mining, ML.

Two philosophies for building a warehouse

โฌ‡๏ธ Top-Down (Inmon)

  • Build enterprise warehouse FIRST
  • Derive marts from warehouse after
  • Quality enforced centrally
  • Slow โ€” 1โ€“3 years before first mart
  • Common in regulated industries
  • Risk: high upfront cost

โฌ†๏ธ Bottom-Up (Kimball)

  • Build individual data marts FIRST
  • Join marts into warehouse later
  • Quality enforced per mart
  • Fast โ€” first mart in weeks
  • More popular overall
  • Risk: inconsistency across marts
๐Ÿ›’ Famous Example โ€” Walmart's beer & nappies discovery

Walmart's warehouse stored 2.5 petabytes, processed 1M+ transactions/hour, connected 100,000+ suppliers. Analysts mined the warehouse and found: beer and nappies (diapers) are frequently bought together on Friday evenings โ€” new fathers on emergency supply runs. Walmart moved beer displays next to nappies, and sales of both products rose.

Lesson: No analyst would have thought to test this hypothesis. The warehouse made it possible to discover patterns from billions of transactions. This is the essence of data mining: finding what you were not looking for.

3.3 Deep Dive โ€” Data Marts

A data mart is a thematic, subject-specific subset of a data warehouse โ€” smaller, faster, built for a specific team.

Metaphor: If a warehouse is a large central library, a mart is a specialist reading room stocked with only books relevant to one department.

Three Types of Data Mart

| Type | Source | Pros | Cons |
| --- | --- | --- | --- |
| Dependent (gold standard) | From a central warehouse | Consistent, governed. Changes propagate everywhere. | Requires warehouse to exist first. |
| Independent | Directly from source systems | Fast to set up. | Creates silos. 20 marts = 20 different definitions of "revenue" → boardroom chaos. |
| Hybrid | Both warehouse AND direct sources | Flexible (e.g. clean history + real-time sensors). | More complex, potential inconsistency. |
๐Ÿฆ Example โ€” Insurance company with 4 marts

From one warehouse, 4 dependent marts:

  • Customer Mart โ€” demographics, policy history โ†’ CRM team
  • Claims Mart โ€” claims events, fraud scores โ†’ actuaries
  • Risk Mart โ€” risk scores, exposure โ†’ underwriters
  • Finance Mart โ€” P&L by product line โ†’ CFO's team

All 4 draw from ONE warehouse. Actuaries' claims numbers and Finance's claims costs must match โ€” because they come from the same source.

3.4 Deep Dive โ€” Data Lake

๐Ÿž๏ธ The Lake Metaphor

Imagine a vast lake. Rivers flow in from everywhere โ€” clean mountain streams (structured data), murky runoff (web logs), sediment-laden rivers (unstructured text). Everything coexists. You decide what to fish out, when you need it, using the right equipment.

A data lake stores all data โ€” structured, semi-structured, unstructured, streaming โ€” in its raw, unprocessed form.

What flows into a Data Lake (4 types)

Schema-on-WRITE vs Schema-on-READ

| Concept | Warehouse (Schema-on-Write) | Lake (Schema-on-Read) |
| --- | --- | --- |
| When schema defined | Before data is loaded | At query time, by user |
| Transformation | During ETL, before loading | On demand, when reading |
| Flexibility | Low — fixed at design | High — same data, many uses |
| Data quality | High (enforced at write) | Variable — raw may be messy |
| Speed to set up | Slow to set up, fast queries | Fast to set up, slow queries |
| Who defines meaning | Data engineers, upfront | Data scientists, at analysis time |
schema
The structure of data โ€” what columns exist, what types they are, what values are allowed.
schema-on-write
Structure is decided BEFORE data enters. Strict, clean โ€” but inflexible.
schema-on-read
Structure is decided WHEN you read the data. Flexible โ€” but quality varies.
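A tiny sketch of the two philosophies (field names and records are invented): the warehouse casts and filters data before it is stored, while the lake stores the raw string and leaves parsing to the reader:

```python
import json

# Schema-on-WRITE (warehouse style): shape and type data BEFORE storing it.
def load_into_warehouse(row):
    schema = {"customer_id": int, "amount": float}          # fixed upfront
    # Keep only the schema's fields and cast each to its declared type.
    return {field: cast(row[field]) for field, cast in schema.items()}

clean = load_into_warehouse({"customer_id": "42", "amount": "19.99", "junk": "x"})
print(clean)  # {'customer_id': 42, 'amount': 19.99} — "junk" never enters

# Schema-on-READ (lake style): store the raw event untouched, decide later.
raw_event = '{"customer_id": "42", "amount": "19.99", "junk": "x"}'
stored = raw_event                    # the lake keeps the raw string as-is
at_query_time = json.loads(stored)    # each analyst parses it their own way
print(float(at_query_time["amount"])) # meaning is assigned only at read time
```

Note that the "junk" field survives in the lake: flexibility for future analyses, but also why lakes need governance to avoid becoming swamps.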
โš ๏ธ The Data Swamp Warning

Many organizations built lakes enthusiastically, then abandoned them within two years. The failure mode is the data swamp: massive data with no metadata, no cataloguing, no access controls, no governance. Analysts can't find what they need. Sensitive data sits unprotected. The lake becomes an expensive, unusable swamp.

Lesson: a lake needs as much governance as a warehouse โ€” just applied at read time instead of write time.

๐ŸŽต Example โ€” Spotify's Data Lake

Spotify: 600M+ users, 100B+ streams/day, 4PB+ new data/day. Stores every play, skip, pause, search, playlist event, plus raw audio waveforms, blog descriptions, acoustic features.

Mining this lake โ†’ Spotify Wrapped: every December, each user gets a personalized year summary. In 2023, Wrapped generated more social media engagement than any paid ad campaign Spotify ran. The data lake became marketing gold.

3.5 Deep Dive โ€” Data Mesh

The problem Data Mesh solves

Warehouses, marts, and lakes all assume data is collected centrally by one data team. For large organizations with dozens of business units, this central team becomes a bottleneck: teams queue for weeks, by the time data is ready the question has changed.

The 4 Principles of Data Mesh (Zhamak Dehghani, 2019) โ€” MEMORIZE

1. Domain Ownership — Each business domain (payments, logistics, customers) owns its own data. No central gatekeeper. The people who create the data are responsible for its quality.
2. Data as a Product — Each domain's data has owners, SLAs, documentation, quality standards, versioning — treated exactly like a software product.
3. Self-Serve Infrastructure — A shared platform makes it easy for any domain to build, host, and publish data products independently (pipelines, catalog, access control, monitoring).
4. Federated Governance — Global rules (privacy, security, GDPR compliance) enforced centrally. Domains choose HOW to implement; they cannot opt out. Autonomy + standards.
mesh (why the name)
A network where every node connects to every other node โ€” no central hub. Any team can discover and consume any other team's data product through a shared catalogue, like a marketplace.
SLA (Service Level Agreement)
A formal promise about how reliable a service is (uptime, freshness, response time).
bottleneck
A narrow point where work gets stuck โ€” like a bottle's neck. When one team becomes a bottleneck, everyone else is blocked.

3.6 Case Study โ€” Netflix (uses ALL 4 simultaneously)

Netflix is a perfect teaching example because NO single architecture serves all its needs:

Why can't Netflix pick ONE architecture?

Different tasks have different requirements for freshness, speed, quality, governance, flexibility. No single architecture satisfies all requirements at once. Regulatory financial reporting can't run from a raw lake; millisecond personalization can't query a slow warehouse.

3.7 Architecture Comparison Table

| Dimension | Warehouse | Mart | Lake | Mesh |
| --- | --- | --- | --- | --- |
| Data format | Structured only | Structured only | Any — raw | Any — domain decides |
| Schema | Write-time (strict) | Write-time | Read-time (flexible) | Varies by domain |
| Primary users | Analysts, execs | Dept. analysts | Data scientists | Domain teams |
| Update freq. | Batch (hours/days) | Batch | Near real-time | Domain-managed |
| Scale | TB–PB | GB–TB | PB+ | Unlimited, distributed |
| Governance | Centralised, strict | Centralised | Often weak | Federated |
| Best for | BI & reporting | Dept. analytics | ML, exploration | Large agile orgs |
| Main risk | Expensive, slow | Proliferation | Data swamp | Governance complexity |

3.8 Decision Guide โ€” Which Architecture?

๐Ÿ’ก The Key Insight

These 4 architectures are not mutually exclusive. Every large, mature organization uses ALL 4 simultaneously โ€” assigning different data to the right architecture for its use case. The question is never "which single one?" but "which fits this specific data and need?"

๐Ÿ“• Week 4โ€“5 โ€” Data Characterization & Visualization W4-5

4.1 Data Objects and Attribute Types

A data object (also called an observation) represents an entity (e.g. a customer, a product). Each object is described by attributes (also called covariates, features, or variables).

| Type | Description | Examples |
| --- | --- | --- |
| Nominal | Categorical, NO order | Hair color, Occupation, Country |
| Binary | Nominal with only 2 states | Yes/No, 0/1, Male/Female |
| Ordinal | Categorical WITH meaningful order | Education level, Size (S/M/L), Ratings (★★★) |
| Numeric | Measurable quantity | Temperature, Income, Age |

Numeric sub-types

covariates / features / variables / attributes
All the same thing โ€” the columns in your data table. Each describes one property of an object.

4.2 Basic Statistical Descriptions

Central Tendency โ€” "What is the typical value?"

Mean (arithmetic average)
Sum of all values รท N
Sensitive to outliers โ€” one billionaire in the room pulls the average up.
Median (middle value)
The middle value when data is sorted.
Robust to outliers โ€” preferred for skewed data (e.g. income, house prices).
Mode (most frequent)
The value that appears most often.
Can be multimodal (more than one mode).
Midrange
Average of the max and min: (max + min) / 2.
Rarely used alone — very sensitive to extremes.
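All four measures are one-liners with Python's standard library; a minimal sketch with an invented income list, showing how a single outlier affects each:

```python
from statistics import mean, median, mode

incomes = [30, 32, 35, 35, 40, 1000]   # one "billionaire in the room": 1000

print(mean(incomes))     # 195.33... — dragged up by the single outlier
print(median(incomes))   # 35.0 — robust: middle of the sorted data
print(mode(incomes))     # 35 — the most frequent value
print((max(incomes) + min(incomes)) / 2)  # 515.0 — midrange, outlier-sensitive
```

Notice the mean and midrange are pulled far from the "typical" value of roughly 35, while the median and mode are not.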

Dispersion โ€” "How spread out is the data?"

Five-Number Summary & Boxplot

The five-number summary is: Min, Q1, Median, Q3, Max. It's visualized using a boxplot:
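The summary can be computed in a few lines; a minimal sketch using Tukey's median-of-halves convention for the quartiles (one of several common quartile conventions; the data is invented):

```python
from statistics import median

data = sorted([7, 1, 9, 3, 5, 8, 2, 6, 4])     # -> [1, 2, ..., 9]
lower_half = data[: len(data) // 2]             # values below the median
upper_half = data[(len(data) + 1) // 2 :]       # values above the median

# Min, Q1, Median, Q3, Max
five_num = (min(data), median(lower_half), median(data), median(upper_half), max(data))
print(five_num)                    # (1, 2.5, 5, 7.5, 9)
print(five_num[3] - five_num[1])   # IQR = Q3 - Q1 = 5.0
```

These five numbers are exactly what the box, whiskers, and median line of a boxplot draw.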

outlier
A value far outside the normal range. Can be an error, or can be an interesting anomaly (fraud, a rare event).
skewed data
Data where values aren't symmetric around the average. A few very large values drag the mean away from the median.
robust
Not affected by unusual values. The median is robust; the mean is not.

4.3 Data Visualization โ€” 4 Main Categories

| Technique | Description | Examples |
| --- | --- | --- |
| Pixel-Oriented | Maps attribute values to colored pixels. Maximizes data on screen. | Space-filling curves, Circle segments |
| Geometric Projection | Projects high-dimensional data into lower-dimensional views. | Scatter-plot matrix, Parallel coordinates |
| Icon-Based | Encodes attributes as features of small icons. | Chernoff faces, Stick figures |
| Hierarchical | Partitions dimensions into subspaces, displays nested. | Tree-maps, Worlds-within-Worlds (n-Vision) |
Chernoff face
An icon where face features (eyes, mouth, nose shape) encode different variables. Humans are good at reading faces, so patterns jump out.
parallel coordinates
Each variable gets a vertical axis; each data point becomes a line crossing them. Good for seeing high-dimensional patterns.
tree-map
Nested rectangles whose size represents a value. Good for hierarchical data (e.g. filesystem sizes, market sectors).

4.4 Pixel-Oriented Visualization โ€” IN DEPTH

๐ŸŽฏ Slides 9+ of Week 4/5 โ€” PROFESSOR'S KEY HINT This is the "picture with colors" she mentioned. Exam could ask WHY we use this over simpler charts. Read this section 3 times!

Core concept

A normal line chart can show maybe 100โ€“1,000 data points. A pixel-oriented visualization does something clever: every single pixel on the screen = one data value. So instead of 1,000 points, you can show tens of thousands at once.

Think: each tiny square = one observation.

The ordering problem โ€” Space-Filling Curves

If we place pixels randomly, the picture becomes meaningless. We need a smart ordering that keeps similar values spatially close. The trick: use a space-filling curve โ€” a continuous path that visits every point in a square exactly once, without jumping.

The most famous is the Peano-Hilbert curve. It looks like a twisty, recursive snake that fills a square completely. Its key property: nearby positions on the curve = nearby positions in time/data order. This preserves local structure.

Practical analogy: Take a long time-series list and instead of drawing it as a flat line, "fold" it into a square by bending it along the Hilbert curve. The data stays ordered, but is now packed into a compact square image.
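The "nearby on the curve = nearby in the square" property can be checked in a few lines using the standard Hilbert-curve index-to-coordinate conversion (a compact version of the well-known algorithm; the 8×8 grid size here is an arbitrary choice):

```python
def d2xy(n, d):
    """Map position d along the Hilbert curve to (x, y) in an n x n grid (n a power of 2)."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                      # rotate/flip the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

# Walk the whole curve through an 8x8 grid: 64 cells, each visited once,
# and every consecutive pair of cells is adjacent (Manhattan distance 1).
cells = [d2xy(8, d) for d in range(64)]
assert len(set(cells)) == 64
assert all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
           for a, b in zip(cells, cells[1:]))
```

The two assertions are the whole point: the curve fills the square completely, and values that are neighbors in time stay neighbors on screen.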

Color encoding

(Exact colors depend on the scale used for each variable.)

[Color bar legend: LOW → HIGH]

๐Ÿ’ฐ Finance Example โ€” IBM / Dollar / Dow Jones / Gold HIGHLY EXPECTED

๐ŸŽฏ Professor specifically mentioned this example "Why would I want to use a picture with colors instead of a simpler one?" โ€” She said she couldn't print color, but asked about it. Understanding this example is almost certainly worth exam marks.

5.1 The Dataset

A stock exchange database with 16,350 data items, covering 4 financial variables recorded daily from January 1987 to March 1993:

[Figure: 4 pixel panels — IBM, DOLLAR, DOW JONES, GOLD.US$ — color scale LOW → HIGH]

Approximate reconstruction โ€” 4 panels, each using the Peano-Hilbert curve to show one variable over 6 years of daily data.

For each trading day, a value is recorded for each of the 4 variables. Each panel is a separate Peano-Hilbert visualization of one variable.

5.2 How to interpret the graphic

Why does it look like blocks?

These blocks (patches) correspond to periods of similar market behavior โ€” stable conditions or prolonged trends.

Reading the patterns โ€” the 3 KEY interpretation rules

Rule 1: Large purple area (e.g. IBM panel) = prolonged LOW prices

Why? Combines two properties:

  • Color: purple = low values
  • Hilbert property: nearby pixels = nearby time

So a large connected purple region means MANY consecutive time points ALL with low values โ†’ prolonged low-price period.

Rule 2: Bright green region (e.g. Dow Jones) = strong market period

Same logic reversed:

  • Green = high values
  • A cluster of green pixels = many high values close in time

Implies: a period where the index was consistently high โ†’ strong market.

Rule 3: Multiple panels change color at the same region = possible GLOBAL event

Why? The 4 panels represent 4 different market categories:

  • Stock (IBM)
  • Index (Dow Jones)
  • Currency (Dollar)
  • Commodity (Gold)

If they ALL shift at the same time, different markets moved together โ€” this is consistent with a common external shock: financial crisis, policy change, oil shock, war, geopolitical event.

5.3 WHY use colored panels instead of a simpler chart?

โœ… Model answer to the professor's question

Five key reasons:

  1. Data density โ€” A line chart can show ~1,000 points before it becomes an unreadable mess. The pixel-oriented view shows 16,350 values clearly on one screen, one per pixel.
  2. Temporal locality is preserved โ€” The Peano-Hilbert curve keeps points close in time also close in space. So we can SEE stable periods as colored blocks.
  3. Stable periods become visible โ€” Large same-color areas = prolonged stable market conditions. Hard to see on a crowded line chart.
  4. Shocks and volatility jump out โ€” Sudden color transitions are obvious to the eye.
  5. Cross-variable patterns are visible โ€” With 4 panels side-by-side, simultaneous color shifts across panels reveal global events affecting multiple markets at once โ€” something line charts can't easily show with 4 overlapping lines.

One-sentence summary: Colored pixel panels compress massive time-series into visible structure, revealing stable periods, volatility, and cross-market co-movements that a simple line chart cannot show because of limited screen space and no way to compare 4 variables simultaneously.

5.4 What to look for

๐Ÿ“ Measuring Similarity & Dissimilarity W5

6.1 Why this matters

Many mining algorithms (clustering, nearest-neighbor, recommendation) need to answer: how similar is object A to object B? For that we need a distance measure.

6.2 Two fundamental data structures

6.3 Minkowski Distance โ€” the general formula

Distance between two objects x and y in p dimensions:

D(x, y) = ( |xโ‚ โˆ’ yโ‚|สฐ + |xโ‚‚ โˆ’ yโ‚‚|สฐ + ... + |xโ‚š โˆ’ yโ‚š|สฐ )^(1/h)

Special cases of Minkowski

| h value | Name | Idea | When to use |
| --- | --- | --- | --- |
| h = 1 | Manhattan / City Block | Sum of absolute differences along each axis | Grid-like movement, e.g. streets in Manhattan |
| h = 2 | Euclidean | Straight-line distance (Pythagoras!) | Default for continuous numeric data |
| h → ∞ | Supremum / Chebyshev | Maximum absolute difference across dimensions | When worst-case deviation matters |
๐Ÿง  Remember visually

Two points (1,1) and (4,5):

  • Manhattan: |4โˆ’1| + |5โˆ’1| = 3 + 4 = 7 (walk along streets)
  • Euclidean: โˆš((4โˆ’1)ยฒ + (5โˆ’1)ยฒ) = โˆš(9+16) = โˆš25 = 5 (fly in straight line)
  • Supremum: max(|4โˆ’1|, |5โˆ’1|) = max(3, 4) = 4 (biggest single gap)
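The same three distances, computed in code for the worked example above (a minimal sketch with the standard library only):

```python
import math

def manhattan(x, y):  # h = 1: sum of absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):  # h = 2: straight-line (Pythagorean) distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def supremum(x, y):   # h -> infinity (Chebyshev): largest single gap
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1, 1), (4, 5)
print(manhattan(p, q))  # 7
print(euclidean(p, q))  # 5.0
print(supremum(p, q))   # 4
```
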

6.4 Cosine Similarity โ€” for sparse/text data

Cosine similarity measures the angle between two vectors, not their magnitude. Ideal for high-dimensional sparse data like text documents (where most values are 0).

sim(A, B) = cos(ฮธ) = (A ยท B) / (||A|| ร— ||B||)

Direction-sensitive, not magnitude-sensitive. A 10-word doc and a 1000-word doc about the same topic have high cosine similarity even though their lengths differ hugely.
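A minimal sketch (the word-count vectors are invented) showing that cosine similarity depends only on direction, not length:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

short_doc = [1, 2, 0, 1]      # word counts over a tiny 4-word vocabulary
long_doc = [10, 20, 0, 10]    # same topic, 10x the length

print(round(cosine(short_doc, long_doc), 6))  # 1.0 — identical direction
print(cosine([1, 0], [0, 1]))                 # 0.0 — orthogonal, "unrelated"
```

The second vector is just the first scaled by 10, so the angle between them is zero and the similarity is maximal; orthogonal vectors score 0.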

sparse data
Data with mostly zero values. Text documents are sparse โ€” most words from the vocabulary don't appear in any given document.
vector
An ordered list of numbers. Each object can be represented as a vector (one number per attribute).
orthogonal
At a right angle (90ยฐ). In similarity terms, "completely unrelated".

6.5 Quick-reference summary

| Measure | Best For | Notes |
| --- | --- | --- |
| Euclidean | Continuous numeric data | Default choice, uses magnitude |
| Manhattan | Grid-like or when axes differ | Less sensitive to outliers than Euclidean |
| Cosine | Text, sparse high-dim data | Ignores magnitude, uses angle |
| Supremum | Worst-case scenarios | Only largest difference matters |

๐Ÿ“– English Vocabulary Glossary TERMS

Key English words you might encounter in the exam questions. Memorize these meanings.

Algorithm
A step-by-step procedure to solve a problem (a recipe for a computer).
Anomaly / Outlier
An unusual value that doesn't fit the normal pattern.
Association rule
A pattern "if A, then B often happens". Example: {diapers} โ†’ {beer}.
Attribute
A property of an object (a column in a data table). Same as feature, variable, covariate.
Bottleneck
A narrow point where work gets blocked. "Central data team is a bottleneck" = everyone waits for them.
Case study
A detailed real-world example used to illustrate a concept. Your exam uses this format.
Classification
Assigning items to predefined categories (spam / not-spam).
Clustering
Grouping similar items together WITHOUT predefined labels.
Concept drift
When patterns in data change over time (e.g., consumer behavior shifted during COVID).
Confirmative
Top-down approach: test a known hypothesis.
Correlation
A statistical relationship between two variables.
Curse of dimensionality
Problem: as you add more variables, data becomes sparse and analysis harder.
Data Lake
Raw data storage, any format, schema-on-read.
Data Mart
Thematic subset of a data warehouse, built for one team/purpose.
Data Mesh
Decentralized architecture: domains own and publish their data as products.
Data Warehouse
Integrated, subject-oriented, non-volatile, time-variant store for analytics (SIVT).
Data matrix
Rows = objects, columns = attributes. Standard format for analysis.
Data swamp
A data lake that has become unusable due to no governance.
Descriptive mining
Finds unknown structure (unsupervised). Clustering, association, anomaly detection.
Dimensionality reduction
Reducing the number of variables while keeping information.
Dispersion
How spread out the data is (variance, standard deviation, IQR).
ETL
Extract, Transform, Load โ€” the pipeline from source systems into a warehouse.
Exogenous
Coming from OUTSIDE. "Exogenous criteria" = decided by the analyst, not the data.
Explorative
Bottom-up approach: search data for unknown patterns.
Federated governance
Central rules enforced, but local autonomy in HOW to follow them.
Generalization
A model's ability to work on new, unseen data (not just the training data).
Hierarchy
A ranked order from lowest to highest.
Hypothesis
A proposed explanation to be tested. Plural: hypotheses.
Inmon
William Inmon, "father of the data warehouse", defined SIVT (1996).
IQR
Interquartile Range = Q3 โˆ’ Q1. Middle 50% of data.
KDD
Knowledge Discovery in Databases. The 5-step process from data to insight.
Kimball
Ralph Kimball, bottom-up warehouse approach: build marts first, then combine.
Local methods
Methods focused on subsets of data (e.g., association rules, outlier detection).
Metadata
Data about data. Describes how fields are defined, when they changed, etc.
Non-volatile
Once written, not changed. New versions are added, not overwritten.
Noise
Random errors, duplicates, inconsistencies in data.
OLAP
Online Analytical Processing. User tests hypotheses via multidimensional cubes.
Orthogonal
At 90ยฐ. In similarity, "completely unrelated".
Patch
A connected region (like a patch on clothing). In pixel viz: same-color block.
Peano-Hilbert curve
A space-filling curve that visits every cell of a grid while keeping nearby-in-time points nearby in space.
Pipeline
A series of processing stages where output of one stage is input of the next.
Predictive mining
Predicts a known target variable (supervised). Classification, regression, time series.
Quartile
A value that divides sorted data into quarters. Q1=25%, Q2=median, Q3=75%.
Regression
Predicting a continuous numeric value (e.g., price).
Robust
Not affected by unusual values or small errors.
Schema
Structure of data (columns, types, rules).
Silo
Isolated data trapped in one system/department.
SIVT
Subject-oriented, Integrated, Non-Volatile, Time-Variant. The 4 data warehouse characteristics.
SLA
Service Level Agreement โ€” a formal promise of reliability.
Sparse data
Mostly zeros. Text document vectors are sparse.
Supervised / Unsupervised
Supervised = has labels/target. Unsupervised = no labels, find structure.
Symmetric / Asymmetric method
Symmetric = descriptive (unsupervised). Asymmetric = predictive (supervised, has a target variable).
Time-series
Data ordered in time (stock prices, sensor readings).
Virtuous circle of knowledge
Berry & Linoff (1997): mining โ†’ decision โ†’ new data needs โ†’ more mining.
Volatility
How much values change / jump around. Stock volatility = price swings.

โœ๏ธ Practice Exam โ€” Case-Study Questions PRACTICE

๐Ÿ“ How to use this section

Below are 18 practice questions in the format your professor uses — classic (not MCQ), case-study style. Try to answer each one in your head (or on paper) BEFORE reading the model answer below it. This is the most important part of your preparation.

[HIGH PRIORITY · KDD] Q1. Define Data Mining and explain the 5 steps of the KDD process. For each step give one real example of what could go wrong.

Definition: Data Mining is the process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results for the owner of the database. It is also known as Knowledge Discovery in Databases (KDD).

The 5 KDD steps:

  1. Data Selection โ€” identify relevant data sources and extract the target dataset.
    What can go wrong: picking the wrong date range, e.g. using only 2020 COVID-year data to predict "normal" customer behavior.
  2. Preprocessing โ€” handle missing values, noise, inconsistencies.
    What can go wrong: failing to standardize "M"/"Male"/"male" โ†’ the algorithm treats them as 3 different categories.
  3. Transformation โ€” feature engineering, normalization, dimensionality reduction.
    What can go wrong: not scaling variables so one feature (e.g. income in thousands) dominates another (age 0โ€“100).
  4. Data Mining โ€” apply algorithms to discover patterns.
    What can go wrong: choosing a neural network for a small dataset when a decision tree would give an interpretable answer.
  5. Interpretation / Evaluation โ€” evaluate, visualize, communicate findings.
    What can go wrong: presenting a correlation as causation, or reporting an accuracy that hides poor performance on rare classes.

Key point: each stage feeds the next โ€” data quality at early stages directly affects the value of the final knowledge. (Garbage in, garbage out.)
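Step 2's standardization problem can be fixed with a simple normalization map. A hedged sketch (the mapping and helper name are invented for illustration):

```python
# Unify inconsistent category labels so "M" / "Male" / "male " all become
# ONE category instead of three different ones.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def clean_gender(value):
    # Normalize case and whitespace, then look up the canonical label.
    return GENDER_MAP.get(str(value).strip().lower(), "UNKNOWN")

raw = ["M", "Male", "male ", "F", "fem"]
print([clean_gender(v) for v in raw])  # ['M', 'M', 'M', 'F', 'UNKNOWN']
```

Unmapped values are flagged rather than silently kept, so preprocessing surfaces the dirty records instead of letting the algorithm treat them as extra categories.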

[HIGH PRIORITY · ARCHITECTURE] Q2. CASE STUDY: A mid-sized retail bank has: a 10-year Oracle data warehouse with clean account data, 8 million unstructured customer support email threads stored as text files, real-time ATM sensor logs, and 4 teams all queuing for one 6-person data engineering team. Recommend an architecture strategy.

Step 1 โ€” Assign each data type to the right architecture:

  • Account/transaction data โ†’ Keep the Data Warehouse. Structured, governance-critical for regulatory reporting (Basel III, GDPR). Don't throw away 10 years of clean data.
  • 8M customer support emails โ†’ Data Lake. Unstructured text, schema-on-read, needs NLP and exploratory analysis (sentiment, topic modelling, fraud language detection).
  • Real-time ATM sensor logs โ†’ Data Lake with streaming ingestion (e.g. Apache Kafka). Real-time patterns require stream processing, not batch SQL. Supports fraud detection and predictive maintenance.
  • Bottleneck of 4 teams fighting 1 data team โ†’ Move toward Data Mesh principles: each team (retail, wealth, fraud, credit risk) becomes a domain owning its own data product, with a shared self-serve platform.

Step 2 โ€” Add Data Marts for the 4 teams: dependent marts built from the warehouse give each team focused, fast access without exposing the whole warehouse.

Step 3 โ€” The governance caveat: Banks are heavily regulated. Data Mesh principle 4 (federated governance) is non-negotiable here. Central standards on privacy, access control, and regulatory compliance must be enforced even while domains own their pipelines.

Final recommendation: A hybrid architecture using all 4 patterns simultaneously โ€” warehouse for structured governed data, lake for unstructured/streaming, marts for team-level analytics, mesh principles to remove the bottleneck under strict federated governance.

[HIGH PRIORITY · FINANCE VIZ] Q3. A finance lecturer shows you 4 colored pixel panels displaying IBM stock, Dollar rate, Dow Jones, and Gold, over 6 years. Why is this visualization preferred over a simpler line chart? What could you learn from it?

Why colored pixel panels beat a simple line chart โ€” 5 reasons:

  1. Data density: The dataset has 16,350 daily values. A line chart becomes unreadable beyond ~1,000 points; a pixel-oriented view shows every value โ€” one pixel per observation.
  2. Temporal locality preserved: The Peano-Hilbert curve places nearby-in-time pixels nearby in space, so stable periods appear as colored blocks that the eye can instantly see.
  3. Pattern visibility: Prolonged low/high periods become large single-color patches (purple = low, green = high). Sudden color shifts highlight volatility and shocks.
  4. Cross-variable comparison: 4 panels side-by-side let you spot simultaneous color shifts โ€” something a line chart with 4 overlapping lines hides in clutter.
  5. Global-event detection: When Stock + Index + Currency + Commodity all change color at the same region, it signals a systemic shock (financial crisis, oil shock, policy change, war).

What you could learn from it:

  • Periods of market stability (large same-color patches)
  • Bull markets (green clusters in Dow Jones)
  • Bear markets or low-price periods (purple clusters in IBM or Dow)
  • Correlations between markets โ€” do IBM and Dow move together? Does Gold rise when Dollar falls?
  • Global shocks โ€” 1987 Black Monday or the early 1990s recession could appear as synchronized color transitions across all 4 panels

One-line summary: Simple line charts cannot show 16,000+ points or compare 4 variables visually at once; pixel panels with the Hilbert curve compress this into a visual form the human eye can read in seconds.
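The "nearby in time stays nearby in space" property can be sketched in code. Below is the standard iterative Hilbert-curve mapping from a 1-D index (e.g. day number along the 6-year series) to an (x, y) pixel position; treating it as the Peano-Hilbert ordering used by the panels is a simplifying assumption for illustration.

```python
# Map 1-D index d to (x, y) on an n-by-n Hilbert curve (n a power
# of two). Consecutive indices always land on adjacent cells, which
# is why a stable market period forms a solid same-color block.
def d2xy(n, d):
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate this quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# The first four steps trace the U-shaped base cell of the curve:
print([d2xy(2, d) for d in range(4)])   # [(0, 0), (0, 1), (1, 1), (1, 0)]
```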

[W2] Q4. Explain the differences between Statistics, Machine Learning, and Data Mining. Why are they said to be "complementary"?

Core differences:

| Aspect | Statistics | Machine Learning | Data Mining |
|---|---|---|---|
| Primary goal | Test hypotheses, explain phenomena | Reproduce the data-generating process; predict new cases | Extract business value from data |
| Approach | Top-down (confirmative) | Generalization from examples | Bottom-up (explorative) |
| Data | Primary, experimental, curated | Any, often large-scale | Secondary, observational, from warehouses |
| Model use | Single reference model guided by theory | Multiple competing models | Multiple models, chosen by data fit |
| Domain | Academic research, testing | Prediction, classification | Decision support, strategy |

Why complementary:

  • Statistics provides the theoretical grounding and rigor for methods data mining uses.
  • Machine learning provides many of the algorithms (neural nets, decision trees) that data mining applies.
  • Data mining expands scope and application โ€” turning them toward business advantage.
  • Explorative bottom-up mining can generate hypotheses; confirmative top-down statistics can then test them. Each enhances the other.

Modern data mining explicitly borrows from both: it uses ML algorithms but applies statistical generalization principles (penalizing overly complex models) to avoid the historical critique of "data fishing".

[HIGH PRIORITY · WAREHOUSE] Q5. Define a data warehouse. Explain Inmon's 4 SIVT characteristics with an example of each from a bank context.

Definition (Inmon, 1996): A data warehouse is an integrated collection of data about a collection of subjects, which is not volatile in time and can support decisions taken by management.

The 4 SIVT characteristics:

  1. Subject-Oriented (S)
    Organised around business subjects (Customer, Product, Transaction), NOT around the applications that created the data.
    Bank example: The billing application is organised around invoices. The warehouse reorganises this around Customers โ€” because that's what analysts study.
  2. Integrated (I)
    Unifies data from many sources using consistent naming, encoding, and measurement conventions.
    Bank example: The retail system codes gender as M/F, the mortgage system uses 1/0, legacy systems use "Male"/"Female". The warehouse unifies all three into one consistent encoding.
  3. Non-Volatile (V)
    Data is loaded once and never modified. Changes are recorded as new rows with timestamps.
    Bank example: When a customer moves from London to Manchester, the operational CRM updates the record. The warehouse adds a new row with the new address, preserving the old address with its timestamp โ€” so you can always see where the customer lived on any past date.
  4. Time-Variant (T)
    Stores 5โ€“10 years of history (operational DBs only keep 60โ€“90 days). Every record has a time dimension.
    Bank example: The warehouse lets analysts compare Q3 2023 mortgage performance with Q3 2015 โ€” trend analysis, seasonality, long-range forecasting. Operational systems have thrown that data away.

Why these properties matter: Together SIVT makes a warehouse trustworthy, consistent, and historically complete โ€” the three prerequisites for reliable analytics and regulatory reporting.

[DATA MART] Q6. Distinguish between Dependent, Independent, and Hybrid data marts. An insurance company has independent marts for claims, underwriting, and finance — and board meetings always end in arguments over numbers. Explain why.

The 3 types:

  • Dependent mart โ€” derived directly from a central data warehouse. The warehouse is the single source of truth. Changes in business rules propagate automatically. This is the gold standard.
  • Independent mart โ€” built directly from source systems, bypassing the warehouse. Faster to set up but creates its own silo with its own ETL, its own definitions, and its own quality problems.
  • Hybrid mart โ€” pulls from both warehouse and source systems. Adds flexibility (e.g., clean history + real-time streaming) but increases complexity and inconsistency risk.

Why the board arguments happen:

With independent marts, each team built its own extraction from source systems with its own interpretation of the business rules. So:

  • The Claims mart calculates "total claims paid" one way (e.g., gross of reinsurance).
  • The Underwriting mart calculates it differently (net of reinsurance recoveries).
  • The Finance mart calculates it yet a third way (including loss-adjustment expenses).

When the CFO asks "What were our claims costs last quarter?" she gets 3 different numbers from 3 marts, all "correct" by their own definitions. The boardroom debates whose number to believe instead of making decisions.

The fix: Convert to dependent marts derived from a single enterprise data warehouse. Define "claims cost" once in the warehouse; let all marts inherit that definition. The numbers then agree by construction.

[DATA LAKE] Q7. What is a Data Lake? Explain "schema-on-read" vs "schema-on-write". What is a "data swamp" and how do you prevent it?

Data Lake: a centralized repository that stores all structured, semi-structured, unstructured, and streaming data at any scale in its raw, unprocessed format.

Schema-on-Write (Warehouse) โ€” the data's structure is defined before it is loaded. ETL cleans and standardizes. Data quality is enforced at write time. Rigid but trustworthy.

Schema-on-Read (Lake) โ€” data is stored raw. Structure is applied at query time by whoever reads it. The same raw data can be interpreted many ways for different use cases. Flexible but quality varies.
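The contrast fits in a few lines of stdlib Python. A minimal sketch (the JSON records and field names are invented): the lake stores raw strings untouched, and a consumer imposes its own structure only at read time.

```python
import json

# Raw records land in the lake as-is: no upfront schema, even though
# the two records do not share the same fields.
raw_lake = [
    '{"id": 1, "text": "card blocked", "channel": "email"}',
    '{"id": 2, "text": "great service!", "ts": "2024-01-05"}',
]

# Schema-on-read: this consumer decides, at query time, which fields
# it needs and how to handle the ones that are missing.
complaints = [
    {"id": r["id"], "channel": r.get("channel", "unknown")}
    for r in map(json.loads, raw_lake)
]
print(complaints)
```

A schema-on-write warehouse would instead have rejected or cleaned the second record at load time, before any query ran.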

The Data Swamp Problem: A data swamp is a lake that has ingested vast amounts of data with no metadata, no cataloguing, no access controls, no governance. The symptoms:

  • Analysts cannot find what they need
  • Nobody knows which version of a dataset is current
  • Sensitive personal data sits unprotected
  • The lake becomes an expensive, unusable mess

Prevention โ€” 4 things:

  1. Metadata & data catalog โ€” every dataset must be documented (what it is, where it came from, who owns it). Tools: Apache Atlas, AWS Glue Data Catalog, Unity Catalog.
  2. Access controls โ€” role-based permissions, especially for personal/sensitive data.
  3. Quality monitoring โ€” automated checks for freshness, completeness, schema drift.
  4. Data governance policies โ€” lifecycle rules, retention, privacy compliance (GDPR).

Key insight: A lake needs as much governance as a warehouse. The governance is just applied at read time instead of write time. Skipping governance is the root cause of every data swamp.

[DATA MESH] Q8. A global tech company has 40 business units, each waiting weeks for a 15-person central data team to build pipelines. Is Data Mesh the answer? Explain the 4 principles and one risk.

Short answer: Yes, Data Mesh directly addresses this bottleneck โ€” provided the company invests in a shared platform and governance.

The 4 Principles of Data Mesh (Zhamak Dehghani, 2019):

  1. Domain Ownership โ€” each business domain (payments, logistics, customers) owns, manages, and is accountable for its own data. The people who create the data (and understand it best) should own it. No more central gatekeeper.
  2. Data as a Product โ€” each domain's data is treated like a software product: clear owners, SLAs for uptime and freshness, documentation, quality standards, versioning. Designed with consumers (other teams) in mind.
  3. Self-Serve Data Infrastructure โ€” a shared platform abstracts complexity away. Templates for pipelines, cataloguing tools, access control, quality monitoring, compute infrastructure. Enables autonomy without chaos.
  4. Federated Computational Governance โ€” global standards (privacy, security, GDPR, compliance) defined and enforced at enterprise level. Domains choose HOW to implement but cannot opt out. Autonomy WITH shared rules.

Why it fits this case: 40 units queuing for a 15-person team is a textbook bottleneck. With mesh, each unit publishes its own data products through a shared catalogue. Any team can discover and consume any other team's data โ€” like a marketplace, no central hub.

The main risk โ€” governance complexity: Decentralization without standards creates anarchy. If each of 40 units defines "active customer" differently, the company ends up with the same "multiple versions of truth" problem that independent marts create. Principle 4 (federated governance) is non-negotiable. For regulated industries (banking, healthcare), this is especially critical โ€” you cannot outsource GDPR or HIPAA compliance to domain teams.

Other risks: cultural change (domain teams need new skills), upfront platform investment, possible duplication of pipelines across domains.

[HIERARCHY] Q9. Explain the "intelligence hierarchy" — Query & Reporting, Data Retrieval, OLAP, Data Mining. Why is Data Mining at the top? How are OLAP and Data Mining related?

The hierarchy (lowest to highest information capacity):

| Tool | What it does | Capacity | Difficulty |
|---|---|---|---|
| Query & Reporting | Retrieve and display data in tables/reports | Lowest | Easiest |
| Data Retrieval | Extract by pre-specified criteria ("all customers who bought A and B") | Low–Medium | Easy |
| OLAP | User tests hypotheses via multidimensional hypercubes | Medium | Medium |
| Data Mining | Discovers unknown relations automatically | Highest | Hardest |

Why Data Mining is at the top: The other tools answer questions you already know to ask. Data Mining discovers patterns you didn't know to look for. It brings together all variables in different ways, finding unknown relations. That's the highest information capacity โ€” but also the hardest to implement, requiring skilled teams and high-quality data.

The trade-off: there is an inverse relationship between information capacity and ease of implementation. A simple query is quick but tells you only "what happened"; data mining takes months/years but reveals "what patterns exist that we never suspected".

OLAP vs Data Mining โ€” complementary, not competing:

  • OLAP tests user-driven hypotheses ("let me check sales by region by quarter"). The user asks the questions.
  • Data Mining uncovers unknown relations automatically. The data reveals the questions.
  • OLAP is often used in the preprocessing stages of data mining โ€” to understand the data, identify special cases, spot principal interrelations before applying mining algorithms.
  • Used together they create useful synergies โ€” OLAP scopes what to mine, mining reveals what OLAP should investigate further.

Important detail: OLAP fails with tens or hundreds of variables โ€” the hypothesis space becomes too complex for a human to navigate. That's where data mining's automation becomes essential.

[MINING TASKS] Q10. Distinguish between predictive and descriptive data mining. Give 2 real-world examples of each. Where does "association rules" fit, and why is the beer-and-nappies story famous?

Predictive (Asymmetrical / Supervised / Direct):
Describes one or more target variables in relation to others. There is a known answer to predict.

  • Classification โ€” categorize into known classes. Example: Gmail classifying emails as spam or inbox.
  • Regression โ€” predict a continuous value. Example: Zillow estimating house prices from features.
  • Time-series forecasting โ€” predict future from past. Example: forecasting next month's electricity demand.

Descriptive (Symmetrical / Unsupervised / Indirect):
Summarizes groups of data concisely. Observations are classified into groups not known beforehand; variables are connected through association or graphical models.

  • Clustering โ€” group similar items without labels. Example: a marketing team clustering customers into personas (bargain hunters, loyalists, premium buyers).
  • Association rules โ€” find "if A, then B" co-occurrence patterns. Example: Netflix viewers who watched Stranger Things also watched Dark.
  • Anomaly detection โ€” identify unusual observations. Example: credit card fraud detection.

Where "association rules" fit: They are a local method โ€” a descriptive technique that identifies characteristics in subsets of the database (not the whole dataset). Giudici classifies them as local, alongside outlier detection.

Why the beer-and-nappies story is famous:

  • Walmart mined billions of transactions from their 2.5 petabyte warehouse.
  • Discovered: beer + nappies bought together on Friday evenings.
  • Investigation revealed new fathers on "emergency supply runs".
  • Walmart moved beer displays next to nappies โ†’ sales of both rose.

The lesson: No analyst would have thought to test the beer-nappy hypothesis. The unsupervised association-rule mining found a pattern nobody was looking for. This is the essence of data mining: finding what you were not looking for. It's the canonical example of association-rule value and defines descriptive mining in action.

[ATTRIBUTES] Q11. A hospital dataset has: Patient ID, Gender, Age, Blood Type, Satisfaction Rating (1–5), Temperature, Diabetic (yes/no), Postcode, Number of Visits. Classify each attribute type and justify.

| Attribute | Type | Sub-type | Why |
|---|---|---|---|
| Patient ID | Nominal | — | Category with no order (just a label); arithmetic on IDs is meaningless |
| Gender | Binary (Nominal) | — | Only 2 states (M/F); subset of nominal |
| Age | Numeric | Continuous (or Discrete in years) | Measurable quantity; can be compared and averaged |
| Blood Type | Nominal | — | Categories (A, B, AB, O) with no meaningful order |
| Satisfaction Rating (1–5) | Ordinal | — | Has meaningful order (5 is better than 1), but gaps between levels may not be equal |
| Temperature | Numeric | Continuous | Real-valued, measurable on a continuous scale |
| Diabetic (yes/no) | Binary (Nominal) | — | Only 2 states |
| Postcode | Nominal | Discrete (if numeric) | LOOKS like a number, but order/arithmetic is meaningless — it's a categorical label |
| Number of Visits | Numeric | Discrete | Countable (0, 1, 2, ...); cannot have "2.7 visits" |

Key lesson: Just because something LOOKS like a number doesn't mean it's numeric. Postcodes, Zip codes, Student IDs are stored as numbers but are nominal โ€” no order, no arithmetic. Asking a mining algorithm to "average postcodes" is nonsense. Always ask: does order matter? does distance make sense? If not, it's nominal (or ordinal at best).

[STATISTICS] Q12. For the income dataset {20k, 22k, 24k, 25k, 26k, 28k, 30k, 1M}, compute mean, median, mode, and IQR. Which central-tendency measure best describes "typical income"? Why?

Sorted data: 20, 22, 24, 25, 26, 28, 30, 1000 (all in $1000s).

Mean: (20 + 22 + 24 + 25 + 26 + 28 + 30 + 1000) / 8 = 1175 / 8 โ‰ˆ $146.9k

Median: With 8 values, the median is the average of the 4th and 5th sorted values: (25 + 26) / 2 = $25.5k

Mode: No value repeats, so technically no mode (or every value is a mode, depending on convention).

Quartiles and IQR:

  • Q1 (25th percentile) = median of lower half {20, 22, 24, 25} = (22 + 24) / 2 = 23
  • Q3 (75th percentile) = median of upper half {26, 28, 30, 1000} = (28 + 30) / 2 = 29
  • IQR = Q3 โˆ’ Q1 = 29 โˆ’ 23 = 6 (i.e., $6k)

Best measure of "typical income" โ€” MEDIAN ($25.5k):

  • The mean ($146.9k) is badly distorted by the $1M outlier โ€” no one in this group actually earns anywhere near $146.9k.
  • The median is robust to outliers: removing or changing that $1M value doesn't move the median at all.
  • The IQR (6) also shows that the middle 50% of the data is tightly clustered between $23k and $29k โ€” the outlier is a single extreme value, not representative.

General rule: for skewed data (income, house prices, company revenues), always prefer the median over the mean. It's why governments report "median household income", not "mean".
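The whole computation fits in a few lines of stdlib Python, using the same median-of-halves convention for quartiles as the worked answer:

```python
import statistics

data = sorted([20, 22, 24, 25, 26, 28, 30, 1000])   # in $1000s

mean = statistics.mean(data)       # 146.875 -> badly distorted by the outlier
median = statistics.median(data)   # 25.5    -> robust, "typical income"
q1 = statistics.median(data[:4])   # 23  (median of lower half)
q3 = statistics.median(data[4:])   # 29  (median of upper half)
iqr = q3 - q1                      # 6
print(mean, median, q1, q3, iqr)
```

Swap the 1000 for, say, 32 and rerun: the mean collapses while the median barely moves, which is the robustness argument in one experiment.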

[SIMILARITY] Q13. Two customers are described by (Age, Income, PurchaseFrequency). Customer A = (25, 40, 12). Customer B = (30, 55, 18). Compute Euclidean and Manhattan distance. Why might Cosine similarity be inappropriate here?

Differences per dimension:

  • |25 โˆ’ 30| = 5 (Age)
  • |40 โˆ’ 55| = 15 (Income)
  • |12 โˆ’ 18| = 6 (Purchase Frequency)

Manhattan distance (h = 1):
D = 5 + 15 + 6 = 26

Euclidean distance (h = 2):
D = โˆš(5ยฒ + 15ยฒ + 6ยฒ) = โˆš(25 + 225 + 36) = โˆš286 โ‰ˆ 16.91

Why Cosine similarity is inappropriate here:

  • Cosine measures the angle between vectors, ignoring magnitude. Two customers with identical purchasing patterns at different scales (e.g., one is 2ร— the other on every feature) would have cosine similarity 1 โ€” deemed "identical".
  • But in customer analytics, magnitude matters a lot: a 25-year-old earning $40k with 12 purchases is genuinely different from a 50-year-old earning $80k with 24 purchases, even though they have the same "shape".
  • Cosine is designed for sparse, high-dimensional data where direction matters more than magnitude โ€” most typically text documents, where doc length varies wildly but topic/direction is what counts.
  • For dense, low-dimensional numeric data like customer attributes, use Euclidean or Manhattan.

One more caveat โ€” scaling: Notice that Income (difference of 15) dominates the Euclidean distance because it has a larger numeric range than Age (difference of 5). In practice, before computing distance, normalize or standardize the variables so no single attribute dominates. This is part of KDD step 3 (Transformation).

[7-PHASE PROCESS] Q14. A retail chain wants to use data mining to reduce customer churn. Walk through Giudici's 7-phase data mining process for this specific scenario.
  1. A โ€” Define Objectives: Aim = reduce monthly churn by 15% in 6 months. Specifically, identify customers at high risk of churning 30โ€“60 days before they leave, so marketing can intervene. This is the MOST critical phase โ€” all downstream work depends on it.
  2. B โ€” Select & Organise Data: Pull from internal sources (cheaper, more reliable). Needed: transaction history, loyalty-card data, support call logs, demographics, marketing interactions. Build a Customer data mart from the warehouse. Data cleansing: handle missing emails, standardize phone formats, remove deceased customers.
  3. C โ€” Exploratory Analysis & Transformation: Use OLAP-style analysis: what % churn per store region? per age group? per tenure? Create derived features (recency of last visit, frequency per month, monetary value โ€” the RFM variables). Detect anomalies: customers with $0 spend but 20 visits (probably card error).
  4. D โ€” Specify Methods: This is a predictive task (target = "will churn in next 60 days?"). Candidate methods: decision tree (interpretable, easy to explain), logistic regression (classic baseline), random forest (higher accuracy), neural net (if lots of data). Decision tree might be chosen first for interpretability with marketing team.
  5. E โ€” Data Analysis: Train multiple models on historical data (e.g., 2022โ€“2023). Use cross-validation. Track precision, recall, AUC, and lift at different score thresholds.
  6. F โ€” Evaluate & Compare: Compare models on held-out data. Consider also training time, interpretability, ease of deployment, data quality sensitivity. If two models tie on accuracy, choose the more interpretable one for business adoption.
  7. G โ€” Interpretation & Implementation: Integrate the chosen model into the CRM: each night, score every customer's churn probability. High-scoring customers get a retention offer (discount, loyalty bonus, personal call). Track actual churn reduction versus baseline over the 6-month period. The virtuous circle of knowledge kicks in โ€” new churn data flows in, model is retrained.

Implementation phases (from Section 6.1): Strategic (where does mining add most benefit?) โ†’ Training (pilot on 3 stores) โ†’ Creation (scale to all stores) โ†’ Migration (train users, evaluate continuously).

[VIZ TECHNIQUES] Q15. Name and briefly describe the 4 main categories of data visualization. For each, give one scenario where it is the best choice.
  1. Pixel-Oriented Visualization
    Maps each data value to one colored pixel. Uses space-filling curves (e.g., Peano-Hilbert) to order pixels meaningfully. Maximum data density on a screen.
    Best scenario: visualizing 16,000+ daily financial observations across multiple markets โ€” exactly like the IBM/Dollar/Dow/Gold example.
  2. Geometric Projection
    Projects high-dimensional data into lower-dimensional space. Examples: scatter-plot matrix (pairwise scatterplots of all variables), parallel coordinates (each variable = vertical axis, each point = a line crossing them).
    Best scenario: exploring relationships among 6โ€“12 numeric variables in a customer dataset to find clusters or correlations.
  3. Icon-Based Visualization
    Encodes multiple attributes as features of small icons. Examples: Chernoff faces (face features encode variables), stick figures.
    Best scenario: showing multi-dimensional health patient profiles where humans instinctively read "facial" patterns โ€” a clinician can spot anomalies faster via Chernoff faces than a table of numbers.
  4. Hierarchical Visualization
    Partitions dimensions into subspaces and displays them in nested structures. Examples: tree-maps (nested rectangles sized by value), Worlds-within-Worlds (n-Vision).
    Best scenario: visualizing budget allocations across departments, sub-departments, and line items โ€” each level is a rectangle whose area = spend.

Choosing rule: pixel-oriented for massive time series, geometric projection for relationships among moderate-dimensional numeric data, icon-based for human-pattern-matching, hierarchical for nested/tree-structured data.

[METADATA] Q16. What is metadata? Why is it called "data about data"? Why does a warehouse without metadata fail, even if the warehouse itself is technically perfect?

Metadata = data ABOUT data. It describes the structure, meaning, history, and ownership of the data itself.

Typical metadata answers:

  • How was this variable calculated? (e.g., "monthly_revenue" = sum of completed transactions, excluding refunds, in local currency)
  • When did a field definition change? (e.g., "region" was redefined in 2021 when sales territories were restructured)
  • Which source system did each record come from?
  • Who last modified the record, when, and why?

Why a warehouse without metadata fails:

  • The Library analogy: A warehouse without metadata is like a library without a catalogue. The books exist; nobody can find what they need.
  • The Definition trap: Two teams compute "active customer" โ€” one uses "bought in last 30 days", the other uses "logged in last 30 days". Both put numbers in the warehouse. Without metadata, the CFO sees two conflicting counts with no way to know which is correct. Meetings descend into definition arguments.
  • Historical changes get lost: If the "region" field was redefined in 2021, any trend analysis crossing 2021 is meaningless unless the change is documented in metadata.
  • Compliance failure: GDPR and other regulations require organizations to know where personal data came from and where it's used. Without metadata, you can't answer "the right to be forgotten" requests.

Metadata increases the VALUE of the warehouse because it makes the data reliable, trustworthy, and traceable. Many organizations invest heavily in the warehouse infrastructure but skimp on metadata โ€” and their warehouse becomes a source of constant arguments rather than clear answers.

[COMPARISON] Q17. In one table, compare Data Warehouse, Data Mart, Data Lake, and Data Mesh across: data format, users, governance, main risk. In one sentence per architecture, say when you'd choose it.

| Dimension | Data Warehouse | Data Mart | Data Lake | Data Mesh |
|---|---|---|---|---|
| Data format | Structured only | Structured only | Any — raw | Any — domain decides |
| Primary users | Analysts, executives | Dept. analysts | Data scientists, ML engineers | Domain product teams |
| Governance | Centralised, strict | Centralised | Often weak (swamp risk) | Federated (shared rules) |
| Main risk | Expensive, slow ETL | Proliferation of inconsistent marts | Data swamp without governance | Governance complexity across domains |

When to choose each โ€” one-sentence rule:

  • Warehouse: when you need reliable, governed, consistent reporting for structured data โ€” typically finance, compliance, and executive dashboards.
  • Mart: when a specific team needs fast, focused access to a subset of warehouse data without navigating the full complexity.
  • Lake: when you have large unstructured or semi-structured data that data scientists need raw for ML training, or when you want to store first and decide use later.
  • Mesh: when your organization has many autonomous business units, a central data team has become a bottleneck, and domain teams need to move independently.

The Big Insight: These 4 architectures are not mutually exclusive. Every mature large organization (Netflix, Spotify, big banks) runs ALL 4 simultaneously, assigning each data type to the architecture that best fits its characteristics.

[ALGORITHMS] Q18. Choose an appropriate data-mining algorithm for each scenario and justify: (a) classify email as spam, (b) group supermarket customers into segments, (c) predict tomorrow's electricity demand, (d) find frequently co-purchased product pairs, (e) detect fraudulent credit-card transactions.

(a) Classify email as spam
Task: Predictive โ€” classification (known label: spam / not-spam).
Algorithm: Naive Bayes (classic baseline) or Logistic Regression. For higher accuracy: Random Forest or a Neural Network on text features.
Why: supervised classification with labeled training data; email text gives high-dimensional features; Naive Bayes is famously effective here.
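A toy multinomial Naive Bayes filter, stdlib only, with an invented four-email training set (real filters train on millions of labeled messages); it shows the "count words per class, multiply smoothed probabilities" idea behind the baseline:

```python
import math
from collections import Counter

train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

counts = {"spam": Counter(), "ham": Counter()}   # word counts per class
docs = Counter()                                  # emails per class
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())

vocab = {w for c in counts.values() for w in c}

def score(text, label):
    # log P(label) + sum of log P(word | label), add-one smoothing
    total = sum(counts[label].values())
    s = math.log(docs[label] / sum(docs.values()))
    for w in text.split():
        s += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return s

def classify(text):
    return max(("spam", "ham"), key=lambda lb: score(text, lb))

print(classify("free money"))   # spam
```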

(b) Group supermarket customers into segments
Task: Descriptive โ€” clustering (no predefined labels).
Algorithm: K-Means (most popular).
Why: unsupervised clustering of numeric customer features (RFM โ€” recency, frequency, monetary). Caveat: K-Means is sensitive to the choice of k (number of clusters); use the elbow method or silhouette score to choose.
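A minimal k-means sketch on toy 2-D points (stdlib only; the points and k = 2 are invented — read the axes as, say, months-since-last-visit and visits-per-month):

```python
import math, random

points = [(1, 5), (2, 4), (1, 4),     # recent, frequent shoppers
          (9, 1), (8, 2), (10, 1)]    # lapsed shoppers

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:              # assignment step: nearest centroid
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [                 # update step: mean of each cluster
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

Running it recovers the two obvious segments; on real data you would repeat for several k and pick the elbow or best silhouette, as noted above.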

(c) Predict tomorrow's electricity demand
Task: Predictive โ€” time series forecasting (continuous target, ordered in time).
Algorithm: classical ARIMA, Exponential Smoothing, or modern LSTM neural networks.
Why: time-ordered data with seasonality (daily, weekly, yearly cycles). Needs a time-series-aware method, not standard regression.
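The simplest of these, exponential smoothing, is a one-liner loop (the demand numbers and alpha = 0.5 are invented): the forecast is a weighted average of past demand with recent days weighted most.

```python
# Simple exponential smoothing: level = alpha*new + (1-alpha)*old.
def ses(series, alpha=0.5):
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level   # one-step-ahead forecast

demand = [100, 104, 98, 102, 110]   # invented daily demand
print(ses(demand))                   # 105.5
```

Real demand forecasting would add the seasonal terms (Holt-Winters) or switch to ARIMA/LSTM, but the recency-weighting idea is the same.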

(d) Find frequently co-purchased product pairs
Task: Descriptive โ€” association rule mining (local method).
Algorithm: Apriori (classic) or FP-Growth (faster on big data).
Why: classic market basket analysis โ€” the beer-and-nappies problem. Apriori finds rules of the form {A, B} โ†’ {C} with support and confidence measures.
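Support and confidence, the two measures Apriori prunes with, computed directly on a toy basket list (invented data echoing the beer-and-nappies story):

```python
baskets = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"milk", "bread"},
    {"beer", "nappies", "milk"},
    {"bread", "crisps"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the share that add the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "nappies"}))        # 0.6
print(confidence({"beer"}, {"nappies"}))   # 1.0
```

Apriori's contribution is efficiency: it only extends itemsets whose subsets already meet a minimum support, instead of scoring every combination as this brute-force sketch would.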

(e) Detect fraudulent credit-card transactions
Task: Can be framed two ways:

  • If you have labeled historical fraud examples โ†’ Predictive classification. Use Random Forest, Gradient Boosting, or Neural Networks. Imbalanced classes require resampling (SMOTE) or cost-sensitive learning.
  • If fraud patterns are novel and unlabeled โ†’ Anomaly / outlier detection (descriptive, local method). Use Isolation Forest or One-Class SVM.
Why: Real systems often combine both โ€” supervised models for known fraud patterns, anomaly detection for novel attacks.
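Isolation Forest itself needs a library, but the unsupervised idea — flag what sits far from the bulk — can be shown with a much simpler z-score stand-in (stdlib only, invented amounts):

```python
import statistics

amounts = [12.5, 9.9, 14.2, 11.0, 13.3, 980.0]   # one suspicious charge
mu = statistics.mean(amounts)
sigma = statistics.stdev(amounts)

# Flag anything more than 2 standard deviations from the mean.
flagged = [a for a in amounts if abs(a - mu) / sigma > 2]
print(flagged)   # [980.0]
```

Note how the outlier inflates both the mean and the standard deviation, which is one reason production systems prefer robust statistics (median/IQR) or tree-based detectors like Isolation Forest.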

Key takeaway โ€” map task type to algorithm family:

  • Predictive + known categories โ†’ Classification (trees, forests, neural nets, logistic regression)
  • Predictive + numeric target โ†’ Regression (linear, trees, neural nets)
  • Predictive + temporal โ†’ Time-series (ARIMA, LSTM)
  • Descriptive + grouping โ†’ Clustering (K-Means, hierarchical)
  • Descriptive + rules โ†’ Association (Apriori)
  • Descriptive + outliers โ†’ Anomaly detection (Isolation Forest)

โœ… Final 30-Minute Checklist LAST HOUR

With ~30 minutes before the exam, review these fast-recall items:

Definitions you should recite
  1. Data Mining = process of selection, exploration, modelling of large data to find unknown regularities, for business advantage.
  2. KDD = 5 steps: Selection โ†’ Preprocessing โ†’ Transformation โ†’ Data Mining โ†’ Interpretation.
  3. Data Warehouse = Subject-oriented, Integrated, Non-Volatile, Time-Variant (SIVT).
  4. Data Mart = thematic subset of a warehouse. 3 types: Dependent (best), Independent (risky), Hybrid.
  5. Data Lake = raw data, any format, schema-on-READ. Risk: data swamp.
  6. Data Mesh = 4 principles: Domain ownership ยท Data-as-product ยท Self-serve platform ยท Federated governance.
Hierarchies to remember
  • Intelligence hierarchy: Query & Reporting โ†’ Data Retrieval โ†’ OLAP โ†’ Data Mining.
  • Data pipeline: Warehouse โ†’ Webhouse โ†’ Mart โ†’ Matrix.
  • 7-Phase DM process: Objectives โ†’ Select & Organise โ†’ Exploratory โ†’ Specify Methods โ†’ Analysis โ†’ Evaluate โ†’ Implement.
  • 4 Implementation phases: Strategic โ†’ Training โ†’ Creation โ†’ Migration.
Lists of four
  • SIVT: Subject, Integrated, Non-Volatile, Time-Variant
  • Attribute types: Nominal, Binary, Ordinal, Numeric
  • Visualization categories: Pixel-oriented, Geometric projection, Icon-based, Hierarchical
  • Data types in a lake: Structured, Semi-structured, Unstructured, Streaming
  • Data Mesh principles: Domain ownership, Data-as-product, Self-serve, Federated governance
  • Finance panels: IBM (stock), Dollar (currency), Dow Jones (index), Gold (commodity)
๐ŸŽฏ The professor's specific question โ€” PRE-WRITE YOUR ANSWER

"Why use a picture with colors instead of a simpler one?"

Memorize: Because the dataset has 16,350 daily observations across 4 variables โ€” a line chart cannot show that density. Pixel-oriented viz + Peano-Hilbert curve keeps temporally close values spatially close, so stable periods become same-color blocks, shocks become sudden color shifts, and simultaneous color changes across all 4 panels reveal global events (crisis, policy, war) that a simple chart cannot show.

๐Ÿ€ Strategy for the 40 minutes of writing

  1. Minute 0โ€“3: Read ALL 5 questions. Note which ones you're most confident on.
  2. Minute 3โ€“35: Answer each question in roughly 6โ€“7 minutes. Write confident ones first.
  3. Minute 35โ€“40: Re-read, fix typos, add missing examples.
  4. For every case-study question: (1) DEFINE the key term, (2) APPLY to the case, (3) GIVE an example, (4) STATE risks/trade-offs.
  5. If stuck: bullet points still earn marks. Don't leave blanks.

๐ŸŽ“ You've got this. Good luck! ๐Ÿ€