Read This First (2 minutes)
5 classic questions (no multiple choice), 50 minutes total, papers collected at 40 min. Case-study style – she gives you a situation, you explain what to do. The topics she said are important: KDD, data architecture (warehouse/mart/lake/mesh), algorithms, correlation & association, and the finance color-panels example.
If you only remember 6 things, remember these:
- KDD = 5 steps: Select → Clean → Transform → Mine → Interpret
- Data Warehouse = SIVT: Subject-oriented, Integrated, non-Volatile, Time-Variant
- 4 architectures: Warehouse (clean), Mart (small, focused), Lake (raw everything), Mesh (every team owns its own)
- Data Mesh 4 principles: Domain ownership, Data-as-a-product, Self-serve, Federated governance
- Predictive (classify/regress → known target) vs Descriptive (cluster/associate → find the unknown)
- Finance colors (pixel panels) beat line charts because they show 16,000+ points at once and let you see patterns across 4 markets
2-Hour Study Plan
| Time | What to do |
|---|---|
| 0:00–0:20 | Read section "Like I'm 3" below – get the big picture |
| 0:20–0:40 | Memorize section "The Big 6" – these are almost guaranteed exam topics |
| 0:40–1:20 | Do ALL Flash Q&A – cover the answer, guess, then check |
| 1:20–1:50 | Read Mini Case Studies – this is the EXACT exam format |
| 1:50–2:00 | Read The Formula + Final Recap – walk in confident |
Every Topic Explained Like You're 3
- Selection – pick what ingredients you want (the data you need)
- Preprocessing – wash the lettuce, throw away bad tomatoes (clean bad data)
- Transformation – cut everything into pieces that fit the bread (format the data)
- Data Mining – put it all together! (run the algorithms)
- Interpretation – taste it, tell your friends how it is (show the results)
Predictive = I KNOW what I'm looking for (a target to predict). "Will this customer churn?"
Descriptive = I DON'T know what I'm looking for. Just group/explore/find patterns. "Put similar customers together. Find things bought together."
Machine Learning = "Learn from examples so you can predict new stuff." (Algorithms)
Data Mining = "Look at all this data, find something useful for the business!" (Bottom-up, no starting guess – and business value is the goal)
- Manhattan – walk along streets, count blocks (sum of absolute differences)
- Euclidean – fly in a straight line (Pythagoras formula) – MOST COMMON
- Cosine – compare DIRECTION, not size (good for text documents)
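The three measures are easy to sanity-check in code. A minimal pure-Python sketch on toy 2-D points:

```python
import math

def manhattan(a, b):
    # Sum of absolute coordinate differences ("walking city blocks")
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance (Pythagoras)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Compares direction only: 1.0 = same direction, 0.0 = orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)
print(manhattan(p, q))                    # 3 + 4 = 7.0
print(euclidean(p, q))                    # sqrt(9 + 16) = 5.0
print(cosine_similarity((1, 0), (2, 0)))  # same direction -> 1.0
```

Note how (1, 0) and (2, 0) have cosine similarity 1.0 even though their magnitudes differ – exactly why cosine suits documents of different lengths.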
OLAP = I ask my own questions and check my own hypotheses. "Do sales drop on Mondays?"
Data Mining = the computer finds things I didn't ask. "Sales drop on rainy Mondays when there's football."
They are FRIENDS, not enemies. OLAP is used BEFORE data mining to explore.
The Big 6 – Memorize These Exactly
These 6 are the MOST likely to appear in the exam. If you memorize them word-for-word, you'll have material for at least 3 of the 5 questions.
1. Definition of Data Mining (Giudici, 2003)
"The process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results for the owner of the database."
2. The 5 KDD Steps
Selection → Preprocessing → Transformation → Data Mining → Interpretation
Memory trick: "Silly People Throw Dirty Ice" (bad but memorable!)
3. Data Warehouse SIVT (Inmon, 1996)
- Subject-Oriented – organized by Customer/Product, not by billing app
- Integrated – same encoding (M/F everywhere, not 1/0 here and Male/Female there)
- Non-Volatile – never changed, just append new rows with timestamps
- Time-Variant – 5–10 years of history with a time dimension on every record
4. Data Mesh 4 Principles (Dehghani, 2019)
- Domain Ownership – teams own their own data
- Data as a Product – SLAs, docs, quality, versioning
- Self-Serve Infrastructure – shared platform for all teams
- Federated Governance – global rules, local implementation
5. 4 Architectures – One Line Each
- Warehouse – structured, clean, governed → use for reports & finance
- Mart – subset of a warehouse → use for a specific team's fast access
- Lake – raw everything, schema-on-read → use for ML & data science
- Mesh – decentralized, domain-owned → use for huge agile organizations
6. Why Colored Pixels Beat Line Charts (Finance Example)
- 16,350 data points – a line chart can only show ~1,000
- The Peano-Hilbert curve keeps time-close points space-close → stable periods become visible blocks
- Green = high, purple = low → the eye reads patterns instantly
- 4 panels side-by-side (IBM, Dollar, Dow, Gold) reveal GLOBAL events when all 4 change color at the same region
- A line chart cannot compare 4 markets visually at once
Flash Q&A – 30 Rapid Questions
Cover each answer with your hand. Try to answer aloud. Then check. If you get it wrong – read once and move on. Come back to it later.
Mini Case Studies – The Exam Format
These are the EXACT format your exam will use. She gives you a situation → you apply concepts → you recommend/explain. Practice these!
Case: A hospital merges patient records from several clinics – "John Smith" vs "J. Smith", mixed date formats, missing fields, duplicates. What do you do?
This is a Preprocessing problem – step 2 of KDD. Steps to take:
- Data cleansing – standardize names using fuzzy matching (identify "John Smith" = "J. Smith")
- Format normalization – convert all dates to one format (ISO: YYYY-MM-DD)
- Handle missing values – either fill with defaults, or drop incomplete records
- De-duplicate – remove or merge duplicate records
- Apply integration rules – like a data warehouse would (subject-oriented view of "Patient")
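A toy sketch of these cleansing steps. The patient records are invented, and the crude surname-plus-birthdate key stands in for real fuzzy matching (which would use edit distance or phonetic keys):

```python
from datetime import datetime

RAW = [
    {"name": "John Smith",  "dob": "12/03/1980"},   # DD/MM/YYYY
    {"name": "J. Smith",    "dob": "1980-03-12"},   # ISO already
    {"name": "Maria Lopez", "dob": "07/01/1975"},
]

def to_iso(date_str):
    # Try each known source format; normalize to ISO YYYY-MM-DD
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable -> flag for manual review

def dedup_key(record):
    # Simplistic matching key: surname + normalized birth date
    surname = record["name"].split()[-1].lower()
    return (surname, to_iso(record["dob"]))

seen, clean = set(), []
for rec in RAW:
    key = dedup_key(rec)
    if key not in seen:
        seen.add(key)
        clean.append({**rec, "dob": to_iso(rec["dob"])})

print(clean)  # "John Smith" and "J. Smith" merge into one patient record
```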
Case: An e-commerce company has structured sales transactions, 50 million unstructured product reviews, a real-time clickstream, and many teams needing reports. Which architecture?
Use ALL the architectures together:
- Data Warehouse for clean sales transactions. Structured, needed for financial reporting, governance-critical. Apply SIVT characteristics.
- Data Lake for the 50 million unstructured product reviews (text data) AND the real-time clickstream. Both are unstructured/semi-structured, need flexibility, for data scientists to mine (sentiment analysis, recommendation).
- Data Marts built from the warehouse for specific teams: Marketing mart, Finance mart, Inventory mart. Dependent marts ensure consistency.
- Consider Data Mesh principles if the company has many autonomous product teams that need to publish their own data products – but apply federated governance.
Case: A bank's compliance, fraud, marketing, and risk teams all query the central warehouse and complain it is slow and overly complex. What do you recommend?
The problem isn't the warehouse – it's that 4 teams are all fighting over the same complex resource. Recommend creating 4 dependent data marts (one per team):
- Compliance Mart – regulatory data, audit trails
- Fraud Mart – transaction patterns, risk scores
- Marketing Mart – customer demographics, campaign data
- Risk Mart – exposure, credit scores
Each team gets FAST, FOCUSED access to only the data they need. Because the marts are dependent (derived from the central warehouse), all teams get consistent numbers. If the warehouse says "revenue = X", all marts inherit this – no "multiple versions of truth" problem.
Case: For the finance dataset, why use the colored pixel panels instead of a simpler line chart?
- Data density: 16,350 daily values – a line chart can only handle ~1,000 points before becoming unreadable. The pixel viz shows all 16,350 at once.
- Temporal locality preserved: The Peano-Hilbert curve puts time-close points spatially close. Stable periods appear as colored blocks the eye instantly sees.
- Pattern visibility: Green clusters = bull markets (high prices). Purple clusters = bear markets (low prices). Hard to see on a crowded line chart.
- Cross-variable visibility: 4 panels side-by-side. A line chart with 4 overlapping lines becomes noise. With panels, you compare at a glance.
- Global event detection: if all 4 markets (stock, currency, index, commodity) change color in the same region → SYSTEMIC SHOCK (e.g. 1987's Black Monday). A line chart cannot reveal this.
Case: A supermarket wants to discover which products customers buy together. What kind of mining is this, and which algorithm?
This is Descriptive mining, specifically Association Rule mining (a local method in Giudici's classification). Algorithm: Apriori (classic) or FP-Growth (faster for big data).
It finds rules of the form {A, B} → {C} with support and confidence measures. "Support" = how often the items appear together. "Confidence" = if A is bought, how likely is C also bought?
Famous example: Walmart discovered {diapers} → {beer} – on Friday evenings, new fathers bought both. Walmart moved beer displays next to diapers; sales of both rose. This is the canonical example of value from association rule mining.
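Support and confidence can be computed directly from a handful of toy baskets; a minimal sketch:

```python
baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"bread", "beer"},
]

def support(itemset):
    # Fraction of baskets containing every item in the itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(A and C together) / support(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # 2 of 4 baskets -> 0.5
print(confidence({"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets -> 0.67
```

Apriori's contribution is not these formulas but pruning: it only counts itemsets whose subsets already met the minimum support.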
Case: A subscription app wants to predict which customers will churn next month. Walk through the full KDD process.
- Selection: Extract relevant data – login frequency, last activity date, features used, support tickets, in-app purchases, demographics.
- Preprocessing: Clean missing values, standardize formats, remove test accounts.
- Transformation: Engineer features like "days since last login", "activity trend last 30 days", normalize numeric features.
- Data Mining: This is Predictive Classification (binary: will churn / won't churn). Try Decision Tree (interpretable), Random Forest (accurate), Logistic Regression (baseline).
- Interpretation: Compare models on held-out data. Deploy best model to CRM. High-risk customers get a retention email.
Monitor after deployment – the virtuous circle of knowledge kicks in as new churn outcomes feed back to retrain the model.
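The steps above can be sketched end-to-end. The customer records and the one-rule "decision stump" below are invented for illustration – a real project would train and compare the tree/forest/regression models named above:

```python
from datetime import date

# Selection + Preprocessing already done: hypothetical labelled customers
TODAY = date(2024, 6, 1)
customers = [
    {"last_login": date(2024, 5, 30), "tickets": 0, "churned": False},
    {"last_login": date(2024, 1, 10), "tickets": 5, "churned": True},
    {"last_login": date(2024, 5, 20), "tickets": 1, "churned": False},
    {"last_login": date(2023, 12, 1), "tickets": 3, "churned": True},
]

# Transformation: engineer the "days since last login" feature
for c in customers:
    c["days_inactive"] = (TODAY - c["last_login"]).days

# Data Mining: a one-rule baseline classifier (decision stump)
def predict_churn(c, threshold=60):
    return c["days_inactive"] > threshold

# Interpretation: evaluate on the labelled data
correct = sum(predict_churn(c) == c["churned"] for c in customers)
print(f"accuracy = {correct}/{len(customers)}")
```

A real evaluation would of course use held-out data, as the Interpretation step says, not the training records themselves.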
The Universal Answer Formula
For ANY case-study question on the exam, use this 4-step formula:
1. DEFINE the key term(s) in the question.
2. APPLY the concept to the case.
3. Give a concrete EXAMPLE.
4. Recommend a FIX and name the TRADE-OFFS.
More exam tactics:
- Use bullet points when listing things – easier to mark, clearer to read.
- Name-drop the authors: Inmon (warehouse), Dehghani (mesh), Berry & Linoff (virtuous circle), Giudici (definition) – shows you read the material.
- Use the exam vocabulary: SIVT, KDD, OLAP, schema-on-read, federated governance, etc. – even if a simple word would work.
- If stuck, list pros/cons: You'll get partial marks for showing you understand the trade-offs.
- Don't leave blanks: A partial answer gets partial marks. An empty answer gets zero.
Sample: Applying the Formula to a Real Question
Question: An insurance company's 4 business teams each built their own independent data mart. The board now gets 4 different answers to "what were our claims last quarter?" Explain what went wrong and how to fix it.
Step 1 – DEFINE: Data marts are thematic, subject-specific subsets of data used by specific business teams. There are 3 types: dependent (from a warehouse), independent (from source systems), and hybrid. This company is using independent marts.
Step 2 – APPLY: Each of the 4 independent marts has its own ETL pipeline and its own interpretation of business rules. So when the board asks "what were our claims last quarter?", each team gives a different answer based on their mart's definitions. This creates the classic "multiple versions of truth" problem.
Step 3 – EXAMPLE: Imagine one team calculates "revenue" as gross sales, another uses net of refunds, a third uses post-commission. All call their number "revenue". Like building 4 different rulers and arguing about who's taller.
Step 4 – FIX + TRADE-OFFS: Convert to dependent marts derived from a central data warehouse. Define "revenue" once in the warehouse; all marts inherit it. Trade-off: slower to build initially, requires warehouse investment. But long-term: single source of truth, consistent board numbers, regulatory compliance. This is the Inmon top-down approach.
Final 15-Minute Recap
Read this as the LAST thing before walking into the exam.
Top-of-mind words to drop in any answer
KDD · SIVT · Subject-oriented · Integrated · Non-Volatile · Time-Variant · Schema-on-read · Schema-on-write · Federated governance · Domain ownership · Data-as-a-product · Dependent mart · Data swamp · OLAP · Peano-Hilbert curve · Virtuous circle of knowledge · Bottom-up explorative · Top-down confirmative · Association rule · Supervised · Unsupervised · Inmon · Dehghani · Giudici · Berry & Linoff
The 5 most likely questions
- Explain KDD / the 5 steps of knowledge discovery
- Compare/recommend among Warehouse, Mart, Lake, Mesh (case study)
- Explain SIVT characteristics of a warehouse
- Why colored pixel panels beat a simple line chart (finance example)
- Predictive vs Descriptive mining + give examples (beer-nappies)
Confusing terms – quick glossary
- Covariate = feature = variable = attribute – all the same thing (a column)
- Supervised = predictive = asymmetrical = direct
- Unsupervised = descriptive = symmetrical = indirect
- Non-volatile = doesn't change. Volatile = changes.
- Exogenous = from outside. Data retrieval uses exogenous criteria (user decides).
- Explorative = bottom-up (data mining). Confirmative = top-down (statistics).
In-exam time plan
- Minute 0–3: Read ALL 5 questions. Start with the easiest to build confidence.
- Minute 3–35: ~6–7 minutes per question. Use the 4-step formula.
- Minute 35–40: Review. Fix typos. Add one more example where you have space.
- If you blank: start defining the key term in the question. Writing unlocks memory.
- If running out of time: switch to bullet points. Partial credit > nothing.
Quick Exam Cheat Sheet – READ FIRST
- 5 classic questions (not MCQ) · 50 minutes · papers collected at 40 min
- Case-study format – she'll give a situation, you explain / decide / apply
- Focus topics she named:
- Data architectures for Data Mining (Warehouse / Mart / Lake / Mesh)
- Algorithms (Decision Tree, Neural Net, K-Means, Apriori, etc.)
- KDD – "is important"
- Correlation and Association
- Finance example – "why would I want to use a picture with colors instead of a simpler one?"
- Slides 9+ of the Week 4/5 deck – she said she won't print the colorful panels (no color printer), but she may still ask about them
- So: understand the colored finance visualization even though you won't see it on paper
The 10 things you must know cold
- KDD = Knowledge Discovery in Databases. 5 steps: Selection → Preprocessing → Transformation → Data Mining → Interpretation.
- Data Mining definition: process of selection, exploration, and modelling of large data to discover unknown regularities, with the aim of producing results useful for the database owner (business advantage).
- Statistics vs Data Mining: Statistics is top-down (confirm hypothesis). Data Mining is bottom-up (discover unknown patterns).
- Data Warehouse SIVT: Subject-oriented, Integrated, Non-Volatile, Time-Variant (Inmon, 1996).
- Data Mart types: Dependent (from a warehouse – best), Independent (from sources – risky silos), Hybrid (both).
- Data Lake: raw data, any format, schema-on-READ. Risk: becomes a "data swamp" without governance.
- Data Mesh 4 principles: Domain ownership, Data-as-a-product, Self-serve platform, Federated governance.
- Intelligence hierarchy: Query & Reporting → Data Retrieval → OLAP → Data Mining (capacity grows, difficulty grows).
- Core mining tasks: Predictive (classification, regression, time-series) and Descriptive (clustering, association rules, anomaly detection).
- Finance pixel viz: 16,350 data points shown via Peano-Hilbert curve. Green = high, purple = low. Cross-panel color sync = global market event.
Spend the first 3 minutes reading ALL 5 questions. Then give each question ~7 minutes. Write in bullet form if time is short โ examiners still give marks. Always DEFINE the key term first, then APPLY it to the case.
Week 1 – Introduction to Data Mining
1.1 What is Data Mining?
Data Mining is the computational process of discovering patterns, correlations, anomalies, and insights in large datasets using statistical, mathematical, and machine learning techniques. It is also called Knowledge Discovery in Databases (KDD).
1.2 The KDD Process (5 steps)
Each step feeds into the next. If your data quality is bad at step 2, your final insights at step 5 will also be bad. ("Garbage in, garbage out.")
1.3 Core Tasks in Data Mining
Predictive Mining
Predicts a known target variable. Also called supervised, asymmetrical, direct.
- Classification – assign to categories (spam vs not-spam)
- Regression – predict a numeric value (house price)
- Time Series – forecast the future from the past (stock price)
Descriptive Mining
Finds unknown structure in data. Also called unsupervised, symmetrical, indirect.
- Clustering – group similar items (no labels)
- Association Rules – co-occurrence (beer + nappies)
- Anomaly Detection – spot outliers (fraud)
Correlation = statistical relationship between two variables (e.g. temperature ↑ → ice cream sales ↑).
Association rule = a pattern "if A, then B often happens too" found in baskets/transactions. Famous example: {diapers} → {beer} (Walmart discovered young fathers bought both on Friday evenings).
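Correlation is usually quantified with the Pearson coefficient; a small pure-Python sketch with made-up numbers:

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of spreads
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

temperature = [20, 25, 30, 35]
ice_cream_sales = [100, 150, 200, 250]  # rises in lockstep with temperature
print(pearson(temperature, ice_cream_sales))  # perfect positive -> 1.0
```

A value of +1 means a perfect positive linear relationship, -1 a perfect negative one, and 0 no linear relationship.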
1.4 Key Algorithms โ when to use each
| Algorithm | Type | When to use |
|---|---|---|
| Decision Trees | Predictive (classif./regr.) | Tabular data, need interpretable rules, easy to explain to managers |
| Neural Networks | Predictive | Complex patterns (images, text), when you have lots of data |
| SVM (Support Vector Machine) | Predictive (classification) | Small/medium datasets, high-dimensional features |
| Random Forest | Predictive (ensemble) | Robust, high accuracy, general-purpose workhorse |
| K-Means | Descriptive (clustering) | Group customers/products, need fast unsupervised grouping (sensitive to choice of k) |
| Apriori | Descriptive (association) | Market basket analysis, find "A → B" rules |
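For a flavor of how the simplest of these works, here is a sketch of K-Means (Lloyd's algorithm) on one-dimensional toy data – real implementations handle many dimensions, random initialization, and restarts:

```python
def kmeans_1d(values, centroids, iters=20):
    # Lloyd's algorithm: assign each point to its nearest centroid,
    # then move each centroid to the mean of its cluster; repeat.
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 11, 95, 100, 98]  # two obvious customer groups
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 50.0])
print(sorted(round(c, 1) for c in centroids))  # [11.0, 97.7]
```

This also shows why K-Means is "sensitive to the choice of k": the code is told how many groups to find, it never discovers that number itself.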
1.5 Real-World Applications
1.6 Challenges & Ethics
Technical challenges
- Curse of dimensionality – too many variables make analysis hard
- Missing, noisy, inconsistent data
- Scalability – big data needs distributed systems
- Interpretability vs accuracy trade-off – neural nets are accurate but hard to explain
- Concept drift – patterns change over time (e.g. shopping habits during COVID)
Ethical considerations
- Privacy – using personal data responsibly
- Bias & fairness – don't amplify societal bias (e.g. hiring AI discriminating)
- Transparency – right to explanation for automated decisions
- Data ownership – who owns patterns found in your data?
- Security – protect sensitive mined knowledge
Week 2 – Foundations & Process
2.1 Formal Definition (Giudici, 2003)
"Data mining is the process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results for the owner of the database."
Three key ideas:
- Business Intelligence Process – not just an algorithm. It supports company decisions.
- Bottom-up exploration – unlike statistics, data mining searches for unknown patterns. Goals are not predetermined.
- Knowledge extraction – the ultimate aim is business advantage for the database owner.
The strategic decision from data mining creates new measurement needs → new business needs → new data → new analysis. It's a self-reinforcing feedback loop.
2.2 Statistics vs Machine Learning vs Data Mining
| Dimension | Statistics | Machine Learning | Data Mining |
|---|---|---|---|
| Primary Goal | Test hypotheses | Reproduce data-generating process | Extract business value |
| Data Type | Primary, experimental | Any, often large | Secondary, from warehouses |
| Approach | Top-down (confirmative) | Generalization from examples | Bottom-up (explorative) |
| Models | Single reference model | Multiple competing models | Multiple models, chosen by data fit |
| Role | Academic, testing | Prediction & classification | Decision support & strategy |
2.3 Knowledge Discovery in Databases (KDD) – History
2.4 Data Mining & Computing – The Intelligence Hierarchy
There is a trade-off: the more information a tool gives you, the harder it is to implement.
OLAP tests user hypotheses (user says "let me check if sales drop on Mondays"). Data Mining uncovers unknown relations ("the algorithm found that sales drop on rainy Mondays with football matches"). They are complementary, not competitors โ OLAP is often used in the preprocessing stages of data mining.
2.5 Data Mining vs Statistics – Key Differences
Berry and Linoff (1997) distinguish two approaches:
- Top-down (Confirmative) – Statistics. Confirms or rejects a hypothesis. Uses traditional statistical methods.
- Bottom-up (Explorative) – Data Mining. The user looks for useful information previously unnoticed, searching the data to create hypotheses.
In practice the two are complementary. Confirmative tools can check the discoveries made by exploratory mining.
Statistical criticisms of data mining (historical)
Statisticians once called it "data fishing", "data dredging", "data snooping":
- No single theoretical reference model – many models compete. Criticism: it is always possible to find some model that fits.
- The great amount of data may lead to non-existent relations being found.
Modern data mining addresses this by focusing on generalization: when choosing a model, predictive performance is prioritized and complex models are penalized.
2.6 The 7-Phase Data Mining Process
Four Implementation Phases (for integrating mining into the company)
- Strategic – study business procedures, identify where mining adds value.
- Training – run a pilot project, assess results.
- Creation – if the pilot works, plan the full reorganization.
- Migration – train users, integrate, evaluate continuously.
2.7 Three Classes of Data Mining Methods
| Class | Also called | Goal | Examples |
|---|---|---|---|
| Descriptive | Symmetrical / Unsupervised / Indirect | Describe groups of data briefly. Classify into groups not known beforehand. | Association methods, log-linear, graphical models, clustering |
| Predictive | Asymmetrical / Supervised / Direct | Describe one variable in relation to others. Classification or prediction rules. | Neural networks, decision trees, linear regression, logistic regression |
| Local | – | Identify characteristics in subsets of the database. | Association rules, anomaly/outlier detection |
2.8 Organisation of the Data – Pipeline
Why it matters: the way you analyze data depends on how the data is organised. Efficient analysis requires valid organisation. The analyst must be involved when setting up the database.
Data Warehouse → Data Webhouse (web-adapted) → Data Mart (thematic) → Data Matrix (ready for analysis).
Week 3 – Data Architecture Deep Dive
3.1 Why Data Organisation Matters
Before any mining algorithm runs, data must be accessible, consistent, and complete. Poor organisation creates three failures:
- Missing Information – operational databases (billing, orders) are built for transactions, not analysis. Fields analysts need may simply not exist.
- Noise & Inconsistency – duplicates, spelling variations ("Ltd" vs "Limited"), inconsistent date formats (DD/MM/YYYY vs MM/DD/YYYY). Algorithms treat noise as signal and produce confidently wrong answers.
- Inaccessible Silos – data scattered across CRM, ERP, marketing platform, ticketing system. Analysts spend time finding data, not mining it.
3.2 Deep Dive – The Data Warehouse
"A data warehouse is an integrated collection of data about a collection of subjects (units), which is not volatile in time and can support decisions taken by management." – Inmon (1996)
The 4 SIVT Characteristics – MEMORIZE
Data Warehouse Architecture (layers)
Two philosophies for building a warehouse
Top-Down (Inmon)
- Build enterprise warehouse FIRST
- Derive marts from warehouse after
- Quality enforced centrally
- Slow – 1–3 years before the first mart
- Common in regulated industries
- Risk: high upfront cost
Bottom-Up (Kimball)
- Build individual data marts FIRST
- Join marts into warehouse later
- Quality enforced per mart
- Fast – first mart in weeks
- More popular overall
- Risk: inconsistency across marts
Walmart's warehouse stored 2.5 petabytes, processed 1M+ transactions/hour, connected 100,000+ suppliers. Analysts mined the warehouse and found: beer and nappies (diapers) are frequently bought together on Friday evenings – new fathers on emergency supply runs. Walmart moved beer displays next to nappies, and sales of both products rose.
Lesson: No analyst would have thought to test this hypothesis. The warehouse made it possible to discover patterns from billions of transactions. This is the essence of data mining: finding what you were not looking for.
3.3 Deep Dive – Data Marts
A data mart is a thematic, subject-specific subset of a data warehouse – smaller, faster, built for a specific team.
Metaphor: If a warehouse is a large central library, a mart is a specialist reading room stocked with only books relevant to one department.
Three Types of Data Mart
| Type | Source | Pros | Cons |
|---|---|---|---|
| Dependent (gold standard) | From a central warehouse | Consistent, governed. Changes propagate everywhere. | Requires warehouse to exist first. |
| Independent | Directly from source systems | Fast to set up. | Creates silos. 20 marts = 20 different definitions of "revenue" – boardroom chaos. |
| Hybrid | Both warehouse AND direct sources | Flexible (e.g. clean history + real-time sensors). | More complex, potential inconsistency. |
From one warehouse, 4 dependent marts:
- Customer Mart – demographics, policy history → CRM team
- Claims Mart – claims events, fraud scores → actuaries
- Risk Mart – risk scores, exposure → underwriters
- Finance Mart – P&L by product line → CFO's team
All 4 draw from ONE warehouse. The actuaries' claims numbers and Finance's claims costs must match – because they come from the same source.
3.4 Deep Dive – Data Lake
Imagine a vast lake. Rivers flow in from everywhere – clean mountain streams (structured data), murky runoff (web logs), sediment-laden rivers (unstructured text). Everything coexists. You decide what to fish out, when you need it, using the right equipment.
A data lake stores all data – structured, semi-structured, unstructured, streaming – in its raw, unprocessed form.
What flows into a Data Lake (4 types)
- Structured – CSV, SQL tables, ERP exports (clear rows and columns)
- Semi-structured – JSON, XML, log files, sensor readings (some structure, not rigidly tabular)
- Unstructured – emails, PDFs, social media, images, audio, video (no predefined structure)
- Streaming – real-time clickstreams, IoT telemetry, market feeds (never stops flowing)
Schema-on-WRITE vs Schema-on-READ
| Concept | Warehouse (Schema-on-Write) | Lake (Schema-on-Read) |
|---|---|---|
| When schema defined | Before data is loaded | At query time, by user |
| Transformation | During ETL, before loading | On demand, when reading |
| Flexibility | Low – fixed at design | High – same data, many uses |
| Data quality | High (enforced at write) | Variable – raw may be messy |
| Speed to set up | Slow to set up, fast queries | Fast to set up, slow queries |
| Who defines meaning | Data engineers, upfront | Data scientists, at analysis time |
Many organizations built lakes enthusiastically, then abandoned them within two years. The failure mode is the data swamp: massive data with no metadata, no cataloguing, no access controls, no governance. Analysts can't find what they need. Sensitive data sits unprotected. The lake becomes an expensive, unusable swamp.
Lesson: a lake needs as much governance as a warehouse โ just applied at read time instead of write time.
Spotify: 600M+ users, 100B+ streams/day, 4PB+ new data/day. Stores every play, skip, pause, search, playlist event, plus raw audio waveforms, blog descriptions, acoustic features.
Mining this lake → Spotify Wrapped: every December, each user gets a personalized year summary. In 2023, Wrapped generated more social media engagement than any paid ad campaign Spotify ran. The data lake became marketing gold.
3.5 Deep Dive – Data Mesh
The problem Data Mesh solves
Warehouses, marts, and lakes all assume data is collected centrally by one data team. For large organizations with dozens of business units, this central team becomes a bottleneck: teams queue for weeks, by the time data is ready the question has changed.
The 4 Principles of Data Mesh (Zhamak Dehghani, 2019) – MEMORIZE
3.6 Case Study – Netflix (uses ALL 4 simultaneously)
Netflix is a perfect teaching example because NO single architecture serves all its needs:
- Warehouse (Hive + Presto) – financial reporting, subscriber metrics, quarterly reports. "How many subscribers in Q3?" Clean, governed data required by law.
- Mart (Personalisation) – recommendation engine. Needs millisecond queries, isolated from the finance warehouse.
- Lake (S3 "Data Platform") – raw event streams, A/B test logs, microservice telemetry. Petabytes of messy data for data scientists.
- Mesh (distributed teams) – payments, content, streaming quality each own their data products, published to a shared catalogue.
Different tasks have different requirements for freshness, speed, quality, governance, flexibility. No single architecture satisfies all requirements at once. Regulatory financial reporting can't run from a raw lake; millisecond personalization can't query a slow warehouse.
3.7 Architecture Comparison Table
| Dimension | Warehouse | Mart | Lake | Mesh |
|---|---|---|---|---|
| Data format | Structured only | Structured only | Any – raw | Any – the domain decides |
| Schema | Write-time (strict) | Write-time | Read-time (flexible) | Varies by domain |
| Primary users | Analysts, execs | Dept. analysts | Data scientists | Domain teams |
| Update freq. | Batch (hours/days) | Batch | Near real-time | Domain-managed |
| Scale | TB–PB | GB–TB | PB+ | Unlimited, distributed |
| Governance | Centralised, strict | Centralised | Often weak | Federated |
| Best for | BI & reporting | Dept. analytics | ML, exploration | Large agile orgs |
| Main risk | Expensive, slow | Proliferation | Data swamp | Governance complexity |
3.8 Decision Guide – Which Architecture?
- Warehouse when: mostly structured data, need governed/consistent reporting, serve finance/compliance/executives.
- Mart when: a specific team needs fast focused access; the warehouse is too complex for daily use.
- Lake when: large unstructured/semi-structured data; data scientists need raw data; need to store now, decide use later.
- Mesh when: many autonomous business units; central team is a bottleneck; domains need to move fast.
These 4 architectures are not mutually exclusive. Every large, mature organization uses ALL 4 simultaneously โ assigning different data to the right architecture for its use case. The question is never "which single one?" but "which fits this specific data and need?"
Week 4–5 – Data Characterization & Visualization
4.1 Data Objects and Attribute Types
A data object (also called an observation) represents an entity (e.g. a customer, a product). Each object is described by attributes (also called covariates, features, or variables).
| Type | Description | Examples |
|---|---|---|
| Nominal | Categorical, NO order | Hair color, Occupation, Country |
| Binary | Nominal with only 2 states | Yes/No, 0/1, Male/Female |
| Ordinal | Categorical WITH meaningful order | Education level, Size (S/M/L), Ratings (★★★) |
| Numeric | Measurable quantity | Temperature, Income, Age |
Numeric sub-types
- Discrete – finite or countably infinite values (zip codes, number of children). You can COUNT them.
- Continuous – real numbers / floating-point (height, weight). You MEASURE them.
4.2 Basic Statistical Descriptions
Central Tendency – "What is the typical value?"
- Mean – sum of all values ÷ N. Sensitive to outliers – one billionaire in the room pulls the average up.
- Median – the middle value when the data is sorted. Robust to outliers – preferred for skewed data (e.g. income, house prices).
- Mode – the value that appears most often. Can be multimodal (more than one mode).
- Midrange – average of max and min: (max + min) / 2. Rarely used alone – very sensitive to extremes.
Dispersion – "How spread out is the data?"
- Quartiles – Q1 = 25th percentile, Q3 = 75th percentile
- IQR (Interquartile Range) = Q3 - Q1. Measures the spread of the middle 50%.
- Variance – the average squared distance from the mean.
- Standard Deviation – the square root of the variance. Same unit as the data.
Five-Number Summary & Boxplot
The five-number summary is: Min, Q1, Median, Q3, Max. It's visualized using a boxplot:
- The box = IQR (Q1 to Q3)
- Line inside box = Median
- Whiskers extend to min/max (or to ±1.5 × IQR)
- Points beyond whiskers = outliers
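Python's standard `statistics` module can reproduce all of these on a toy sample (note the deliberate outlier):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # 100 is a suspicious outlier

mean = statistics.mean(data)
median = statistics.median(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Five-number summary: Min, Q1, Median, Q3, Max
summary = (min(data), q1, median, q3, max(data))
print(summary)  # (1, 3.0, 5, 7.0, 100)

# Boxplot outlier rule: beyond 1.5 * IQR from the box
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < low or v > high]
print(f"mean={mean:.1f} median={median} outliers={outliers}")
```

The single outlier drags the mean to 15.1 while the median stays at 5 – exactly the robustness point made above.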
4.3 Data Visualization – 4 Main Categories
| Technique | Description | Examples |
|---|---|---|
| Pixel-Oriented | Maps attribute values to colored pixels. Maximizes data on screen. | Space-filling curves, Circle segments |
| Geometric Projection | Projects high-dimensional data into lower-dimensional views. | Scatter-plot matrix, Parallel coordinates |
| Icon-Based | Encodes attributes as features of small icons. | Chernoff faces, Stick figures |
| Hierarchical | Partitions dimensions into subspaces, displays nested. | Tree-maps, Worlds-within-Worlds (n-Vision) |
4.4 Pixel-Oriented Visualization – IN DEPTH
Core concept
A normal line chart can show maybe 100–1,000 data points. A pixel-oriented visualization does something clever: every single pixel on the screen = one data value. So instead of 1,000 points, you can show tens of thousands at once.
Think: each tiny square = one observation.
The ordering problem – Space-Filling Curves
If we place pixels randomly, the picture becomes meaningless. We need a smart ordering that keeps similar values spatially close. The trick: use a space-filling curve – a continuous path that visits every point in a square exactly once, without jumping.
The most famous is the Peano-Hilbert curve. It looks like a twisty, recursive snake that fills a square completely. Its key property: nearby positions on the curve = nearby positions in time/data order. This preserves local structure.
Practical analogy: Take a long time-series list and instead of drawing it as a flat line, "fold" it into a square by bending it along the Hilbert curve. The data stays ordered, but is now packed into a compact square image.
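The "folding" can be sketched in a few lines of Python. This uses the classic bit-twiddling Hilbert d→(x, y) algorithm (a standard formulation, not something from the lecture slides) to fold a 16-step toy series into a 4×4 grid – consecutive time indices always land in adjacent cells:

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) in a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

# "Fold" a 16-step time series into a 4x4 grid: each cell stores its time index.
grid = [[0] * 4 for _ in range(4)]
for t in range(16):
    cx, cy = hilbert_d2xy(2, t)
    grid[cy][cx] = t
for row in grid:
    print(row)
```

In the printout, any two consecutive time indices sit in neighbouring cells – exactly the locality the colored finance panels rely on.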
Color encoding
- Bright / Green → High values
- Dark / Purple → Low values
(Exact colors depend on the scale used for each variable.)
Finance Example – IBM / Dollar / Dow Jones / Gold HIGHLY EXPECTED
5.1 The Dataset
A stock exchange database with 16,350 data items, covering 4 financial variables recorded daily from January 1987 to March 1993:
Approximate reconstruction – 4 panels, each using the Peano-Hilbert curve to show one variable over 6 years of daily data.
- IBM (top-left) – stock price
- Dollar (top-right) – US$ exchange rate
- Dow Jones (bottom-left) – stock market index
- Gold (US$) (bottom-right) – gold price
For each trading day, the values of all 4 variables are recorded. Each panel is a separate Peano-Hilbert visualization of one variable.
5.2 How to interpret the graphic
Why does it look like blocks?
- Time is preserved locally – nearby pixels = nearby time
- Globally, the curve bends – this creates patch-like colored regions
These blocks (patches) correspond to periods of similar market behavior – stable conditions or prolonged trends.
Reading the patterns – the 3 KEY interpretation rules
Rule 1 – a large purple patch = a prolonged low-price period.
Why? Combines two properties:
- Color: purple = low values
- Hilbert property: nearby pixels = nearby time
So a large connected purple region means MANY consecutive time points ALL with low values → prolonged low-price period.
Rule 2 – a cluster of green pixels = a consistently strong period.
Same logic reversed:
- Green = high values
- A cluster of green pixels = many high values close in time
Implies: a period where the index was consistently high → strong market.
Rule 3 – synchronized shifts across all 4 panels = a global event.
Why? The 4 panels represent 4 different market categories:
- Stock (IBM)
- Index (Dow Jones)
- Currency (Dollar)
- Commodity (Gold)
If they ALL shift at the same time, different markets moved together → this is consistent with a common external shock: financial crisis, policy change, oil shock, war, geopolitical event.
5.3 WHY use colored panels instead of a simpler chart?
Five key reasons:
- Data density – A line chart can show ~1,000 points before it becomes an unreadable mess. The pixel-oriented view shows 16,350 values clearly on one screen, one per pixel.
- Temporal locality is preserved – The Peano-Hilbert curve keeps points close in time also close in space. So we can SEE stable periods as colored blocks.
- Stable periods become visible – Large same-color areas = prolonged stable market conditions. Hard to see on a crowded line chart.
- Shocks and volatility jump out – Sudden color transitions are obvious to the eye.
- Cross-variable patterns are visible – With 4 panels side-by-side, simultaneous color shifts across panels reveal global events affecting multiple markets at once – something line charts can't easily show with 4 overlapping lines.
One-sentence summary: Colored pixel panels compress massive time-series into visible structure, revealing stable periods, volatility, and cross-market co-movements that a simple line chart cannot show because of limited screen space and no way to compare 4 variables simultaneously.
5.4 What to look for
- Same-color regions → stable periods
- Sudden color changes → shocks or volatility
- Patterns across panels → relationships between variables
- Synchronized shifts across all 4 panels → possible global event (e.g. 1987 Black Monday, early-90s recession)
Measuring Similarity & Dissimilarity W5
6.1 Why this matters
Many mining algorithms (clustering, nearest-neighbor, recommendation) need to answer: how similar is object A to object B? For that we need a distance measure.
6.2 Two fundamental data structures
- Data Matrix – n × p (n objects as rows, p attributes as columns). Object-by-Attribute format.
- Dissimilarity Matrix – n × n (each entry = distance between two objects). Object-by-Object format.
6.3 Minkowski Distance – the general formula
Distance between two objects x and y in p dimensions:
d(x, y) = ( Σ |xᵢ - yᵢ|^h )^(1/h), with the sum running over i = 1 … p.
Special cases of Minkowski
| h value | Name | Idea | When to use |
|---|---|---|---|
| h = 1 | Manhattan / City Block | Sum of absolute differences along each axis | Grid-like movement, e.g. streets in Manhattan |
| h = 2 | Euclidean | Straight-line distance (Pythagoras!) | Default for continuous numeric data |
| h → ∞ | Supremum / Chebyshev | Maximum absolute difference across dimensions | When worst-case deviation matters |
Two points (1,1) and (4,5):
- Manhattan: |4-1| + |5-1| = 3 + 4 = 7 (walk along streets)
- Euclidean: √((4-1)² + (5-1)²) = √(9+16) = √25 = 5 (fly in a straight line)
- Supremum: max(|4-1|, |5-1|) = max(3, 4) = 4 (biggest single gap)
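The same three numbers fall out of a tiny Python sketch (using the toy points from the example above):

```python
def minkowski(x, y, h):
    """Minkowski distance of order h: h=1 gives Manhattan, h=2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """The h -> infinity limit (Chebyshev): only the biggest single gap counts."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1, 1), (4, 5)
print(minkowski(p, q, 1))  # 7.0 -> Manhattan
print(minkowski(p, q, 2))  # 5.0 -> Euclidean
print(supremum(p, q))      # 4   -> Supremum / Chebyshev
```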
6.4 Cosine Similarity – for sparse/text data
Cosine similarity measures the angle between two vectors, not their magnitude. Ideal for high-dimensional sparse data like text documents (where most values are 0).
- sim = 1 → vectors point in the same direction (identical orientation)
- sim = 0 → orthogonal (completely dissimilar)
- sim = -1 → opposite direction
Direction-sensitive, not magnitude-sensitive. A 10-word doc and a 1000-word doc about the same topic have high cosine similarity even though their lengths differ hugely.
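A minimal sketch (with made-up word-count vectors) shows this magnitude-blindness in action:

```python
import math

def cosine_similarity(x, y):
    """Angle-based similarity: dot product divided by the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Word-count vectors over a shared 4-word vocabulary (invented counts).
short_doc = (2, 1, 0, 1)
long_doc = (200, 100, 0, 100)  # 100x longer, same topic "direction"

print(cosine_similarity(short_doc, long_doc))         # ~1.0 despite the length gap
print(cosine_similarity((1, 0, 0, 0), (0, 1, 0, 0)))  # 0.0 -> orthogonal
```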
6.5 Quick-reference summary
| Measure | Best For | Notes |
|---|---|---|
| Euclidean | Continuous numeric data | Default choice, uses magnitude |
| Manhattan | Grid-like or when axes differ | Less sensitive to outliers than Euclidean |
| Cosine | Text, sparse high-dim data | Ignores magnitude, uses angle |
| Supremum | Worst-case scenarios | Only largest difference matters |
English Vocabulary Glossary TERMS
Key English words you might encounter in the exam questions. Memorize these meanings.
Practice Exam – Case-Study Questions PRACTICE
Below are 18 practice questions in the format your professor uses – classic (not MCQ), case-study style. Try to answer each one in your head (or on paper) BEFORE checking the model answer. This is the most important part of your preparation.
Q1 (HIGH PRIORITY, KDD). Define Data Mining and explain the 5 steps of the KDD process. For each step give one real example of what could go wrong.
Definition: Data Mining is the process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results for the owner of the database. It is also known as Knowledge Discovery in Databases (KDD).
The 5 KDD steps:
- Data Selection – identify relevant data sources and extract the target dataset.
What can go wrong: picking the wrong date range, e.g. using only 2020 COVID-year data to predict "normal" customer behavior.
- Preprocessing – handle missing values, noise, inconsistencies.
What can go wrong: failing to standardize "M"/"Male"/"male" – the algorithm treats them as 3 different categories.
- Transformation – feature engineering, normalization, dimensionality reduction.
What can go wrong: not scaling variables, so one feature (e.g. income in thousands) dominates another (age 0–100).
- Data Mining – apply algorithms to discover patterns.
What can go wrong: choosing a neural network for a small dataset when a decision tree would give an interpretable answer.
- Interpretation / Evaluation – evaluate, visualize, communicate findings.
What can go wrong: presenting a correlation as causation, or reporting an accuracy that hides poor performance on rare classes.
Key point: each stage feeds the next – data quality at early stages directly affects the value of the final knowledge. (Garbage in, garbage out.)
Q2 (HIGH PRIORITY, ARCHITECTURE). CASE STUDY: A mid-sized retail bank has: a 10-year Oracle data warehouse with clean account data, 8 million unstructured customer support email threads stored as text files, real-time ATM sensor logs, and 4 teams all queuing for one 6-person data engineering team. Recommend an architecture strategy.
Step 1 – Assign each data type to the right architecture:
- Account/transaction data → Keep the Data Warehouse. Structured, governance-critical for regulatory reporting (Basel III, GDPR). Don't throw away 10 years of clean data.
- 8M customer support emails → Data Lake. Unstructured text, schema-on-read, needs NLP and exploratory analysis (sentiment, topic modelling, fraud language detection).
- Real-time ATM sensor logs → Data Lake with streaming ingestion (e.g. Apache Kafka). Real-time patterns require stream processing, not batch SQL. Supports fraud detection and predictive maintenance.
- Bottleneck of 4 teams fighting 1 data team → Move toward Data Mesh principles: each team (retail, wealth, fraud, credit risk) becomes a domain owning its own data product, with a shared self-serve platform.
Step 2 – Add Data Marts for the 4 teams: dependent marts built from the warehouse give each team focused, fast access without exposing the whole warehouse.
Step 3 – The governance caveat: Banks are heavily regulated. Data Mesh principle 4 (federated governance) is non-negotiable here. Central standards on privacy, access control, and regulatory compliance must be enforced even while domains own their pipelines.
Final recommendation: A hybrid architecture using all 4 patterns simultaneously – warehouse for structured governed data, lake for unstructured/streaming, marts for team-level analytics, mesh principles to remove the bottleneck under strict federated governance.
Q3 (HIGH PRIORITY, FINANCE VIZ). A finance lecturer shows you 4 colored pixel panels displaying IBM stock, Dollar rate, Dow Jones, and Gold, over 6 years. Why is this visualization preferred over a simpler line chart? What could you learn from it?
Why colored pixel panels beat a simple line chart – 5 reasons:
- Data density: The dataset has 16,350 daily values. A line chart becomes unreadable beyond ~1,000 points; a pixel-oriented view shows every value – one pixel per observation.
- Temporal locality preserved: The Peano-Hilbert curve places nearby-in-time pixels nearby in space, so stable periods appear as colored blocks the eye can instantly see.
- Pattern visibility: Prolonged low/high periods become large single-color patches (purple = low, green = high). Sudden color shifts highlight volatility and shocks.
- Cross-variable comparison: 4 panels side-by-side let you spot simultaneous color shifts – something a line chart with 4 overlapping lines hides in clutter.
- Global-event detection: When Stock + Index + Currency + Commodity all change color in the same region, it signals a systemic shock (financial crisis, oil shock, policy change, war).
What you could learn from it:
- Periods of market stability (large same-color patches)
- Bull markets (green clusters in Dow Jones)
- Bear markets or low-price periods (purple clusters in IBM or Dow)
- Correlations between markets – do IBM and Dow move together? Does Gold rise when Dollar falls?
- Global shocks – 1987 Black Monday or the early 1990s recession could appear as synchronized color transitions across all 4 panels
One-line summary: Simple line charts cannot show 16,000+ points or compare 4 variables visually at once; pixel panels with the Hilbert curve compress this into a visual form the human eye can read in seconds.
Q4 (W2). Explain the differences between Statistics, Machine Learning, and Data Mining. Why are they said to be "complementary"?
Core differences:
| Aspect | Statistics | Machine Learning | Data Mining |
|---|---|---|---|
| Primary goal | Test hypotheses, explain phenomena | Reproduce data-generating process; predict new cases | Extract business value from data |
| Approach | Top-down (confirmative) | Generalization from examples | Bottom-up (explorative) |
| Data | Primary, experimental, curated | Any, often large-scale | Secondary, observational, from warehouses |
| Model use | Single reference model guided by theory | Multiple competing models | Multiple models, chosen by data fit |
| Domain | Academic research, testing | Prediction, classification | Decision support, strategy |
Why complementary:
- Statistics provides the theoretical grounding and rigor for methods data mining uses.
- Machine learning provides many of the algorithms (neural nets, decision trees) that data mining applies.
- Data mining expands scope and application – turning them toward business advantage.
- Explorative bottom-up mining can generate hypotheses; confirmative top-down statistics can then test them. Each enhances the other.
Modern data mining explicitly borrows from both: it uses ML algorithms but applies statistical generalization principles (penalizing overly complex models) to avoid the historical critique of "data fishing".
Q5 (HIGH PRIORITY, WAREHOUSE). Define a data warehouse. Explain Inmon's 4 SIVT characteristics with an example of each from a bank context.
Definition (Inmon, 1996): A data warehouse is an integrated collection of data about a collection of subjects, which is not volatile in time and can support decisions taken by management.
The 4 SIVT characteristics:
- Subject-Oriented (S)
Organised around business subjects (Customer, Product, Transaction), NOT around the applications that created the data.
Bank example: The billing application is organised around invoices. The warehouse reorganises this around Customers – because that's what analysts study.
- Integrated (I)
Unifies data from many sources using consistent naming, encoding, and measurement conventions.
Bank example: The retail system codes gender as M/F, the mortgage system uses 1/0, legacy systems use "Male"/"Female". The warehouse unifies all three into one consistent encoding.
- Non-Volatile (V)
Data is loaded once and never modified. Changes are recorded as new rows with timestamps.
Bank example: When a customer moves from London to Manchester, the operational CRM updates the record. The warehouse adds a new row with the new address, preserving the old address with its timestamp – so you can always see where the customer lived on any past date.
- Time-Variant (T)
Stores 5–10 years of history (operational DBs only keep 60–90 days). Every record has a time dimension.
Bank example: The warehouse lets analysts compare Q3 2023 mortgage performance with Q3 2015 – trend analysis, seasonality, long-range forecasting. Operational systems have thrown that data away.
Why these properties matter: Together SIVT makes a warehouse trustworthy, consistent, and historically complete – the three prerequisites for reliable analytics and regulatory reporting.
Q6 (DATA MART). Distinguish between Dependent, Independent, and Hybrid data marts. An insurance company has independent marts for claims, underwriting, and finance – and board meetings always end in arguments over numbers. Explain why.
The 3 types:
- Dependent mart – derived directly from a central data warehouse. The warehouse is the single source of truth. Changes in business rules propagate automatically. This is the gold standard.
- Independent mart – built directly from source systems, bypassing the warehouse. Faster to set up but creates its own silo with its own ETL, its own definitions, and its own quality problems.
- Hybrid mart – pulls from both warehouse and source systems. Adds flexibility (e.g., clean history + real-time streaming) but increases complexity and inconsistency risk.
Why the board arguments happen:
With independent marts, each team built its own extraction from source systems with its own interpretation of the business rules. So:
- The Claims mart calculates "total claims paid" one way (e.g., gross of reinsurance).
- The Underwriting mart calculates it differently (net of reinsurance recoveries).
- The Finance mart calculates it yet a third way (including loss-adjustment expenses).
When the CFO asks "What were our claims costs last quarter?" she gets 3 different numbers from 3 marts, all "correct" by their own definitions. The boardroom debates whose number to believe instead of making decisions.
The fix: Convert to dependent marts derived from a single enterprise data warehouse. Define "claims cost" once in the warehouse; let all marts inherit that definition. The numbers then agree by construction.
Q7 (DATA LAKE). What is a Data Lake? Explain "schema-on-read" vs "schema-on-write". What is a "data swamp" and how do you prevent it?
Data Lake: a centralized repository that stores all structured, semi-structured, unstructured, and streaming data at any scale in its raw, unprocessed format.
Schema-on-Write (Warehouse) – the data's structure is defined before it is loaded. ETL cleans and standardizes. Data quality is enforced at write time. Rigid but trustworthy.
Schema-on-Read (Lake) – data is stored raw. Structure is applied at query time by whoever reads it. The same raw data can be interpreted many ways for different use cases. Flexible but quality varies.
The Data Swamp Problem: A data swamp is a lake that has ingested vast amounts of data with no metadata, no cataloguing, no access controls, no governance. The symptoms:
- Analysts cannot find what they need
- Nobody knows which version of a dataset is current
- Sensitive personal data sits unprotected
- The lake becomes an expensive, unusable mess
Prevention – 4 things:
- Metadata & data catalog – every dataset must be documented (what it is, where it came from, who owns it). Tools: Apache Atlas, AWS Glue Data Catalog, Unity Catalog.
- Access controls – role-based permissions, especially for personal/sensitive data.
- Quality monitoring – automated checks for freshness, completeness, schema drift.
- Data governance policies – lifecycle rules, retention, privacy compliance (GDPR).
Key insight: A lake needs as much governance as a warehouse. The governance is just applied at read time instead of write time. Skipping governance is the root cause of every data swamp.
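Schema-on-read can be sketched in a few lines of Python – hypothetical event records with invented field names, where the SAME raw strings get two different structures applied at read time:

```python
import json

# Raw events land in the lake untouched -- no schema enforced at write time.
raw_events = [
    '{"user": "a1", "action": "login", "ts": "2024-01-01T09:00:00"}',
    '{"user": "a2", "action": "purchase", "amount": 19.99, "ts": "2024-01-01T09:05:00"}',
]

# Reader 1 (audit view): who did what -- structure chosen at read time.
audit_view = [(e["user"], e["action"]) for e in map(json.loads, raw_events)]

# Reader 2 (finance view): total revenue, defaulting missing amounts to 0.
revenue = sum(e.get("amount", 0) for e in map(json.loads, raw_events))

print(audit_view)  # [('a1', 'login'), ('a2', 'purchase')]
print(revenue)     # 19.99
```

Each reader imposes its own schema on the same raw bytes – which is exactly why a catalog is needed to keep track of what those bytes mean.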
Q8 (DATA MESH). A global tech company has 40 business units, each waiting weeks for a 15-person central data team to build pipelines. Is Data Mesh the answer? Explain the 4 principles and one risk.
Short answer: Yes, Data Mesh directly addresses this bottleneck – provided the company invests in a shared platform and governance.
The 4 Principles of Data Mesh (Zhamak Dehghani, 2019):
- Domain Ownership – each business domain (payments, logistics, customers) owns, manages, and is accountable for its own data. The people who create the data (and understand it best) should own it. No more central gatekeeper.
- Data as a Product – each domain's data is treated like a software product: clear owners, SLAs for uptime and freshness, documentation, quality standards, versioning. Designed with consumers (other teams) in mind.
- Self-Serve Data Infrastructure – a shared platform abstracts complexity away. Templates for pipelines, cataloguing tools, access control, quality monitoring, compute infrastructure. Enables autonomy without chaos.
- Federated Computational Governance – global standards (privacy, security, GDPR, compliance) defined and enforced at enterprise level. Domains choose HOW to implement but cannot opt out. Autonomy WITH shared rules.
Why it fits this case: 40 units queuing for a 15-person team is a textbook bottleneck. With mesh, each unit publishes its own data products through a shared catalogue. Any team can discover and consume any other team's data – like a marketplace, with no central hub.
The main risk – governance complexity: Decentralization without standards creates anarchy. If each of 40 units defines "active customer" differently, the company ends up with the same "multiple versions of truth" problem that independent marts create. Principle 4 (federated governance) is non-negotiable. For regulated industries (banking, healthcare), this is especially critical – you cannot outsource GDPR or HIPAA compliance to domain teams.
Other risks: cultural change (domain teams need new skills), upfront platform investment, possible duplication of pipelines across domains.
Q9 (HIERARCHY). Explain the "intelligence hierarchy" – Query & Reporting, Data Retrieval, OLAP, Data Mining. Why is Data Mining at the top? How are OLAP and Data Mining related?
The hierarchy (lowest to highest information capacity):
| Tool | What it does | Capacity | Difficulty |
|---|---|---|---|
| Query & Reporting | Retrieve and display data in tables/reports | Lowest | Easiest |
| Data Retrieval | Extract by pre-specified criteria ("all customers who bought A and B") | LowโMedium | Easy |
| OLAP | User tests hypotheses via multidimensional hypercubes | Medium | Medium |
| Data Mining | Discovers unknown relations automatically | Highest | Hardest |
Why Data Mining is at the top: The other tools answer questions you already know to ask. Data Mining discovers patterns you didn't know to look for. It brings together all variables in different ways, finding unknown relations. That's the highest information capacity – but also the hardest to implement, requiring skilled teams and high-quality data.
The trade-off: there is an inverse relationship between information capacity and ease of implementation. A simple query is quick but tells you only "what happened"; data mining takes months/years but reveals "what patterns exist that we never suspected".
OLAP vs Data Mining – complementary, not competing:
- OLAP tests user-driven hypotheses ("let me check sales by region by quarter"). The user asks the questions.
- Data Mining uncovers unknown relations automatically. The data reveals the questions.
- OLAP is often used in the preprocessing stages of data mining – to understand the data, identify special cases, spot principal interrelations before applying mining algorithms.
- Used together they create useful synergies – OLAP scopes what to mine, mining reveals what OLAP should investigate further.
Important detail: OLAP fails with tens or hundreds of variables – the hypothesis space becomes too complex for a human to navigate. That's where data mining's automation becomes essential.
Q10 (MINING TASKS). Distinguish between predictive and descriptive data mining. Give 2 real-world examples of each. Where does "association rules" fit, and why is the beer-and-nappies story famous?
Predictive (Asymmetrical / Supervised / Direct):
Describes one or more target variables in relation to others. There is a known answer to predict.
- Classification – categorize into known classes. Example: Gmail classifying emails as spam or inbox.
- Regression – predict a continuous value. Example: Zillow estimating house prices from features.
- Time-series forecasting – predict future from past. Example: forecasting next month's electricity demand.
Descriptive (Symmetrical / Unsupervised / Indirect):
Describes groups of data more briefly. Observations classified into groups not known beforehand; variables connected through association or graphical models.
- Clustering – group similar items without labels. Example: a marketing team clustering customers into personas (bargain hunters, loyalists, premium buyers).
- Association rules – find "if A, then B" co-occurrence patterns. Example: Netflix viewers who watched Stranger Things also watched Dark.
- Anomaly detection – identify unusual observations. Example: credit card fraud detection.
Where "association rules" fit: They are a local method – a descriptive technique that identifies characteristics in subsets of the database (not the whole dataset). Giudici classifies them as local, alongside outlier detection.
Why the beer-and-nappies story is famous:
- Walmart mined billions of transactions from their 2.5 petabyte warehouse.
- Discovered: beer + nappies bought together on Friday evenings.
- Investigation revealed new fathers on "emergency supply runs".
- Walmart moved beer displays next to nappies – sales of both rose.
The lesson: No analyst would have thought to test the beer-nappy hypothesis. The unsupervised association-rule mining found a pattern nobody was looking for. This is the essence of data mining: finding what you were not looking for. It's the canonical example of association-rule value and defines descriptive mining in action.
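The mechanics behind an association rule can be shown with a toy sketch – made-up baskets, computing the two standard rule metrics, support and confidence:

```python
# Made-up transaction baskets; the classic rule to test is {nappies} -> {beer}.
baskets = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"milk", "bread"},
    {"beer", "crisps"},
    {"nappies", "milk"},
]

def support(itemset):
    """Fraction of baskets containing every item in the set."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, how many also have the consequent?"""
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "nappies"}))       # 0.4 -> together in 2 of 5 baskets
print(confidence({"nappies"}, {"beer"}))  # ~0.67 -> nappy-buyers usually buy beer
```

Real miners (e.g. the Apriori algorithm) search ALL itemsets for rules exceeding minimum support and confidence thresholds – which is how a pattern nobody thought to test gets found.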
Q11 (ATTRIBUTES). A hospital dataset has: Patient ID, Gender, Age, Blood Type, Satisfaction Rating (1–5), Temperature, Diabetic (yes/no), Postcode, Number of Visits. Classify each attribute type and justify.
| Attribute | Type | Sub-type | Why |
|---|---|---|---|
| Patient ID | Nominal | – | Category with no order (just a label); arithmetic on IDs is meaningless |
| Gender | Binary (Nominal) | – | Only 2 states (M/F); subset of nominal |
| Age | Numeric | Continuous (or Discrete in years) | Measurable quantity; can be compared and averaged |
| Blood Type | Nominal | – | Categories (A, B, AB, O) with no meaningful order |
| Satisfaction Rating (1–5) | Ordinal | – | Has meaningful order (5 is better than 1), but gaps between levels may not be equal |
| Temperature | Numeric | Continuous | Real-valued, measurable on a continuous scale |
| Diabetic (yes/no) | Binary (Nominal) | – | Only 2 states |
| Postcode | Nominal | – | LOOKS like a number, but order/arithmetic is meaningless – it's a categorical label |
| Number of Visits | Numeric | Discrete | Countable (0, 1, 2, ...); cannot have "2.7 visits" |
Key lesson: Just because something LOOKS like a number doesn't mean it's numeric. Postcodes, zip codes, and student IDs are stored as numbers but are nominal – no order, no arithmetic. Asking a mining algorithm to "average postcodes" is nonsense. Always ask: does order matter? does distance make sense? If not, it's nominal (or ordinal at best).
Q12 (STATISTICS). For the income dataset {20k, 22k, 24k, 25k, 26k, 28k, 30k, 1M}, compute mean, median, mode, and IQR. Which central-tendency measure best describes "typical income"? Why?
Sorted data: 20, 22, 24, 25, 26, 28, 30, 1000 (all in $1000s).
Mean: (20 + 22 + 24 + 25 + 26 + 28 + 30 + 1000) / 8 = 1175 / 8 ≈ $146.9k
Median: With 8 values, the median is the average of the 4th and 5th sorted values: (25 + 26) / 2 = $25.5k
Mode: No value repeats, so technically no mode (or every value is a mode, depending on convention).
Quartiles and IQR:
- Q1 (25th percentile) = median of lower half {20, 22, 24, 25} = (22 + 24) / 2 = 23
- Q3 (75th percentile) = median of upper half {26, 28, 30, 1000} = (28 + 30) / 2 = 29
- IQR = Q3 - Q1 = 29 - 23 = 6 (i.e., $6k)
Best measure of "typical income" – MEDIAN ($25.5k):
- The mean ($146.9k) is badly distorted by the $1M outlier – no one in this group actually earns anywhere near $146.9k.
- The median is robust to outliers: removing or changing that $1M value doesn't move the median at all.
- The IQR (6) also shows that the middle 50% of the data is tightly clustered between $23k and $29k – the outlier is a single extreme value, not representative.
General rule: for skewed data (income, house prices, company revenues), always prefer the median over the mean. It's why governments report "median household income", not "mean".
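The whole computation can be double-checked in Python (using the same "median of each half" quartile convention as the answer above – note that libraries often default to other interpolation conventions, which give slightly different quartiles):

```python
import statistics

incomes = [20, 22, 24, 25, 26, 28, 30, 1000]  # sorted, in $1000s

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# Quartiles as the median of each half of the sorted data.
half = len(incomes) // 2
q1 = statistics.median(incomes[:half])  # lower half {20, 22, 24, 25}
q3 = statistics.median(incomes[half:])  # upper half {26, 28, 30, 1000}
iqr = q3 - q1

print(mean)         # 146.875 -> distorted by the $1M outlier
print(median)       # 25.5    -> robust
print(q1, q3, iqr)  # 23.0 29.0 6.0
```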
Q13 (SIMILARITY). Two customers are described by (Age, Income, PurchaseFrequency). Customer A = (25, 40, 12). Customer B = (30, 55, 18). Compute Euclidean and Manhattan distance. Why might Cosine similarity be inappropriate here?
Differences per dimension:
- |25 - 30| = 5 (Age)
- |40 - 55| = 15 (Income)
- |12 - 18| = 6 (Purchase Frequency)
Manhattan distance (h = 1):
D = 5 + 15 + 6 = 26
Euclidean distance (h = 2):
D = √(5² + 15² + 6²) = √(25 + 225 + 36) = √286 ≈ 16.91
Why Cosine similarity is inappropriate here:
- Cosine measures the angle between vectors, ignoring magnitude. Two customers with identical purchasing patterns at different scales (e.g., one is 2× the other on every feature) would have cosine similarity 1 – deemed "identical".
- But in customer analytics, magnitude matters a lot: a 25-year-old earning $40k with 12 purchases is genuinely different from a 50-year-old earning $80k with 24 purchases, even though they have the same "shape".
- Cosine is designed for sparse, high-dimensional data where direction matters more than magnitude – most typically text documents, where doc length varies wildly but topic/direction is what counts.
- For dense, low-dimensional numeric data like customer attributes, use Euclidean or Manhattan.
One more caveat – scaling: Notice that Income (difference of 15) dominates the Euclidean distance because it has a larger numeric range than Age (difference of 5). In practice, before computing distance, normalize or standardize the variables so no single attribute dominates. This is part of KDD step 3 (Transformation).
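A quick Python check of these numbers, plus a demonstration of why cosine is the wrong tool here (using a hypothetical "doubled" customer that cosine cannot distinguish from A):

```python
import math

a = (25, 40, 12)  # Customer A: (Age, Income in $1000s, PurchaseFrequency)
b = (30, 55, 18)  # Customer B

manhattan = sum(abs(x - y) for x, y in zip(a, b))
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y)))

doubled = tuple(2 * v for v in a)  # hypothetical: every value of A doubled

print(manhattan)            # 26
print(round(euclidean, 2))  # 16.91
print(cosine(a, doubled))   # ~1.0 -> cosine calls A and the doubled customer identical
```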
Q14 (7-PHASE PROCESS). A retail chain wants to use data mining to reduce customer churn. Walk through Giudici's 7-phase data mining process for this specific scenario.
- A – Define Objectives: Aim = reduce monthly churn by 15% in 6 months. Specifically, identify customers at high risk of churning 30–60 days before they leave, so marketing can intervene. This is the MOST critical phase – all downstream work depends on it.
- B – Select & Organise Data: Pull from internal sources (cheaper, more reliable). Needed: transaction history, loyalty-card data, support call logs, demographics, marketing interactions. Build a Customer data mart from the warehouse. Data cleansing: handle missing emails, standardize phone formats, remove deceased customers.
- C – Exploratory Analysis & Transformation: Use OLAP-style analysis: what % churn per store region? per age group? per tenure? Create derived features (recency of last visit, frequency per month, monetary value – the RFM variables). Detect anomalies: customers with $0 spend but 20 visits (probably a card error).
- D – Specify Methods: This is a predictive task (target = "will churn in next 60 days?"). Candidate methods: decision tree (interpretable, easy to explain), logistic regression (classic baseline), random forest (higher accuracy), neural net (if lots of data). A decision tree might be chosen first for interpretability with the marketing team.
- E – Data Analysis: Train multiple models on historical data (e.g., 2022–2023). Use cross-validation. Track precision, recall, AUC, and lift at different score thresholds.
- F – Evaluate & Compare: Compare models on held-out data. Consider also training time, interpretability, ease of deployment, and data quality sensitivity. If two models tie on accuracy, choose the more interpretable one for business adoption.
- G – Interpretation & Implementation: Integrate the chosen model into the CRM: each night, score every customer's churn probability. High-scoring customers get a retention offer (discount, loyalty bonus, personal call). Track actual churn reduction versus baseline over the 6-month period. The virtuous circle of knowledge kicks in – new churn data flows in, the model is retrained.
Implementation phases (from Section 6.1): Strategic (where does mining add most benefit?) → Training (pilot on 3 stores) → Creation (scale to all stores) → Migration (train users, evaluate continuously).
VIZ TECHNIQUES
Q15. Name and briefly describe the 4 main categories of data visualization. For each, give one scenario where it is the best choice.
- Pixel-Oriented Visualization
Maps each data value to one colored pixel. Uses space-filling curves (e.g., Peano-Hilbert) to order pixels meaningfully. Maximum data density on a screen.
Best scenario: visualizing 16,000+ daily financial observations across multiple markets – exactly like the IBM/Dollar/Dow/Gold example.
- Geometric Projection
Projects high-dimensional data into lower-dimensional space. Examples: scatter-plot matrix (pairwise scatterplots of all variables), parallel coordinates (each variable = vertical axis, each point = a line crossing them).
Best scenario: exploring relationships among 6–12 numeric variables in a customer dataset to find clusters or correlations.
- Icon-Based Visualization
Encodes multiple attributes as features of small icons. Examples: Chernoff faces (face features encode variables), stick figures.
Best scenario: showing multi-dimensional patient health profiles, where humans instinctively read "facial" patterns – a clinician can spot anomalies faster via Chernoff faces than via a table of numbers.
- Hierarchical Visualization
Partitions dimensions into subspaces and displays them in nested structures. Examples: tree-maps (nested rectangles sized by value), Worlds-within-Worlds (n-Vision).
Best scenario: visualizing budget allocations across departments, sub-departments, and line items โ each level is a rectangle whose area = spend.
Choosing rule: pixel-oriented for massive time series, geometric projection for relationships among moderate-dimensional numeric data, icon-based for human-pattern-matching, hierarchical for nested/tree-structured data.
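The pixel-oriented idea (bin each value into a color class so that stable periods become same-color runs and shocks become sudden color shifts) can be sketched minimally. Note this toy omits the space-filling-curve ordering that real pixel-oriented displays use; the function and data are invented for illustration.

```python
def color_bin(value, vmin, vmax, n_bins=5):
    """Map a value into one of n_bins equal-width color bins (0..n_bins-1)."""
    if vmax == vmin:
        return 0
    frac = (value - vmin) / (vmax - vmin)
    return min(int(frac * n_bins), n_bins - 1)  # clamp the max value

prices = [100, 101, 99, 150, 152, 149]  # a stable block, then a jump
lo, hi = min(prices), max(prices)
bins = [color_bin(p, lo, hi, n_bins=5) for p in prices]
# the stable period maps to one color class, the shock level to another
```

On a real screen each bin index would be a color, one pixel per observation, which is how 16,000+ points fit in one image.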
METADATA
Q16. What is metadata? Why is it called "data about data"? Why does a warehouse without metadata fail, even if the warehouse itself is technically perfect?
Metadata = data ABOUT data. It describes the structure, meaning, history, and ownership of the data itself.
Typical metadata answers:
- How was this variable calculated? (e.g., "monthly_revenue" = sum of completed transactions, excluding refunds, in local currency)
- When did a field definition change? (e.g., "region" was redefined in 2021 when sales territories were restructured)
- Which source system did each record come from?
- Who last modified the record, when, and why?
Why a warehouse without metadata fails:
- The Library analogy: A warehouse without metadata is like a library without a catalogue. The books exist; nobody can find what they need.
- The Definition trap: Two teams compute "active customer" – one uses "bought in last 30 days", the other uses "logged in last 30 days". Both put numbers in the warehouse. Without metadata, the CFO sees two conflicting counts with no way to know which is correct. Meetings descend into definition arguments.
- Historical changes get lost: If the "region" field was redefined in 2021, any trend analysis crossing 2021 is meaningless unless the change is documented in metadata.
- Compliance failure: GDPR and other regulations require organizations to know where personal data came from and where it's used. Without metadata, you can't fulfil "right to be forgotten" requests.
Metadata increases the VALUE of the warehouse because it makes the data reliable, trustworthy, and traceable. Many organizations invest heavily in the warehouse infrastructure but skimp on metadata – and their warehouse becomes a source of constant arguments rather than clear answers.
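A metadata catalogue can start as nothing more than a structured registry answering the questions above. A minimal sketch; the field names, sources, owners, and change notes are all invented for illustration:

```python
# Hypothetical per-field metadata registry: definition, source system,
# owner, and change history for each warehouse field.
metadata = {
    "monthly_revenue": {
        "definition": "sum of completed transactions, excluding refunds, local currency",
        "source_system": "POS",
        "owner": "finance-team",
        "changes": [],
    },
    "region": {
        "definition": "sales territory of the customer's home store",
        "source_system": "CRM",
        "owner": "sales-ops",
        "changes": [{"date": "2021-06-01", "note": "territories restructured"}],
    },
}

def definition_of(field):
    """Look up the agreed business definition of a field."""
    return metadata[field]["definition"]
```

Even this much resolves the "Definition trap": both teams must cite the registry entry, not their own spreadsheet.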
COMPARISON
Q17. In one table, compare Data Warehouse, Data Mart, Data Lake, and Data Mesh across: data format, users, governance, main risk. In one sentence per architecture, say when you'd choose it.
| Dimension | Data Warehouse | Data Mart | Data Lake | Data Mesh |
|---|---|---|---|---|
| Data format | Structured only | Structured only | Any – raw | Any – domain decides |
| Primary users | Analysts, executives | Dept. analysts | Data scientists, ML engineers | Domain product teams |
| Governance | Centralised, strict | Centralised | Often weak (swamp risk) | Federated (shared rules) |
| Main risk | Expensive, slow ETL | Proliferation of inconsistent marts | Data swamp without governance | Governance complexity across domains |
When to choose each โ one-sentence rule:
- Warehouse: when you need reliable, governed, consistent reporting for structured data – typically finance, compliance, and executive dashboards.
- Mart: when a specific team needs fast, focused access to a subset of warehouse data without navigating the full complexity.
- Lake: when you have large unstructured or semi-structured data that data scientists need raw for ML training, or when you want to store first and decide use later.
- Mesh: when your organization has many autonomous business units, a central data team has become a bottleneck, and domain teams need to move independently.
The Big Insight: These 4 architectures are not mutually exclusive. Every mature large organization (Netflix, Spotify, big banks) runs ALL 4 simultaneously, assigning each data type to the architecture that best fits its characteristics.
ALGORITHMS
Q18. Choose an appropriate data-mining algorithm for each scenario and justify: (a) classify email as spam, (b) group supermarket customers into segments, (c) predict tomorrow's electricity demand, (d) find frequently co-purchased product pairs, (e) detect fraudulent credit-card transactions.
(a) Classify email as spam
Task: Predictive – classification (known label: spam / not-spam).
Algorithm: Naive Bayes (classic baseline) or Logistic Regression. For higher accuracy: Random Forest or a Neural Network on text features.
Why: supervised classification with labeled training data; email text gives high-dimensional features; Naive Bayes is famously effective here.
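A from-scratch Naive Bayes classifier with Laplace smoothing fits in a few lines; the training emails below are invented, and a real system would use a library implementation on proper text features.

```python
import math
from collections import Counter

# Toy Naive Bayes spam sketch (hypothetical data, Laplace smoothing).
def train(docs):
    """docs: list of (word_list, label) with label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = Counter()
    for words, label in docs:
        counts[label].update(words)
        n_docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, n_docs, vocab

def classify(words, counts, n_docs, vocab):
    total = sum(n_docs.values())
    best, best_lp = None, -math.inf
    for label in ("spam", "ham"):
        lp = math.log(n_docs[label] / total)          # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for w in words:                               # log likelihoods, +1 smoothing
            lp += math.log((counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train(docs)
label = classify(["free", "money", "now"], *model)
```

Working in log space avoids numeric underflow when multiplying many small word probabilities.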
(b) Group supermarket customers into segments
Task: Descriptive – clustering (no predefined labels).
Algorithm: K-Means (most popular).
Why: unsupervised clustering of numeric customer features (RFM โ recency, frequency, monetary). Caveat: K-Means is sensitive to the choice of k (number of clusters); use the elbow method or silhouette score to choose.
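The K-Means loop itself (assign each point to its nearest center, then recompute centers as cluster means) is simple enough to sketch on 1-D spend values. Data, starting centers, and the `kmeans_1d` helper are made up for illustration.

```python
# Toy 1-D K-Means sketch (k = number of starting centers, hypothetical data).
def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:  # assignment step: nearest center wins
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

spend = [10, 12, 11, 90, 95, 88]          # low spenders vs high spenders
centers = kmeans_1d(spend, centers=[0.0, 100.0])
# the two centers settle at the two segment means
```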
(c) Predict tomorrow's electricity demand
Task: Predictive – time series forecasting (continuous target, ordered in time).
Algorithm: classical ARIMA, Exponential Smoothing, or modern LSTM neural networks.
Why: time-ordered data with seasonality (daily, weekly, yearly cycles). Needs a time-series-aware method, not standard regression.
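As a minimal illustration of smoothing-based forecasting, here is simple exponential smoothing – a deliberately simpler stand-in for ARIMA or Holt-Winters, on made-up demand values, and it ignores the seasonality a real demand model must capture.

```python
# Simple exponential smoothing sketch (hypothetical demand, alpha = 0.5).
def ses_forecast(series, alpha=0.5):
    """One-step-ahead forecast = smoothed level of the observed series."""
    level = series[0]
    for x in series[1:]:
        # new level = alpha * latest observation + (1 - alpha) * old level
        level = alpha * x + (1 - alpha) * level
    return level

demand = [100.0, 104.0, 96.0, 110.0]
forecast = ses_forecast(demand, alpha=0.5)
```

Higher `alpha` reacts faster to recent demand; lower `alpha` smooths noise more.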
(d) Find frequently co-purchased product pairs
Task: Descriptive – association rule mining (local method).
Algorithm: Apriori (classic) or FP-Growth (faster on big data).
Why: classic market basket analysis – the beer-and-nappies problem. Apriori finds rules of the form {A, B} → {C} with support and confidence measures.
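The support/confidence arithmetic behind such rules can be sketched for item pairs. This is a hand-rolled fragment of what Apriori computes, on invented baskets, without the candidate-pruning step that lets Apriori scale.

```python
from itertools import combinations
from collections import Counter

# Toy market-basket sketch: support and confidence for item pairs.
def pair_rules(baskets, min_support=0.4):
    n = len(baskets)
    item_count = Counter(i for b in baskets for i in set(b))
    pair_count = Counter(p for b in baskets
                         for p in combinations(sorted(set(b)), 2))
    rules = {}
    for (a, b), c in pair_count.items():
        support = c / n                    # fraction of baskets with both items
        if support >= min_support:
            rules[(a, b)] = (support, c / item_count[a])  # confidence of a -> b
    return rules

baskets = [["beer", "nappies"], ["beer", "nappies", "crisps"],
           ["beer", "crisps"], ["milk", "bread"], ["beer", "nappies"]]
rules = pair_rules(baskets)
# {beer} -> {nappies}: support 0.6, confidence 0.75
```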
(e) Detect fraudulent credit-card transactions
Task: Can be framed two ways:
- If you have labeled historical fraud examples → Predictive classification. Use Random Forest, Gradient Boosting, or Neural Networks. Imbalanced classes require resampling (SMOTE) or cost-sensitive learning.
- If fraud patterns are novel and unlabeled → Anomaly / outlier detection (descriptive, local method). Use Isolation Forest or One-Class SVM.
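As a toy illustration of the unsupervised framing, a z-score rule flags transactions far from the mean. This is a deliberately simpler stand-in for Isolation Forest, on invented amounts; real fraud detection uses many features, not just the amount.

```python
# Toy anomaly detection: flag amounts more than `threshold` standard
# deviations from the mean (hypothetical data and helper).
def zscore_outliers(amounts, threshold=3.0):
    n = len(amounts)
    mean = sum(amounts) / n
    var = sum((x - mean) ** 2 for x in amounts) / n
    std = var ** 0.5
    return [x for x in amounts if std and abs(x - mean) / std > threshold]

amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]
flagged = zscore_outliers(amounts, threshold=2.0)
# only the 500.0 transaction stands out
```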
Key takeaway – map task type to algorithm family:
- Predictive + known categories → Classification (trees, forests, neural nets, logistic regression)
- Predictive + numeric target → Regression (linear, trees, neural nets)
- Predictive + temporal → Time-series (ARIMA, LSTM)
- Descriptive + grouping → Clustering (K-Means, hierarchical)
- Descriptive + rules → Association (Apriori)
- Descriptive + outliers → Anomaly detection (Isolation Forest)
Final 30-Minute Checklist (Last Hour)
With ~30 minutes before the exam, review these fast-recall items:
- Data Mining = the process of selection, exploration, and modelling of large data sets to find unknown regularities, for business advantage.
- KDD = 5 steps: Selection → Preprocessing → Transformation → Data Mining → Interpretation.
- Data Warehouse = Subject-oriented, Integrated, Non-Volatile, Time-Variant (SIVT).
- Data Mart = thematic subset of a warehouse. 3 types: Dependent (best), Independent (risky), Hybrid.
- Data Lake = raw data, any format, schema-on-READ. Risk: data swamp.
- Data Mesh = 4 principles: Domain ownership · Data-as-product · Self-serve platform · Federated governance.
- Intelligence hierarchy: Query & Reporting → Data Retrieval → OLAP → Data Mining.
- Data pipeline: Warehouse → Webhouse → Mart → Matrix.
- 7-Phase DM process: Objectives → Select & Organise → Exploratory → Specify Methods → Analysis → Evaluate → Implement.
- 4 Implementation phases: Strategic → Training → Creation → Migration.
- SIVT: Subject, Integrated, Non-Volatile, Time-Variant
- Attribute types: Nominal, Binary, Ordinal, Numeric
- Visualization categories: Pixel-oriented, Geometric projection, Icon-based, Hierarchical
- Data types in a lake: Structured, Semi-structured, Unstructured, Streaming
- Data Mesh principles: Domain ownership, Data-as-product, Self-serve, Federated governance
- Finance panels: IBM (stock), Dollar (currency), Dow Jones (index), Gold (commodity)
"Why use a picture with colors instead of a simpler one?"
Memorize: Because the dataset has 16,350 daily observations across 4 variables – a line chart cannot show that density. Pixel-oriented viz + Peano-Hilbert curve keeps temporally close values spatially close, so stable periods become same-color blocks, shocks become sudden color shifts, and simultaneous color changes across all 4 panels reveal global events (crisis, policy, war) that a simple chart cannot show.
Strategy for the 40 minutes of writing
- Minute 0–3: Read ALL 5 questions. Note which ones you're most confident on.
- Minute 3–35: Answer each question in roughly 6–7 minutes. Write confident ones first.
- Minute 35–40: Re-read, fix typos, add missing examples.
- For every case-study question: (1) DEFINE the key term, (2) APPLY to the case, (3) GIVE an example, (4) STATE risks/trade-offs.
- If stuck: bullet points still earn marks. Don't leave blanks.
You've got this. Good luck!