Data Mining — Final Exam Study Guide

Your 5-Hour Study Plan

Follow this order. Don't skip the practice — solving questions is what makes you pass. Take a 5-minute break between blocks.

0:00–0:40

Section 1 — Thinking with Data. The 4 statistical tricks. Easy, important, and almost always on the exam.

0:40–1:00

Section 2 — Data Mining vs Machine Learning. Short but examiners love this comparison.

1:00–1:50

Sections 3 & 4 — Linear Regression + Error/Gradient Descent. The biggest topic. Learn the words and the MSE/RMSE idea.

1:50–2:40

Sections 5, 6 & 7 — Overfitting, Polynomial Regression, Parameters. The heart of the final exam.

2:40–3:10

Section 8 — Model Selection. Training/testing/validation, cross-validation, parsimony.

3:10–3:30

Section 9 — Decision Trees. Explainable vs powerful models.

3:30–4:40

Practice Zone. Do every case study and True/False. Write your answer first, THEN reveal.

4:40–5:00

Cheat Sheet + Last 30 Minutes checklist. Lock in the must-know facts.

How the exam looks: Based on your course materials, the final has case-study questions (a short story, then questions where you explain your reasoning) and True/False questions (where you also say why). There is no single trick answer; you earn marks by explaining clearly. Never leave a blank.

Learning to Think with Data

This topic is about data literacy — the skill of reading numbers without being fooled by them.

The famous quote: “There are three kinds of lies: lies, damned lies, and statistics.”
Why are statistics called the worst kind of lie? Because they are wrapped in the authority of math, so people trust them. The lie is rarely in the number itself. It hides in what was counted, how the sample was chosen, what scale was used, what comparison was implied, and what was left out.

data literacy = the ability to read, understand, and question data and statistics.

sample = the smaller group of people/things you actually measured (you can't measure everyone, so you measure a sample).

The 4 Tricks = The 4 Questions to Ask About Any Number

This is the most important table in this section. Each trick has a matching question you should always ask. The tricks come from the classic book “How to Lie with Statistics” (Darrell Huff, 1954).

#	The Trick	Stage it attacks	The Question to Ask
1	The Sample with the Built-In Bias	Collection (how data is gathered)	“Who is missing from this number?”
2	The Well-Chosen Average	Summary (how data is shortened)	“Which average do they mean — and which are they hiding?”
3	The Gee-Whiz Graph	Presentation (how data is drawn)	“What are the axes doing?”
4	Post Hoc — correlation ≠ causation	Interpretation (what it means)	“Is one thing really causing the other?”

The one question under all four: “What is this number trying to make me believe, and who benefits if I believe it?” If you remember only one line from this section, remember this one.

Trick 1 — The Sample with the Built-In Bias

The bias is not in the math. It's “in the door” — in who got to be in the sample.

The Yale 1924 example: A report said Yale's class of 1924 had an average income of $25,111 (a fortune back then). So Yale must make rich people! But ask: who replied? The number only included people who were (1) still findable 25 years later, (2) willing to fill in an income survey, and (3) proud of their answer. Every filter quietly removed the people with low incomes. It was the average of the findable, willing, and proud, not of the whole class.

You see this everywhere: online reviews (only people who felt strongly write them), election polls (only people who answered the phone), “95% of customers are happy” (the customers they chose to ask).

Trick 2 — The Well-Chosen Average

There are three different “averages,” and they can tell three different stories from the same data.

Type	What it is	Example value	Who likes to use it
Mean	Add everything, divide by how many. The boss's huge salary pulls it UP.	$35K	What the company advertises
Median	The middle value when sorted. Where the typical person actually sits.	$25K	What a labour union would cite
Mode	The most common value. (15 of 23 workers earn this.)	$25K	What “most workers” means

Rule of thumb to memorize: When a distribution is skewed (incomes, house prices, response times, social-media followers), the mean and median can disagree a lot. If only the mean is reported, ask why.

skewed = lopsided. A few very big values stretch the data to one side, dragging the mean away from the typical value.
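
To see the three averages disagree, here is a small sketch using Python's built-in statistics module. The salary list is hypothetical: it was constructed so that it reproduces the figures in the table (23 workers, 15 of them on $25K, one boss on a much larger salary), not taken from any real payroll.

```python
from statistics import mean, median, mode

# Hypothetical salaries in $K, built to match the table:
# 15 workers at 25, seven slightly above, and one boss at 200.
salaries = [25] * 15 + [30, 30, 30, 30, 35, 35, 40] + [200]

avg = mean(salaries)     # pulled up by the boss's salary
mid = median(salaries)   # the 12th of the 23 sorted values
most = mode(salaries)    # the most common value

print(f"mean={avg}  median={mid}  mode={most}")
```

Remove the 200 from the list and the mean falls to about 27.5 while the median and mode stay at 25. That is the skew at work.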

Trick 3 — The Gee-Whiz Graph

The same numbers can look boring or shocking depending on how the graph is drawn.

Same data, two charts: A modest 2% rise drawn two ways: with an honest y-axis starting at 0, it looks like an almost flat line. With a “gee-whiz” axis that starts at 99 and ends at 103, the same 2% looks like an explosion. Identical numbers, totally different feeling.

Always look at the axes:

Where does the y-axis start? If it doesn't start at zero, ask why.
Where does it end? Cropping the top can hide huge changes.
Linear or logarithmic? A log scale flattens dramatic differences.
What's the time window? Cherry-picked dates can flip the conclusion.
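
One way to put a number on the "feeling" is to ask how much of the chart's height the change takes up. The sketch below assumes the value rises from 100 to 102 (a 2% rise); the function name visible_share is made up for this illustration.

```python
def visible_share(start_value, end_value, axis_min, axis_max):
    """Fraction of the chart's height that the change occupies."""
    return (end_value - start_value) / (axis_max - axis_min)

# Assumed values: a rise from 100 to 102, drawn on two different y-axes
honest = visible_share(100, 102, axis_min=0, axis_max=103)
gee_whiz = visible_share(100, 102, axis_min=99, axis_max=103)

print(f"honest axis: {honest:.1%} of the height")
print(f"gee-whiz axis: {gee_whiz:.1%} of the height")
```

The same 2-point rise fills under 2% of the honest chart but half of the cropped one.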

Trick 4 — Post Hoc (Correlation ≠ Causation)

“Post hoc ergo propter hoc” is Latin for “after this, therefore because of this.” Just because two things move together does not mean one causes the other.

If X and Y move together, there are FOUR possibilities — and headlines usually pick only the first:

#	Possibility	Example
1	X causes Y (the assumed answer)	Smoking → lower grades?
2	Y causes X (the arrow runs the other way)	Stress from low grades → smoking?
3	Z causes both (a hidden third factor)	Family income → affects both?
4	Pure coincidence (with enough variables, things line up by luck)	Mozzarella eaten ↔ engineering PhDs awarded

Connects to AI / Machine Learning: Machine learning finds patterns. It does NOT understand causes. Algorithms find correlations; only humans can ask which way the arrow really runs. Always probe in three directions: Could the arrow run the other way? Could a hidden third factor be pulling both? Could this just be coincidence?
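
Possibility 3 (a hidden third factor) is easy to simulate. In this sketch, all data is randomly generated: X and Y never influence each other, yet they correlate strongly because both depend on Z. The helper pearson is written out by hand so the example needs nothing beyond the standard library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)                       # fixed seed for repeatability
z = [rng.gauss(0, 1) for _ in range(2000)]   # hidden factor Z
x = [zi + rng.gauss(0, 0.5) for zi in z]     # X depends only on Z
y = [zi + rng.gauss(0, 0.5) for zi in z]     # Y depends only on Z

r_raw = pearson(x, y)
# Subtract Z's contribution: only independent noise is left
r_without_z = pearson([xi - zi for xi, zi in zip(x, z)],
                      [yi - zi for yi, zi in zip(y, z)])

print(f"X vs Y: {r_raw:.2f}   after removing Z: {r_without_z:.2f}")
```

The first correlation is strong and the second is close to zero, even though nothing in the code makes X cause Y.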

📝 One-line summary for the exam

Data literacy means asking four questions of every number — who is missing (sample), which average is hidden (summary), what the axes are doing (presentation), and whether one thing really causes the other (interpretation) — and underneath all four, “what is this number trying to make me believe, and who benefits?”

Data Mining vs Machine Learning

These two are friends, but they are not the same thing. The exam may give you two studies and ask you to label which is which.

Dimension	Data Mining	Machine Learning
Core purpose	Discover patterns — “what is in this data?”	Make predictions — “what happens with new data?”
Typical output	An understandable pattern or report (e.g. customer clusters)	A trained model that scores new cases
How you judge it	A human interprets the result in business terms	Numerical metrics (accuracy, error)
Role of the human	Central — the human reads the meaning	Smaller — the process is partly automated
Time focus	Looks to the past (what already happened)	Looks to the future (what will happen)

The scope relationship (very important): Machine learning is an indispensable tool of data mining, but not the whole of it. Data mining is a broader umbrella that also uses statistics, visualisation, and database queries. Machine learning is also used outside data mining, in computer vision, natural language processing, and robotics. Neither one is a subset of the other.

Common exam trap: If a manager says “Machine learning completely replaces data mining,” that is false. ML is one powerful tool in the data-mining toolbox; data mining is larger than ML, and ML reaches into areas data mining does not.

Quick reminder: the learning types

Supervised learning = you have labelled data (the answer is given, e.g. “churned / did not churn”). Typical tasks: classification, regression.
Unsupervised learning = no labels; the model finds its own groups, and you don't know the number of groups in advance. Typical task: clustering.
Reinforcement learning = learning by trial and error using rewards and penalties. (Not the same as the two above.)

label = the known answer attached to each example (like “spam” or “not spam”). “Labelled data” = data that already has answers.

churn = when a customer stops using a service / leaves the company.

Linear Regression

Linear regression is a method to estimate a value — like the price of a house, a stock value, a person's life expectancy, or how long a user will watch a video.

The big idea (road through a town): Imagine the data points are houses in a town, and you want to build a straight road that passes as close as possible to all the houses, because everyone wants to live near the road. The goal of linear regression is to draw the line that passes as close to the points as possible.

The house-price example (learn this — it carries the whole topic)

Look at houses by their number of rooms and notice the pattern:

Rooms	1	2	3	4	5	6	7
Price	150	200	250	?	350	400	450

What price for the 4-room house? Most people say $300 — because each extra room adds $50, starting from a base of $100. You just did linear regression in your head! The rule is:

Price = 100 + 50 × (Number of rooms)
That is: base price + (price per room × rooms)

The 6 words you MUST know

Word	Simple meaning	In our example
Feature (variable)	A property we use to make the prediction	Number of rooms (also: size, age, crime rate…)
Label (outcome / target)	The thing we are trying to predict	The price of the house
Model	The rule / formula that turns features into a prediction	The equation for the price
Prediction	The number the model outputs	“$300 for a 4-room house”
Weight (coefficient)	How much each feature is multiplied by	$50 per room
Bias (intercept)	The base value, not attached to any feature	$100 base price

coefficient = another word for weight. intercept = another word for bias. Examiners switch between these words, so know both.

The 3 steps of the algorithm: Remember → Formulate → Predict

Remember: look at the prices of houses we already know.
Formulate: make a rule that estimates the price → Price = 100 + 50(rooms) + small error.
Predict: use the rule on a new house. For 4 rooms: 100 + 50×4 = $300.
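
The three steps can be sketched in a few lines of Python. The numbers come from the rooms/price table; the names known, BASE_PRICE, PRICE_PER_ROOM and predict_price are just labels chosen for this illustration.

```python
# Remember: houses whose prices we already know (rooms -> price)
known = {1: 150, 2: 200, 3: 250, 5: 350, 6: 400, 7: 450}

# Formulate: the rule read off the pattern
BASE_PRICE = 100       # bias / intercept
PRICE_PER_ROOM = 50    # weight / coefficient

def predict_price(rooms):
    return BASE_PRICE + PRICE_PER_ROOM * rooms

# In this toy table the rule reproduces every known price exactly
assert all(predict_price(r) == p for r, p in known.items())

# Predict: apply the rule to the house we have not seen
print(predict_price(4))
```

Real data is never this tidy, which is why the rule normally carries a small error term.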

Why “+ small error”? The model estimates a price, so it will almost always be a little off; it is very hard to hit the exact price. That's normal. Training means finding the model that makes the smallest errors.

More than one feature: Multivariate Linear Regression

Real houses depend on many features. The model just adds more terms:

Price = 30(rooms) + 1.5(size) + 10(school quality) − 2(age) + 50

Notice the signs of the weights:

Positive weight = positively correlated: more rooms, bigger size, better schools → higher price.
Negative weight = negatively correlated: older house → lower price (so age has a minus sign).
Weight of zero = the feature is irrelevant to the price.
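
Here is the multivariate formula as code, as a rough sketch. The weights and bias are the ones in the formula above; the example house (4 rooms, size 150, school quality 8, age 10) is invented, and the guide does not state the units, so treat the outputs as illustrative only.

```python
# Weights and bias from the formula above
WEIGHTS = {"rooms": 30, "size": 1.5, "school_quality": 10, "age": -2}
BIAS = 50

def predict_price(house):
    """Weighted sum of the features plus the bias."""
    return BIAS + sum(WEIGHTS[name] * house[name] for name in WEIGHTS)

# Hypothetical house, and the same house ten years older
house = {"rooms": 4, "size": 150, "school_quality": 8, "age": 10}
older = dict(house, age=20)

print(predict_price(house))   # 30*4 + 1.5*150 + 10*8 - 2*10 + 50
print(predict_price(older))   # the negative weight on age lowers the price
```

Ageing the house by 10 years changes the prediction by 10 × (−2) = −20, which is exactly what a negative weight means.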

multivariate = “many variables.” More than one feature is used to make the prediction.

How does the computer actually draw the line?

With six houses it's easy. With thousands of houses, we use the linear regression algorithm:

The algorithm in one breath: (1) Start with any random line. (2) Find the best direction to move it a little bit closer to the points. (3) Move it a little. (4) Repeat many times → the line slowly fits the data.

The computer moves the line in two ways: rotate it (change the slope/weight) and translate it up or down (change the intercept/bias).

Written with symbols, the model for our simple example is:

p̂ = m·r + b
where p̂ = predicted price, m = price per room (weight), r = number of rooms (feature), b = base price (bias)

Worked example of “improving” the line: Say the model is p̂ = 40·r + 50. A real 2-room house costs $150, but the model predicts 40×2 + 50 = 130, which is too low. So we nudge both numbers up a little: price per room +0.50, base +1 → new model p̂ = 40.5·r + 51, which now predicts 132. Closer to 150 → a better model for that point. Repeat this thousands of times = training.
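
The single nudge in the worked example looks like this in code. The step sizes (0.5 and 1) are the ones used in the example; the "prediction too high" branch is an assumption added for symmetry and is not part of the worked example.

```python
def predict(m, b, rooms):
    return m * rooms + b

m, b = 40.0, 50.0          # current model: p̂ = 40·r + 50
rooms, actual = 2, 150     # the real 2-room house

before = predict(m, b, rooms)
if before < actual:        # too low: raise slope and intercept a little
    m, b = m + 0.5, b + 1.0
elif before > actual:      # too high: lower them (assumed mirror case)
    m, b = m - 0.5, b - 1.0
after = predict(m, b, rooms)

print(before, after)
```

One nudge only helps a little. Training repeats it over many points and many rounds.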

Measuring Error + Gradient Descent

To improve a model, the computer needs a number that says “how bad are you right now?” That number comes from an error function.

Error function: An error function is a metric that tells us how the model is doing. A big value = bad model (line far from points); a small value = good model (line close to points). It is also called a loss function or a cost function: three names, same idea.

The 3 error formulas

Squared Error = (y_i − ŷ_i)²
For ONE point: (real value − predicted value), then squared.

MSE = (1/n) · Σ (y_i − ŷ_i)²
Mean Squared Error = the average of all the squared errors (n = number of points).

RMSE = √[ (1/n) · Σ (y_i − ŷ_i)² ]
Root Mean Squared Error = the square root of the MSE.

Both MSE and RMSE give an idea of how much error the model makes in a prediction. Lower is better.

Why bother with RMSE? If you predict house prices in dollars, the MSE comes out in dollars squared, a strange unit no one understands. Taking the square root brings it back to dollars. So if RMSE = $10,000, you can expect the model to be off by about $10,000 on a typical prediction. That's why RMSE is easier to interpret.

y_i = the real value of point i. ŷ_i (“y-hat”) = the predicted value. Σ (sigma) = “add them all up.” n = how many points.
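
The formulas translate almost word for word into Python. The three real prices and three predictions below are made-up numbers chosen so the arithmetic is easy to check by hand.

```python
from math import sqrt

def mse(actual, predicted):
    """Mean Squared Error: average of (real - predicted) squared."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the MSE."""
    return sqrt(mse(actual, predicted))

actual = [150, 200, 250]      # hypothetical real prices
predicted = [140, 210, 230]   # hypothetical model outputs

# Errors are 10, -10, 20 -> squares 100, 100, 400 -> mean 200
print(mse(actual, predicted))
print(round(rmse(actual, predicted), 2))
```

The MSE of 200 is in "dollars squared"; the RMSE of about 14 is back in dollars and close to the size of a typical miss.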

Gradient Descent — climbing down “Mount Errorest”

Finding the line with the smallest error is called minimizing the function. The trick we use is gradient descent.

The mountain analogy: Imagine you are on top of a foggy mountain called “Mount Errorest” and you want to get to the bottom (lowest error). The fog means you can only see about one metre around you. What do you do? You look around, find the direction that goes down the most, take one small step, and repeat. Step by step, you reach the bottom.

That's exactly how the computer trains a regression model:

Start with any line.
Find the best direction to move the line a little, using the RMSE function.
Move the line a little in that direction.
Repeat many times.
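
The four steps above can be sketched as a short loop on the house data from the linear regression section. Two assumptions to note: this sketch descends the MSE rather than the RMSE (both are lowest at the same line, and the MSE has the simpler slope formula), and the learning rate of 0.02 and the 5,000 repetitions are values picked for this small data set, not values from the course.

```python
rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]

def mse(m, b):
    return sum((p - (m * r + b)) ** 2 for r, p in zip(rooms, prices)) / len(rooms)

def gradient(m, b):
    """Slope of the MSE with respect to m and b (the uphill direction)."""
    n = len(rooms)
    dm = sum(-2 * r * (p - (m * r + b)) for r, p in zip(rooms, prices)) / n
    db = sum(-2 * (p - (m * r + b)) for r, p in zip(rooms, prices)) / n
    return dm, db

m, b = 0.0, 0.0              # 1. start with any line
learning_rate = 0.02         # assumed step size
start_error = mse(m, b)

for _ in range(5000):        # 4. repeat many times
    dm, db = gradient(m, b)  # 2. find the direction (uphill)
    m -= learning_rate * dm  # 3. take a small step downhill
    b -= learning_rate * db

print(round(m, 2), round(b, 2), round(mse(m, b), 6))
```

The loop ends very close to the rule found by eye earlier: about 50 per room and a base of about 100. A learning rate that is too large makes the steps overshoot and the error grow instead of shrink.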

Small honest note: Gradient descent doesn't always find the exact lowest point, but in practice it gets very close, and it is fast and effective.

Underfitting & Overfitting

This is the heart of the final exam. Imagine you build a model, put it to work, and the predictions are bad. What went wrong? Usually one of these two problems.

The exam-studying analogy (gold; use it in your answers):
Underfitting = you didn't study enough. The model is too simple and never learned the data.

Overfitting = you memorized the whole textbook word-for-word instead of understanding it. The model is too complex; it memorizes the data instead of learning the pattern, so it fails on new questions.

A good model = you studied properly. It learned the real pattern and can answer new, unseen questions.

Said another way, you can make two mistakes: oversimplify the problem (too simple) or overcomplicate it (too complex).

The key fact: simple vs complex

Very simple models tend to UNDERFIT.
Very complex models tend to OVERFIT.
The goal = a model that is neither too simple nor too complex, one that captures the essence of the data.

The real danger of overfitting: The real problem isn't that an overfit model fits the training data badly; it fits it perfectly! The problem is it does not generalise to new data. It memorized every point without understanding the pattern, so on fresh data its predictions look horrible.

generalise = to work well on new, unseen data — not just on the data the model was trained on. This is the whole point of a good model.

The three-models picture (memorize this)

Model	What it does	Diagnosis
Model 1 — degree 1 (a line)	Too simple. A straight line trying to fit curved data.	🔴 Underfitting
Model 2 — degree 2 (a parabola)	Fits the data well; follows the trend without chasing every point.	🟢 Good model
Model 3 — degree 10 (wild curve)	Passes through every single point but swings wildly; learns noise, not signal.	🟡 Overfitting

noise = the random, meaningless wobble in real data. We want the model to learn the signal (the true pattern), not the noise.

Polynomial Regression

Linear regression draws a straight line. But what if the data is curved (nonlinear)? Then we use polynomial regression — a powerful extension that can bend.

nonlinear = not shaped like a straight line; curved.

polynomial = a formula built from powers of a variable: 1, x, x², x³, … Each higher power lets the curve bend more.

Degree = the highest power. It controls how much the curve bends.

Degree	Example	Shape	Bends
0	y = 4	Flat horizontal line	0
1	y = 3x + 2	Straight slanted line	0
2	y = x² − 2x + 5	Parabola (one bend)	1
3	y = 2x³ + 8x² − 40	Cubic (two bends)	2

Rule to remember: A polynomial of degree d (with d ≥ 1) draws a curve that bends (oscillates) at most d − 1 times. So higher degree = more wiggles = more complex = more risk of overfitting.

⭐ The trickiest exam idea: polynomial regression is STILL linear regression

A teammate might insist: “Polynomial regression makes a curved line, so it can't be the same kind of model as linear regression!” They are wrong. Here's the simple explanation:

Why it's still “linear”: Polynomial regression just adds new columns (x², x³, and so on), all built from the original variable x. The model is still a weighted sum of these columns (each multiplied by a coefficient and added). So it is linear in its coefficients, and it is fitted with the exact same least-squares method as linear regression. The curve comes from the powered inputs, not from any change in how the coefficients are found.

linear in its coefficients = the weights are just multiplied and added (never multiplied by each other or put inside powers). That's the technical reason it counts as “linear” regression even when the line is curved.
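
A short sketch makes "weighted sum of columns" concrete. The helper names poly_features and predict are made up for this illustration; the coefficients are those of y = x² − 2x + 5 from the degree table, listed constant term first.

```python
def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree built from one input value."""
    return [x ** k for k in range(degree + 1)]

def predict(coefficients, x):
    """Weighted sum: each column times its coefficient, then added."""
    features = poly_features(x, degree=len(coefficients) - 1)
    return sum(c * f for c, f in zip(coefficients, features))

coeffs = [5, -2, 1]            # y = 5 - 2x + x^2
print(poly_features(3, 2))     # the columns for x = 3
print(predict(coeffs, 3))      # 5*1 + (-2)*3 + 1*9
```

Notice that predict never squares or multiplies the coefficients themselves. Only the inputs are raised to powers, which is why the model stays linear in its coefficients.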

The caveat: you must choose the degree BEFORE training

Do we want a line (degree 1), a parabola (degree 2), a cubic (degree 3), or a wild degree 50? A human can “eyeball” the right shape from a scatterplot, but a computer cannot eyeball — it must try many degrees and pick the best one. How it picks the best one is the next section (Model Selection).

Parameters vs Hyperparameters

These two words sound alike and are easy to mix up. There is a simple rule to tell them apart.

	Parameter	Hyperparameter
What it is	The numbers the model learns by itself	The settings you choose before training
Examples	Weights and bias (coefficients)	Polynomial degree, learning rate
When it is set	DURING training (the model creates/modifies it)	BEFORE training (you set it as a knob)

The one-line rule: Set it BEFORE training → it's a hyperparameter. Created/changed DURING training → it's a parameter.

Why it matters: Choosing the right hyperparameters is very important. Pick them badly and you push the model toward a bad fit: too high a polynomial degree pushes it toward overfitting; too low a degree, toward underfitting.

learning rate = a hyperparameter that controls how big each step is during gradient descent (how fast the line moves while training).

Model Selection: Testing, Validation & Cross-Validation

The problem we must solve first

The computer can only measure error (MSE/RMSE). But here's the catch:

Model	Error on the data it was trained on
Model 1 (underfit)	Large
Model 2 (good)	Small
Model 3 (overfit)	Zero! (passes through every point)

The trap: If we judge only by training error, the computer thinks the overfit Model 3 is perfect (its error is zero). That's wrong! We need a way to expose overfitting. The answer is testing.

The solution: split your data into a Training set and a Testing set

Training set vs Testing set: Training set = the points we use to build (train) the model. Testing set = a small set of points we hold back and do NOT train on, used only to check how the model does on data it has never seen.

Exam analogy: The textbook has 100 practice questions. You pick 80 to study (look up the answers, learn them); that's the training set. You save the other 20 to test yourself, answering them without looking; that's the testing set. If you can answer the 20 unseen questions, you truly learned; if you only memorized the 80, you'll fail the 20.
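
The 80/20 split from the analogy can be sketched with the standard library. The function name train_test_split and its parameters are chosen for this illustration (machine-learning libraries offer their own versions); shuffling first is an assumption that guards against the questions being sorted by difficulty or topic.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items, then hold back a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

questions = list(range(1, 101))          # practice questions 1..100
train, test = train_test_split(questions)

print(len(train), len(test))
print(set(train) & set(test))            # no question is in both sets
```

The key property is that the two sets never overlap: a question you studied cannot also be used to test you.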

⭐ The most important table in the whole course

This is how testing exposes each problem. Memorize this pattern.

Model	Training Error	Testing Error	Diagnosis
Model 1 (too simple)	High	High	🔴 Underfitting
Model 2 (just right)	Low	Low	🟢 Good model
Model 3 (too complex)	Very Low	High	🟡 Overfitting

The two signatures (say these in the exam):
Underfitting = error is high on BOTH training and testing (the model is too simple to learn even the data it saw).
Overfitting = error is low on training but HIGH on testing (the tell-tale gap between a tiny training error and a big testing error).

Validation set & the model-complexity idea

To choose the best degree, we hold out a validation set and compare how each candidate model does on it. Then we pick the degree where the validation error is lowest.

Why training error can't choose the degree: Training error keeps dropping (it never rises) as you add more polynomial terms, so it will always favour the most complex model. Only held-out data (validation/testing) reveals whether a model truly generalises.

The Principle of Parsimony: When two models perform about equally well on validation, always choose the simpler (lower-degree) one. It is easier to explain and less likely to overfit. (“Parsimony” means preferring the simplest explanation.)

Worked example — reading a training vs validation table

A team fits polynomials of increasing degree and records RMSE (lower = better) on the training set and on a separate validation set:

Degree	Training RMSE	Validation RMSE	Reading
1	142	145	Both high → underfit
2	95	98	Both still high → underfit
3	71	73	Within 1 of the lowest validation, simple → BEST
4	68	72	About tied with 3 → 3 wins (parsimony)
6	60	88	Gap opening → overfitting starts
9	41	130	Big gap → overfit
12	22	210	Huge gap → severe overfit

How to answer this: Underfitting: degrees 1 and 2 (error high on both columns). Overfitting: degrees 6, 9, 12 (training keeps falling while validation shoots up, which is the widening gap). Best model: degree 3 (its validation error of 73 is within one point of the lowest, 72 at degree 4, and it is the simpler of the two). The intern who picks degree 12 “because training RMSE is lowest (22)” is wrong: training error keeps falling as complexity rises, so it can't reveal overfitting.
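
The selection rule can be written down as a small function. The numbers are copied from the table above; the tolerance of 2 RMSE points used to decide what counts as "about tied" is an assumption for this sketch, since the course does not give a threshold.

```python
# (degree, training RMSE, validation RMSE) copied from the table above
results = [
    (1, 142, 145), (2, 95, 98), (3, 71, 73), (4, 68, 72),
    (6, 60, 88), (9, 41, 130), (12, 22, 210),
]

def select_degree(results, tolerance):
    """Lowest degree whose validation error is within tolerance of the best."""
    best_val = min(val for _, _, val in results)
    near_best = [deg for deg, _, val in results if val - best_val <= tolerance]
    return min(near_best)

by_training = min(results, key=lambda row: row[1])[0]   # the intern's choice
chosen = select_degree(results, tolerance=2)            # validation + parsimony

print(by_training, chosen)
```

With a tolerance of 0 the function returns degree 4 (strictly lowest validation error); allowing a small tie margin is what lets parsimony pick degree 3.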

k-fold Cross-Validation

What it is: k-fold cross-validation splits the data into k equal parts. Each part takes a turn as the test fold while the model trains on the other k − 1 parts. The error is then averaged across all folds.

Why it's better: it doesn't judge a model on a single lucky/unlucky split. It gives a more honest estimate of performance on unseen data, and it exposes degrees that only looked good because they happened to overfit one particular split.
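
Here is a minimal sketch of how the folds are formed. It only produces the index lists; training a model on each fold and averaging the errors is left out. The function name k_fold_indices is made up for this illustration, and real use would normally shuffle the data first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(n=10, k=5))
for train, test in folds:
    print("test fold:", test, " train on:", train)
```

Every point lands in a test fold exactly once, so no single lucky or unlucky split decides the result.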

Decision Trees: Explainable vs Powerful

Sometimes the best model isn't the most accurate one — it's the one you can explain. This trade-off is a favourite exam case study.

	Simple model (Decision Tree / Linear Regression)	Complex “Black Box” model
Reasoning	Readable rules: “this condition + this condition → reject”	Hidden — no one can fully explain why it decided
Accuracy	Usually a bit lower	Usually higher
Can you justify a decision?	✅ Yes — you can show and visualise the rule	❌ Often no

black box = a model whose inner reasoning we can't see or explain; we only see its input and output.

interpretable = a human can understand and explain how the model reached its decision.

Decision tree — advantages and disadvantages

Advantages: Understandable model structure · can be visualised · needs no normalisation of the data · works with both numerical and categorical data.

Disadvantages: It is unstable (small changes in the data can change the tree a lot) · it is prone to overfitting.

normalisation = rescaling features to a common range (e.g. 0 to 1) so one big-numbered feature doesn't dominate. Trees don't need this; many other models do.

categorical data = data in categories/labels (like “red / green / blue” or “yes / no”), not numbers.

The bank loan example (the classic case)

The tension: A bank must choose between an interpretable model (you can tell a rejected customer exactly why) and a more accurate black-box model (better predictions, but no explanation). The regulator requires a justification for every rejection. Because the decision tree is interpretable and can be visualised, it directly meets that legal duty, so the black-box model can fall foul of regulation, even though it is more accurate.
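
To see why a tree is easy to justify, here is a tiny hand-written one as plain if/else rules. Everything in it is invented for illustration: the features (income, debt ratio, missed payments) and every threshold are assumptions, not a real bank's policy and not a tree learned from data.

```python
def loan_decision(income, debt_ratio, missed_payments):
    """Return (decision, reason). All thresholds are invented for illustration."""
    if missed_payments > 2:
        return "reject", "more than 2 missed payments"
    if debt_ratio > 0.4:
        if income < 30_000:
            return "reject", "debt ratio above 0.4 and income below 30,000"
        return "approve", "debt ratio above 0.4 but income at least 30,000"
    return "approve", "at most 2 missed payments and debt ratio at most 0.4"

decision, reason = loan_decision(income=25_000, debt_ratio=0.5, missed_payments=0)
print(decision, "because", reason)
```

Each rejection comes with the exact path of conditions that led to it, which is the kind of justification the regulator asks for. A black-box model would return only the decision.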

Is there a cost to choosing the simple tree? Yes — you give up some accuracy, which becomes unacceptable when accuracy is critical (e.g. detecting a serious disease). A middle path: use the powerful model for the prediction and a simple model to explain the decision afterwards — trying to get the best of both worlds.

Practice Zone

How to use this: Read each question and write or say your full answer first. Only then read the answer. The marks come from explaining your reasoning, so practise the explanation out loud, not just the final word.

Case Study 1 — Housing Prices

You are a data scientist at a real-estate company. You fit a simple straight-line (linear) model to predict house price from size. On new houses you notice the predictions are consistently wrong for very small and very large houses — the price increase actually slows down for huge properties, but your line assumes a constant rate.

Q: What problem is your model suffering from, and what model would you recommend instead — and why?

Problem: the model is underfitting. A straight line is too simple to capture the curved (nonlinear) relationship between size and price, where the rate of increase slows for large houses.

Recommendation: use polynomial regression (likely a quadratic, degree 2). A polynomial can bend, so it can follow the curve and capture the slowing price increase for larger properties. Start with a low degree and only increase it if validation performance justifies it.

Case Study 2 — Reading a Training vs Validation Table

An analyst fits polynomials of increasing degree to the same data and records RMSE (lower = better):

Degree	Training RMSE	Validation RMSE
1	142	145
2	95	98
3	71	73
4	68	72
6	60	88
9	41	130
12	22	210

(a) Which degrees underfit? (b) Which overfit, and what pattern tells you? (c) Which one would you deploy? (d) The intern says degree 12 is best because it has the lowest training RMSE (22). Why is that misleading?

(a) Underfitting: degrees 1 and 2. Their error is large on both training and validation (142/145 and 95/98) — the model is too simple to capture the pattern even on data it was fit to.

(b) Overfitting: degrees 6, 9, and 12. The tell-tale sign is the widening gap between a small/shrinking training error and a much larger validation error (degree 12 drops to 22 on training but jumps to 210 on validation).

(c) Deploy degree 3 (degree 4 is arguable). Validation RMSE is lowest and flattest there (73 at degree 3, 72 at degree 4), so it generalises best; degree 3 is the simpler of the two near-tied options (principle of parsimony).

(d) Training RMSE always keeps dropping as the model grows more complex, so it rewards complexity for its own sake and cannot reveal overfitting. Only performance on held-out data shows whether the model learned a general pattern or just memorised the training sample.

Case Study 3 — Linear or Polynomial? (choosing the starting point)

Problem A: predicting salary from years of experience — the scatter looks roughly straight, and the residuals from a straight-line fit scatter randomly around zero.
Problem B: predicting engine fuel efficiency from engine speed (RPM) — the scatter rises, peaks, then falls, and the residuals form a clear upside-down-U.

(a) For each, would you start with linear or polynomial regression, and why? (b) A colleague wants to jump straight to a degree-10 polynomial for Problem B. What's the risk, and what's better?

(a) Problem A → linear regression. The trend is straight and the residuals show no leftover pattern, so a line is adequate. Problem B → polynomial regression. The rise-peak-fall shape and the curved (upside-down-U) residual pattern show a straight line systematically misses the structure.

(b) Jumping to degree 10 risks overfitting: such a flexible curve can chase noise and swing wildly, giving good training error but poor performance on new data. Better to start low (degree 2 or 3, which can already make a single peak) and increase the degree only if validation performance justifies it.

Case Study 4 — Data Mining or Machine Learning? (TelCom)

A telecom company runs two studies.
Study A: scans 5 years of subscriber data without labels, asking “what natural customer groups exist?” Four clusters emerge; the team interprets one as “high-spending but close to contract expiry” and writes a report.
Study B: using those clusters as an input, a model is trained on subscribers labelled “churned / did not churn,” and for each new subscriber it outputs “62% probability of churning next month,” feeding an automated campaign.

(a) Classify A and B (data mining vs machine learning) across at least four dimensions. (b) Rebut a manager who says “machine learning completely replaces data mining.”

(a) Study A is data mining: purpose = pattern discovery; question = “what is in this data?”; output = an interpretable pattern/report (four clusters); evaluation = human interpretation in business terms; the human role is central; it looks to the past. It uses clustering = unsupervised (no labels, number of groups unknown in advance).

Study B is machine learning: purpose = prediction/generalisation; question = “what happens with new data?”; output = a trained model that scores new cases; evaluation = numerical metrics; the process is partly automated; it looks to the future. It uses classification = supervised (trained on the “churned / did not churn” label).

(b) Rebuttal: Machine learning is an indispensable tool of data mining, but not the whole of it. Data mining is a broader umbrella that also includes statistics, visualisation, and database queries; machine learning is also used in computer vision, NLP, and robotics. Neither is a subset of the other, so ML cannot “replace” data mining.

Case Study 5 — Explainable or Powerful? (bank loans)

Your bank debates two loan-approval models: (1) a simple decision tree / linear regression giving readable rules like “this condition + this condition → reject,” which can be shown to auditors; (2) a more accurate but black-box model whose reasoning no one can explain. The regulator requires that every rejected customer be given a justification.

Q: Which should the bank choose, and why? Acknowledge the cost of your choice. Is there a middle path?

Choose the interpretable model (decision tree / linear regression). Because it is interpretable and can be visualised, it directly meets the regulator's duty to justify every rejection. The black-box model, although more accurate, can't easily explain its outputs and so can fall foul of regulation.

Advantages of the tree: understandable structure, can be visualised, needs no normalisation, works with numerical and categorical data. Disadvantages: unstable under small data changes and prone to overfitting.

The cost of choosing simple: you give up some accuracy — which becomes unacceptable where accuracy is critical (e.g. medical diagnosis). Middle path: use the powerful model for the actual decision and a simple model to produce the explanation — aiming for the best of both worlds.

Case Study 6 — The “Press Release” (data literacy)

A startup's investor pitch makes four claims:
① “96% of customers said they were satisfied.” (Survey was a pop-up shown right after a successful purchase.)
② “Our employees' average income is 85,000 — well above industry.” (2 founders, 3 managers, 25 field workers; most field workers earn ~30,000.)
③ A chart titled “Our revenue has exploded!” — the y-axis runs from 99 to 103, the real rise is 2%.
④ “Among users who downloaded our app, satisfaction rose 40%. So our app increases satisfaction.”

(a) Match each claim to one of the four tricks. (b) What is the common thread, and who benefits?

(a) Matching:

① → Biased Sample (Trick 1). Only happy, just-succeeded users who were still in the app got asked — like the “findable, willing, proud” Yale graduates. Ask: who is missing?
② → Well-Chosen Average (Trick 2). The mean is pulled up by the few highly paid founders and managers; the median (about 30,000) represents the typical worker. Ask: which average, and which is hidden?
③ → Gee-Whiz Graph (Trick 3). The y-axis starts at 99, not 0, so a 2% rise looks like an explosion. Ask: what are the axes doing?
④ → Correlation ≠ Causation / Post Hoc (Trick 4). People who download the app are already more interested (hidden third factor); the arrow could even run the other way. Ask: is one really causing the other?

(b) Common thread: all four use the mathematical correctness of a number to make the company look more successful, generous, or fast-growing than it really is. The company seeking investment benefits if the investor believes them. The lie hides not in the number but in what is counted, who is asked, which scale is chosen, and what is left out.
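Claims ② and ③ can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the founder and manager salaries are hypothetical values picked so that the mean comes out at the quoted 85,000 (the case study gives only the head-counts and the ~30,000 field wage), and the revenue figures 100 and 102 are assumed stand-ins for "a 2% rise". The helper `drawn_height_ratio` is a made-up name.

```python
from statistics import mean, median

# Claim 2: which average? Head-counts come from the case study;
# the founder/manager salaries are HYPOTHETICAL, chosen to give a mean of 85,000.
salaries = [600_000] * 2 + [200_000] * 3 + [30_000] * 25

average_income = mean(salaries)    # the figure quoted in the pitch
typical_income = median(salaries)  # what the middle employee actually earns

print("mean:  ", average_income)
print("median:", typical_income)


# Claim 3: what are the axes doing?
def drawn_height_ratio(before, after, axis_start):
    """How many times taller the 'after' bar is drawn than the 'before' bar."""
    return (after - axis_start) / (before - axis_start)


# ASSUMED revenue values: 100 rising to 102, i.e. a real rise of 2%.
truncated = drawn_height_ratio(100, 102, axis_start=99)  # axis starts at 99
honest = drawn_height_ratio(100, 102, axis_start=0)      # axis starts at 0

print("bar ratio with axis from 99:", truncated)
print("bar ratio with axis from 0: ", honest)
```

With these assumed numbers the mean and the median differ by a wide margin, and the same 2% rise is drawn several times taller when the axis starts at 99 instead of 0.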

True / False — with reasons

Decide True or False and say why, then reveal.

1. A polynomial regression model of degree 1 is the same as a simple linear regression model.
TRUE — a first-degree polynomial, b₀ + b₁x, is exactly a straight-line model.
2. Underfitting occurs when a model is so flexible that it captures the random noise in the training data.
FALSE — that describes overfitting. Underfitting is when the model is too simple to capture the pattern.
3. An overfit model typically has low error on the training data but high error on new, unseen data.
TRUE — fitting the noise gives low training error but the learned wiggles don't transfer to new data.
4. Increasing the degree of a polynomial can only ever improve a model's ability to generalise to new data.
FALSE — beyond a point, higher degree captures noise and generalisation gets worse.
5. If a model performs poorly on BOTH the training and the validation data, this is a sign of underfitting.
TRUE — poor on both means the model is too simple → underfitting.
6. The model with the lowest training error is always the best choice for predicting on new data.
FALSE — lowest training error usually favours the most complex model, which may overfit; held-out error is the right guide.
7. Adding more polynomial terms will never increase the error measured on the training set.
TRUE — for least-squares polynomial fits, training error falls or stays equal as terms are added (it never goes up).
8. When two competing models give nearly the same validation error, the higher-degree model should always be preferred.
FALSE — when validation error is about equal, prefer the simpler (lower-degree) model — the principle of parsimony.
9. RMSE is reported in the same units as the thing being predicted, which makes it easier to interpret than MSE.
TRUE — MSE is in squared units; taking the square root returns the original unit (e.g. dollars).
10. Polynomial regression cannot be fitted with the same least-squares procedure as linear regression because its line is curved.
FALSE — it just adds powered columns (x, x², …); the model stays linear in its coefficients and uses the same least-squares method.
11. A feature whose weight (coefficient) is zero has no effect on the prediction.
TRUE — a zero weight multiplies the feature by 0, so it adds nothing to the prediction; the model treats that feature as irrelevant.
12. The polynomial degree is a parameter the model learns by itself during training.
FALSE — the degree is a hyperparameter you choose before training. The weights and bias are the learned parameters.
13. Machine learning understands the causes behind the data, so it can prove that one variable causes another.
FALSE — ML finds patterns/correlations, not causes. Only humans can ask which way the causal arrow runs.
14. Machine learning is just one tool used within data mining, and neither is a subset of the other.
TRUE — ML is a key tool of data mining but also used elsewhere (vision, NLP, robotics); the two overlap without either containing the other.
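
Statements 7 and 10 can be checked numerically. The following is a minimal, standard-library-only sketch under stated assumptions: the nine data points are made up, and `fit_polynomial`, `predict`, and `mse` are hypothetical helper names, not anything from the course. It fits degrees 1 to 4 with one and the same least-squares procedure (just adding powered columns) and records the training MSE for each degree.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit by solving the normal equations (A^T A) w = A^T y."""
    # Design matrix: one column per power of x (1, x, x^2, ...).
    rows = [[x ** p for p in range(degree + 1)] for x in xs]
    n = degree + 1
    # Augmented system [A^T A | A^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution gives the weights b0, b1, ..., b_degree.
    weights = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * weights[j] for j in range(i + 1, n))
        weights[i] = (aug[i][n] - known) / aug[i][i]
    return weights


def predict(weights, x):
    return sum(w * x ** p for p, w in enumerate(weights))


def mse(weights, xs, ys):
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# MADE-UP training data: roughly y = x^2 plus small fixed "noise".
xs = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.1, -0.3, 0.4, -0.1]
ys = [x ** 2 + e for x, e in zip(xs, noise)]

degrees = [1, 2, 3, 4]
train_errors = [mse(fit_polynomial(xs, ys, d), xs, ys) for d in degrees]

for d, err in zip(degrees, train_errors):
    print(f"degree {d}: training MSE = {err:.4f}")
```

The printed training errors should never go up as the degree rises, which is why training error alone cannot tell you when a model has started to overfit.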

🎯

One-Page Cheat Sheet

The 4 data-literacy questions

① Who is missing? (sample) ② Which average is hidden? (summary) ③ What are the axes doing? (presentation) ④ Is one thing really causing the other? (interpretation). Under all four: what does this number want me to believe, and who benefits?

Linear regression vocabulary

Feature = input · Label = what we predict · Weight/coefficient = multiplier · Bias/intercept = base value · model formula p̂ = m·r + b · steps = Remember → Formulate → Predict.
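
The formula p̂ = m·r + b can be tried out on a toy example. This is a hedged sketch, not the course's own code: it assumes r is the number of rooms and p̂ the predicted price, the six houses are made-up data, and `fit_line` is a hypothetical helper that uses the standard closed-form least-squares solution for a single feature.

```python
def fit_line(features, labels):
    """Closed-form least-squares fit of labels = m * feature + b."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    # Weight (slope): how much the label changes per unit of the feature.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels)) / sum(
        (x - mean_x) ** 2 for x in features
    )
    # Bias (intercept): the base value when the feature is 0.
    b = mean_y - m * mean_x
    return m, b


# Remember: MADE-UP houses (feature = rooms, label = price in thousands).
rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]

# Formulate: learn the weight m and the bias b from the data.
m, b = fit_line(rooms, prices)

# Predict: apply the formula to a house the model has not seen.
predicted_price = m * 4 + b

print(f"weight m = {m}, bias b = {b}")
print(f"predicted price for 4 rooms = {predicted_price}")
```

The three comments in the second half mirror the Remember → Formulate → Predict steps: look at the known data, turn it into a formula, then use the formula on a new case.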

Correlation signs

Positive weight = positively correlated (↑ feature ↑ price) · Negative weight = negatively correlated (e.g. age) · Zero weight = irrelevant feature.

Error metrics

MSE = average of squared errors · RMSE = √MSE (same units as the target, easier to read; RMSE = $10k means predictions are typically about $10k off). Error function = loss function = cost function. Gradient descent = the step-by-step method for minimising it (climbing down “Mount Errorest”).
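
To see the error metrics and gradient descent working together, here is a small sketch under stated assumptions: the house data are made up (prices in thousands of dollars), the learning rate and number of steps are arbitrary choices that happen to work for this data, and the function name `mse` is illustrative only.

```python
import math

# MADE-UP data: rooms (feature) and price in thousands of dollars (label).
rooms = [1, 2, 3, 5, 6, 7]
prices = [155, 197, 244, 356, 407, 448]


def mse(m, b):
    """Mean squared error of the line price = m * rooms + b."""
    return sum((m * r + b - p) ** 2 for r, p in zip(rooms, prices)) / len(rooms)


m, b = 0.0, 0.0        # start at an arbitrary point on the error "mountain"
learning_rate = 0.01   # hyperparameter: size of each downhill step (assumed value)
initial_mse = mse(m, b)

for _ in range(5000):
    errors = [m * r + b - p for r, p in zip(rooms, prices)]
    # Slope of the MSE with respect to the weight and the bias.
    grad_m = 2 * sum(e * r for e, r in zip(errors, rooms)) / len(rooms)
    grad_b = 2 * sum(errors) / len(rooms)
    # Step downhill: move against the slope.
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

final_mse = mse(m, b)          # in (thousands of dollars) squared
rmse = math.sqrt(final_mse)    # back in thousands of dollars

print(f"m = {m:.2f}, b = {b:.2f}")
print(f"MSE went from {initial_mse:.1f} to {final_mse:.1f}; RMSE = {rmse:.2f}")
```

The RMSE is the easier number to read because it is in the same unit as the price, whereas the MSE is in squared units.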

Underfitting vs Overfitting — the signatures

Underfit = too simple = error HIGH on training AND testing. Overfit = too complex = error LOW on training but HIGH on testing (the gap!). Good model = LOW on both.

Polynomial regression

For curved data. Degree = highest power; bends ≤ (degree − 1). Still linear in its coefficients → same least-squares method. Choose the degree before training (it's a hyperparameter).

Parameters vs hyperparameters

Set before training → hyperparameter (degree, learning rate). Learned during training → parameter (weights, bias).

Model selection

Split into training + testing; use a validation set to pick the degree (lowest validation error). Training error never rises as complexity grows → it can't reveal overfitting. Ties → pick the simpler model (parsimony). k-fold cross-validation = split the data into k parts, let each part take one turn as the test set, then average the k errors → honest estimate.
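
The k-fold procedure can be sketched in plain Python. Everything below is an illustrative assumption rather than the course's method: the ten data points are made up, the folds are formed by taking every k-th point (real tools usually shuffle first), and `fit_mean`, `fit_line`, and `k_fold_mse` are hypothetical helper names. It compares a constant model (degree 0) with a straight line (degree 1).

```python
def fit_mean(xs, ys):
    """Degree-0 model: always predict the average label."""
    avg = sum(ys) / len(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Degree-1 model: closed-form least-squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - m * mean_x
    return lambda x: m * x + b


def k_fold_mse(xs, ys, k, fit):
    """Average held-out MSE over k folds; each fold is the test set once."""
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        # Train only on the other k - 1 parts.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Measure error only on the part that was held out.
        squared = [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# MADE-UP data that roughly follows y = 2x.
xs = list(range(1, 11))
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

cv_mean_model = k_fold_mse(xs, ys, k=5, fit=fit_mean)
cv_line_model = k_fold_mse(xs, ys, k=5, fit=fit_line)

print(f"5-fold MSE, constant model: {cv_mean_model:.3f}")
print(f"5-fold MSE, line model:     {cv_line_model:.3f}")
```

The model with the lower averaged held-out error is the one to prefer; if two models come out about equal, parsimony says to keep the simpler one.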

Decision trees

+ interpretable, visualisable, no normalisation, handles numerical + categorical. − unstable, overfits. Choose interpretable when you must justify decisions (regulation); black box is accurate but unexplainable.

Data mining vs ML

Mining = discover patterns, past, human reads meaning. ML = predict, future, automated. ML is a tool of mining; neither is a subset of the other. Clustering = unsupervised; classification = supervised.
