Everything you need to understand the final — even if you have never opened the course before. Written in simple English, built for 5 hours of focused study.
Follow this order. Don't skip the practice — solving questions is what makes you pass. Take a 5-minute break between blocks.
This topic is about data literacy — the skill of reading numbers without being fooled by them.
This is the most important table in this section. Each trick has a matching question you should always ask. The tricks come from the classic book “How to Lie with Statistics” (Darrell Huff, 1954).
| # | The Trick | Stage it attacks | The Question to Ask |
|---|---|---|---|
| 1 | The Sample with the Built-In Bias | Collection (how data is gathered) | “Who is missing from this number?” |
| 2 | The Well-Chosen Average | Summary (how data is shortened) | “Which average do they mean — and which are they hiding?” |
| 3 | The Gee-Whiz Graph | Presentation (how data is drawn) | “What are the axes doing?” |
| 4 | Post Hoc — correlation ≠ causation | Interpretation (what it means) | “Is one thing really causing the other?” |
The bias is not in the math. It's “in the door” — in who got to be in the sample.
You see this everywhere: online reviews (only people who felt strongly write them), election polls (only people who answered the phone), “95% of customers are happy” (the customers they chose to ask).
There are three different “averages,” and they can tell three different stories from the same data.
| Type | What it is | Example value | Who likes to use it |
|---|---|---|---|
| Mean | Add everything, divide by how many. The boss's huge salary pulls it UP. | $35K | What the company advertises |
| Median | The middle value when sorted. Where the typical person actually sits. | $25K | What a labour union would cite |
| Mode | The most common value. (15 of 23 workers earn this.) | $25K | What “most workers” means |
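To see how one data set gives different "averages", here is a minimal sketch. The payroll list is hypothetical: it was invented only so that it matches the numbers in the table (23 workers, 15 of them on $25K, one very large salary).

```python
from statistics import mean, median, mode

# Hypothetical payroll in $K, invented so that it reproduces the table:
# 23 workers, 15 of them on $25K, and one very large salary at the top.
salaries = [20] * 3 + [25] * 15 + [30, 40, 50, 50, 200]

print("mean  :", mean(salaries))    # pulled up by the single 200
print("median:", median(salaries))  # the 12th of 23 sorted values
print("mode  :", mode(salaries))    # the most common value
```

Removing the 200 from the list and rerunning shows how much one salary moves the mean while the median and mode stay put.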
The same numbers can look boring or shocking depending on how the graph is drawn.
Always look at the axes:
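One common axis trick is starting the vertical axis above zero. The sketch below uses made-up sales figures and a hypothetical helper, `apparent_ratio`, to show how much taller the second bar looks in each case.

```python
def apparent_ratio(small, large, axis_start):
    """How many times taller the larger bar looks when the axis starts at axis_start."""
    return (large - axis_start) / (small - axis_start)

last_year, this_year = 100, 104   # hypothetical figures: a real rise of 4%

honest = apparent_ratio(last_year, this_year, axis_start=0)
truncated = apparent_ratio(last_year, this_year, axis_start=98)

print(f"axis from 0 : second bar looks {honest:.2f}x as tall")
print(f"axis from 98: second bar looks {truncated:.2f}x as tall")
```

The data is identical in both cases; only the starting point of the axis changes the impression.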
“Post hoc ergo propter hoc” is Latin for “after this, therefore because of this.” Just because two things move together does not mean one causes the other.
If X and Y move together, there are FOUR possibilities — and headlines usually pick only the first:
| # | Possibility | Example |
|---|---|---|
| 1 | X causes Y (the assumed answer) | Smoking → lower grades? |
| 2 | Y causes X (the arrow runs the other way) | Stress from low grades → smoking? |
| 3 | Z causes both (a hidden third factor) | Family income → affects both? |
| 4 | Pure coincidence (with enough variables, things line up by luck) | Mozzarella eaten ↔ engineering PhDs awarded |
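Possibility 4 can be simulated. The sketch below is an illustration with random numbers, not real data: it compares one random "target" series against 1000 unrelated random series and reports the strongest correlation found. The series length, the number of candidates, and the seed are all arbitrary choices.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)   # fixed seed so the run is repeatable
years = 10
target = [rng.random() for _ in range(years)]   # stands in for "PhDs awarded"

# 1000 unrelated random series, standing in for cheese eaten, rainfall, and so on.
candidates = [[rng.random() for _ in range(years)] for _ in range(1000)]
best = max(abs(pearson(c, target)) for c in candidates)

print(f"strongest correlation found by luck: {best:.2f}")
```

None of the candidate series has any link to the target, yet with this many to choose from, at least one usually lines up strongly.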
Data mining and machine learning are friends, but they are not the same thing. The exam may give you two studies and ask you to label which is which.
| Dimension | Data Mining | Machine Learning |
|---|---|---|
| Core purpose | Discover patterns — “what is in this data?” | Make predictions — “what happens with new data?” |
| Typical output | An understandable pattern or report (e.g. customer clusters) | A trained model that scores new cases |
| How you judge it | A human interprets the result in business terms | Numerical metrics (accuracy, error) |
| Role of the human | Central — the human reads the meaning | Smaller — the process is partly automated |
| Time focus | Looks to the past (what already happened) | Looks to the future (what will happen) |
Linear regression is a method to estimate a value — like the price of a house, a stock value, a person's life expectancy, or how long a user will watch a video.
Look at houses by their number of rooms and notice the pattern:
| Rooms | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Price | 150 | 200 | 250 | ? | 350 | 400 | 450 |
What price for the 4-room house? Most people say $300, because each extra room adds $50, starting from a base of $100. You just did linear regression in your head! The rule is: price = $100 + $50 × (number of rooms).
| Word | Simple meaning | In our example |
|---|---|---|
| Feature (variable) | A property we use to make the prediction | Number of rooms (also: size, age, crime rate…) |
| Label (outcome / target) | The thing we are trying to predict | The price of the house |
| Model | The rule / formula that turns features into a prediction | The equation for the price |
| Prediction | The number the model outputs | “$300 for a 4-room house” |
| Weight (coefficient) | How much each feature is multiplied by | $50 per room |
| Bias (intercept) | The base value, not attached to any feature | $100 base price |
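The vocabulary maps directly onto a few lines of code. This is a minimal sketch of the rooms example; the names `WEIGHT_PER_ROOM`, `BASE_PRICE`, and `predict_price` are made up for illustration.

```python
WEIGHT_PER_ROOM = 50   # weight (coefficient): price added per room
BASE_PRICE = 100       # bias (intercept): price before counting any rooms

def predict_price(rooms):
    """Model: turns the feature (rooms) into a prediction (price)."""
    return WEIGHT_PER_ROOM * rooms + BASE_PRICE

# Labels we already know, taken from the rooms table (rooms -> price).
known = {1: 150, 2: 200, 3: 250, 5: 350, 6: 400, 7: 450}

print(all(predict_price(r) == p for r, p in known.items()))  # model matches every known label
print(predict_price(4))                                      # prediction for the missing house
```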
Real houses depend on many features. The model just adds more terms:
Notice the signs of the weights:
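A sketch of a model with several features is shown below. Every weight and feature name here is hypothetical and chosen only to show one positive, one negative, and one zero weight; none of them come from real housing data.

```python
# Hypothetical weights, invented for illustration only.
weights = {
    "rooms": 30,        # positive: more rooms, higher price
    "size_m2": 1.5,     # positive: bigger house, higher price
    "age_years": -2,    # negative: older house, lower price
    "door_number": 0,   # zero: the feature has no effect on the prediction
}
bias = 50

def predict_price(features):
    """Weighted sum of the features plus the bias."""
    return bias + sum(weights[name] * value for name, value in features.items())

house = {"rooms": 4, "size_m2": 100, "age_years": 10, "door_number": 17}
print(predict_price(house))
```

Changing `door_number` leaves the prediction unchanged, while raising `age_years` lowers it: the sign of each weight says which way the feature pushes the price.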
With six houses it's easy. With thousands of houses, we use the linear regression algorithm:
The computer moves the line in two ways: rotate it (change the slope/weight) and translate it up or down (change the intercept/bias).
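Written with symbols, the model for our simple example is: p̂ = m·r + b, where r is the number of rooms, m = 50 is the weight (slope), b = 100 is the bias (intercept), and p̂ is the predicted price.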
To improve a model, the computer needs a number that says “how bad are you right now?” That number comes from an error function.
Both MSE and RMSE give an idea of how much error the model makes in a prediction. Lower is better.
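As a rough illustration, MSE (the average of the squared errors) and RMSE (its square root) can be computed by hand for the rooms example. The "poor" line below is a hypothetical, slightly wrong model added only for comparison.

```python
from math import sqrt

def mse(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: same units as the prices themselves."""
    return sqrt(mse(actual, predicted))

rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]

good = [50 * r + 100 for r in rooms]   # the rule from the rooms example
poor = [60 * r + 80 for r in rooms]    # a hypothetical, slightly wrong line

print("good line:", mse(prices, good), rmse(prices, good))
print("poor line:", round(mse(prices, poor), 2), round(rmse(prices, poor), 2))
```

The good line scores zero because it passes through every known house; the poor line is penalised most for the houses where it is furthest off, since the errors are squared.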
Finding the line with the smallest error is called minimizing the function. The trick we use is gradient descent.
That's exactly how the computer trains a regression model:
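One way to sketch this training loop is plain gradient descent on the MSE, using the six known houses from the rooms example. The learning rate and the number of steps are assumptions picked so that this small example settles; they are settings you choose, not values the model learns.

```python
rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]

def train(xs, ys, learning_rate=0.02, steps=5000):
    """Fit price = m * rooms + b by repeatedly stepping downhill on the MSE."""
    m, b = 0.0, 0.0                      # start with a flat line at zero
    n = len(xs)
    for _ in range(steps):
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        grad_m = 2 / n * sum(e * x for e, x in zip(errors, xs))   # slope of MSE with respect to m
        grad_b = 2 / n * sum(errors)                              # slope of MSE with respect to b
        m -= learning_rate * grad_m      # rotate the line
        b -= learning_rate * grad_b      # translate the line up or down
    return m, b

m, b = train(rooms, prices)
print(f"m = {m:.2f}, b = {b:.2f}")
```

With these settings the loop ends very close to m = 50 and b = 100, the same rule found by eye. A much larger learning rate would make the steps overshoot and the error grow instead of shrink.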
This is the heart of the final exam. Imagine you build a model, put it to work, and the predictions are bad. What went wrong? Usually it is one of two problems: underfitting or overfitting.
Said another way, you can make two mistakes: oversimplify the problem (too simple) or overcomplicate it (too complex).
| Model | What it does | Diagnosis |
|---|---|---|
| Model 1 — degree 1 (a line) | Too simple. A straight line trying to fit curved data. | 🔴 Underfitting |
| Model 2 — degree 2 (a parabola) | Fits the data well; follows the trend without chasing every point. | 🟢 Good model |
| Model 3 — degree 10 (wild curve) | Passes through every single point but swings wildly; learns noise, not signal. | 🟡 Overfitting |
Linear regression draws a straight line. But what if the data is curved (nonlinear)? Then we use polynomial regression — a powerful extension that can bend.
| Degree | Example | Shape | Bends |
|---|---|---|---|
| 0 | y = 4 | Flat horizontal line | 0 |
| 1 | y = 3x + 2 | Straight slanted line | 0 |
| 2 | y = x² − 2x + 5 | Parabola (one bend) | 1 |
| 3 | y = 2x³ + 8x² − 40 | Cubic (two bends) | 2 |
A teammate might insist: “Polynomial regression makes a curved line, so it can't be the same kind of model as linear regression!” They are wrong. Here's the simple explanation:
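A small sketch can make the point concrete. It treats 1, x, and x² as three ordinary features and fits them with ordinary least squares, using noise-free points from the parabola y = x² − 2x + 5. It assumes NumPy is available; the variable names are illustrative.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x**2 - 2 * x + 5            # a degree-2 curve, no noise

# Treat 1, x and x^2 as three separate features (columns).
features = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: the same fitting problem as linear regression.
coefficients, *_ = np.linalg.lstsq(features, y, rcond=None)

print(np.round(coefficients, 6))   # bias, weight on x, weight on x^2
```

The curve bends, but the model is still "weights times features plus bias". Only the features changed, which is why the same fitting method applies.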
Do we want a line (degree 1), a parabola (degree 2), a cubic (degree 3), or a wild degree 50? A human can “eyeball” the right shape from a scatterplot, but a computer cannot eyeball — it must try many degrees and pick the best one. How it picks the best one is the next section (Model Selection).
These two words sound alike and are easy to mix up. There is a simple rule to tell them apart.
| | Parameter | Hyperparameter |
|---|---|---|
| What it is | The numbers the model learns by itself | The settings you choose before training |
| Examples | Weights and bias (coefficients) | Polynomial degree, learning rate |
| When it is set | DURING training (the model creates/modifies it) | BEFORE training (you set it as a knob) |
The computer can only measure error (MSE/RMSE). But here's the catch:
| Model | Error on the data it was trained on |
|---|---|
| Model 1 (underfit) | Large |
| Model 2 (good) | Small |
| Model 3 (overfit) | Zero! (passes through every point) |
This is how testing exposes each problem. Memorize this pattern.
| Model | Training Error | Testing Error | Diagnosis |
|---|---|---|---|
| Model 1 (too simple) | High | High | 🔴 Underfitting |
| Model 2 (just right) | Low | Low | 🟢 Good model |
| Model 3 (too complex) | Very Low | High | 🟡 Overfitting |
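The pattern in this table can be reproduced on a toy data set. The sketch below assumes NumPy is available and uses made-up data: a true curve y = x² with alternating +1/−1 "noise" on the training points, and test points that sit between them on the true curve. The degrees 1, 2, and 10 mirror Models 1 to 3.

```python
import numpy as np
from numpy.polynomial import Polynomial

def rmse(model, x, y):
    """Root mean squared error of a fitted model on the points (x, y)."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

# Made-up training data: a true curve y = x^2 plus alternating +1/-1 "noise".
x_train = np.arange(11.0)                                  # 0, 1, ..., 10
noise = np.where(np.arange(11) % 2 == 0, 1.0, -1.0)
y_train = x_train**2 + noise

# Test points sit between the training points and follow the true curve.
x_test = np.arange(10.0) + 0.5
y_test = x_test**2

results = {}
for degree in (1, 2, 10):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (rmse(model, x_train, y_train), rmse(model, x_test, y_test))
    print(f"degree {degree:2d}: train RMSE {results[degree][0]:.3f}, "
          f"test RMSE {results[degree][1]:.3f}")
```

The degree-10 curve passes through all 11 training points, so its training error is essentially zero, but it swings between them and does badly on the test points. The straight line is poor on both sets, and the parabola is good on both.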
To choose the best degree, we hold out a validation set and compare how each candidate model does on it. Then we pick the degree where the validation error is lowest.
A team fits polynomials of increasing degree and records RMSE (lower = better) on the training set and on a separate validation set:
| Degree | Training RMSE | Validation RMSE | Reading |
|---|---|---|---|
| 1 | 142 | 145 | Both high → underfit |
| 2 | 95 | 98 | Both still high → underfit |
| 3 | 71 | 73 | Validation near its lowest, simpler model → BEST |
| 4 | 68 | 72 | Lowest validation by 1, about tied with 3 → 3 wins (parsimony) |
| 6 | 60 | 88 | Gap opening → overfitting starts |
| 9 | 41 | 130 | Big gap → overfit |
| 12 | 22 | 210 | Huge gap → severe overfit |
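The reading of this table can be written as a small rule: find the lowest validation RMSE, then take the simplest degree that comes close to it. In the sketch below, the function name `pick_degree` and the tolerance of 2 RMSE units are assumptions made for illustration; how close counts as "tied" is a judgment call.

```python
# (degree, training RMSE, validation RMSE) copied from the table above.
results = [
    (1, 142, 145), (2, 95, 98), (3, 71, 73), (4, 68, 72),
    (6, 60, 88), (9, 41, 130), (12, 22, 210),
]

def pick_degree(rows, tolerance=2):
    """Simplest degree whose validation RMSE is within `tolerance` of the best."""
    best_validation = min(val for _, _, val in rows)
    near_best = [deg for deg, _, val in rows if val - best_validation <= tolerance]
    return min(near_best)

print(pick_degree(results))               # parsimony breaks the near-tie
print(pick_degree(results, tolerance=0))  # strict minimum only
```

With the tolerance, the rule returns degree 3; with a strict minimum it returns degree 4. Training RMSE is never consulted, because it keeps falling all the way to degree 12.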
Why k-fold cross-validation is better than a single validation split: it doesn't judge a model on one lucky or unlucky split. It gives a more honest estimate of performance on unseen data, and it exposes degrees that only looked good because they happened to overfit one particular split.
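A minimal sketch of k-fold cross-validation follows, assuming NumPy is available. The data is made up (a curved trend plus small wiggles), and the folds are assigned by position to keep the run repeatable; in practice the rows are usually shuffled first. The helper name `k_fold_rmse` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

def k_fold_rmse(x, y, degree, k=4):
    """Average held-out RMSE over k folds for one polynomial degree."""
    fold_of = np.arange(len(x)) % k          # simple deterministic fold labels
    scores = []
    for fold in range(k):
        held_out = fold_of == fold           # this fold takes its turn as the test part
        model = Polynomial.fit(x[~held_out], y[~held_out], degree)
        errors = model(x[held_out]) - y[held_out]
        scores.append(np.sqrt(np.mean(errors**2)))
    return float(np.mean(scores))

# Made-up data: a curved trend plus small deterministic wiggles.
x = np.arange(12.0)
y = x**2 + np.sin(7 * x)

for degree in (1, 2, 3):
    print(degree, round(k_fold_rmse(x, y, degree), 2))
```

Every point is used for testing exactly once and for training k − 1 times, so no single split decides the result.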
Sometimes the best model isn't the most accurate one — it's the one you can explain. This trade-off is a favourite exam case study.
| | Simple model (Decision Tree / Linear Regression) | Complex “Black Box” model |
|---|---|---|
| Reasoning | Readable rules: “this condition + this condition → reject” | Hidden — no one can fully explain why it decided |
| Accuracy | Usually a bit lower | Usually higher |
| Can you justify a decision? | ✅ Yes — you can show and visualise the rule | ❌ Often no |
Is there a cost to choosing the simple tree? Yes — you give up some accuracy, which becomes unacceptable when accuracy is critical (e.g. detecting a serious disease). A middle path: use the powerful model for the prediction and a simple model to explain the decision afterwards — trying to get the best of both worlds.
Problem: the model is underfitting. A straight line is too simple to capture the curved (nonlinear) relationship between size and price, where the rate of increase slows for large houses.
Recommendation: use polynomial regression (likely a quadratic, degree 2). A polynomial can bend, so it can follow the curve and capture the slowing price increase for larger properties. Start with a low degree and only increase it if validation performance justifies it.
| Degree | Training RMSE | Validation RMSE |
|---|---|---|
| 1 | 142 | 145 |
| 2 | 95 | 98 |
| 3 | 71 | 73 |
| 4 | 68 | 72 |
| 6 | 60 | 88 |
| 9 | 41 | 130 |
| 12 | 22 | 210 |
(a) Underfitting: degrees 1 and 2. Their error is large on both training and validation (142/145 and 95/98) — the model is too simple to capture the pattern even on data it was fit to.
(b) Overfitting: degrees 6, 9, and 12. The tell-tale sign is the widening gap between a small/shrinking training error and a much larger validation error (degree 12 drops to 22 on training but jumps to 210 on validation).
(c) Deploy degree 3 (degree 4 is arguable). Validation RMSE is lowest and flattest there (73 at degree 3, 72 at degree 4), so it generalises best; degree 3 is the simpler of the two near-tied options (principle of parsimony).
(d) Training RMSE keeps dropping (or at least never rises) as the model grows more complex, so it rewards complexity for its own sake and cannot reveal overfitting. Only performance on held-out data shows whether the model learned a general pattern or just memorised the training sample.
(a) Problem A → linear regression. The trend is straight and the residuals show no leftover pattern, so a line is adequate. Problem B → polynomial regression. The rise-peak-fall shape and the curved (upside-down-U) residual pattern show a straight line systematically misses the structure.
(b) Jumping to degree 10 risks overfitting: such a flexible curve can chase noise and swing wildly, giving good training error but poor performance on new data. Better to start low (degree 2 or 3, which can already make a single peak) and increase the degree only if validation performance justifies it.
(a) Study A is data mining: purpose = pattern discovery; question = “what is in this data?”; output = an interpretable pattern/report (four clusters); evaluation = human interpretation in business terms; the human role is central; it looks to the past. It uses clustering = unsupervised (no labels, number of groups unknown in advance).
Study B is machine learning: purpose = prediction/generalisation; question = “what happens with new data?”; output = a trained model that scores new cases; evaluation = numerical metrics; the process is partly automated; it looks to the future. It uses classification = supervised (trained on the “churned / did not churn” label).
(b) Rebuttal: Machine learning is an indispensable tool of data mining, but not the whole of it. Data mining is a broader umbrella that also includes statistics, visualisation, and database queries; machine learning is also used in computer vision, NLP, and robotics. Neither is a subset of the other, so ML cannot “replace” data mining.
Choose the interpretable model (decision tree / linear regression). Because it is interpretable and can be visualised, it directly meets the regulatory duty to justify every rejection. The black-box model, although more accurate, cannot easily explain its outputs and so can fall foul of the regulation.
Advantages of the tree: understandable structure, can be visualised, needs no normalisation, works with numerical and categorical data. Disadvantages: unstable under small data changes and prone to overfitting.
The cost of choosing simple: you give up some accuracy — which becomes unacceptable where accuracy is critical (e.g. medical diagnosis). Middle path: use the powerful model for the actual decision and a simple model to produce the explanation — aiming for the best of both worlds.
(a) Matching:
(b) Common thread: all four use the mathematical correctness of a number to make the company look more successful, generous, or fast-growing than it really is. The company seeking investment benefits if the investor believes them. The lie hides not in the number but in what is counted, who is asked, which scale is chosen, and what is left out.
Decide True or False and say why.
① Who is missing? (sample) ② Which average is hidden? (summary) ③ What are the axes doing? (presentation) ④ Is one thing really causing the other? (interpretation). Under all four: what does this number want me to believe, and who benefits?
Feature = input · Label = what we predict · Weight/coefficient = multiplier · Bias/intercept = base value · model formula p̂ = m·r + b · steps = Remember → Formulate → Predict.
Positive weight = positively correlated (↑ feature ↑ price) · Negative weight = negatively correlated (e.g. age) · Zero weight = irrelevant feature.
MSE = average of squared errors · RMSE = √MSE (same units as the target, easier to read; RMSE = $10k means a typical prediction is roughly $10k off). Error function = loss function = cost function. Gradient descent is the method that minimises it (climbing down “Mount Errorest”).
Underfit = too simple = error HIGH on training AND testing. Overfit = too complex = error LOW on training but HIGH on testing (the gap!). Good model = LOW on both.
Polynomial regression: for curved data. Degree = highest power; bends ≤ (degree − 1). Still linear in its coefficients → same least-squares method. Choose the degree before training (it's a hyperparameter).
Set before training → hyperparameter (degree, learning rate). Learned during training → parameter (weights, bias).
Split into training + testing; use a validation set to pick the degree (lowest validation error). Training error always falls with complexity → can't reveal overfitting. Ties → pick the simpler model (parsimony). k-fold cross-validation = k parts, each takes a turn as test, average the errors → honest estimate.
Decision tree: + interpretable, visualisable, no normalisation, handles numerical + categorical. − unstable, overfits. Choose an interpretable model when you must justify decisions (regulation); a black box is usually more accurate but hard to explain.
Mining = discover patterns, past, human reads meaning. ML = predict, future, automated. ML is a tool of mining; neither is a subset of the other. Clustering = unsupervised; classification = supervised.
Tick each one once you can say it from memory.
Remember the exam rules: define the term, apply it to the case, give an example, and never leave a blank. Explaining your reasoning is where the marks are. Good luck!