Module 1 — ML Foundations

The core mental flip

In traditional programming, you write the rules. if temperature > 38: print("fever") — you, the human, know the rule and type it out.

In machine learning, you show examples and the machine finds the rules. You hand it 10,000 patient records labelled "fever / no fever" and it works out the threshold itself. You're not programming the logic — you're programming a process that learns the logic from data.
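To make the flip concrete, here is a minimal sketch rather than a real diagnostic. The patient records are synthetic, generated from a hidden 38-degree rule, and `learn_threshold` is a hypothetical helper that tries every candidate cut-off and keeps the one that makes the fewest mistakes on the labelled examples.

```python
import random

def learn_threshold(temps, labels):
    """Return the cut-off that misclassifies the fewest labelled examples."""
    best_cut, best_errors = None, len(temps) + 1
    for cut in sorted(set(temps)):
        # Predict "fever" when the temperature is strictly above the cut-off
        errors = sum((t > cut) != is_fever for t, is_fever in zip(temps, labels))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

rng = random.Random(0)
# Synthetic "patient records": temperatures in 0.1-degree steps from 35.0 to 41.0
temps = [round(rng.uniform(35.0, 41.0), 1) for _ in range(10_000)]
labels = [t > 38.0 for t in temps]   # the hidden rule; the learner only sees its results

learned_cut = learn_threshold(temps, labels)
print(learned_cut)
```

Nobody typed the number 38 into `learn_threshold`; it falls out of the labelled examples, which is the whole point.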

An LLM is the extreme version of this: nobody wrote rules for grammar or facts. It learned the patterns from billions of text examples.

The three ways machines learn

Supervised learning: the data comes with the right answers (labels). "Here are emails marked spam / not-spam, learn to predict." Most practical ML falls into this category, and so does the supervised fine-tuning (SFT) stage of an LLM.

Unsupervised learning: no labels, so the model has to find the structure itself, for example grouping customers into types nobody defined in advance. LLM pretraining is a close relative called self-supervised learning: nobody labels the text by hand, because the task is to predict the next word and the "label" is simply the next word in the text.
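A small sketch of how raw text turns into training pairs without anyone labelling it. The helper name `next_word_pairs` and the three-word context are illustrative assumptions; real LLMs work on subword tokens and far longer contexts.

```python
def next_word_pairs(text, context_size=3):
    """Turn raw text into (context, target) training pairs; no human labelling needed."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        target = words[i]          # the "label" is just the word that comes next
        pairs.append((context, target))
    return pairs

pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Six words of text yield three labelled examples for free, which is why this approach scales to billions of examples.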

Reinforcement learning: learn by trial and reward. This is the "RL" in RLHF (reinforcement learning from human feedback), where a model is rewarded for responses humans prefer.

The one idea everything hinges on

Every ML project follows the same path: data → split → train → evaluate → predict. The load-bearing step is the train/test split. You never judge a model on data it learned from — that's like grading students on the exact questions they studied. You hold out a chunk of data, train on the rest, then test on the held-out part to see if it truly generalises.
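The split itself is only a few lines. The sketch below does it by hand on stand-in data so the mechanics are visible; `holdout_split` is a hypothetical helper, and in practice scikit-learn's `train_test_split` does the same job.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rides = list(range(100))            # stand-ins for 100 ride records
train, test = holdout_split(rides)

print(len(train), len(test))
print(set(train) & set(test))       # empty set: no ride appears in both parts
```

Shuffling before cutting matters: if the rows happen to be sorted (by date, say), an unshuffled cut would test on a different kind of data than it trained on.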

This is why "data contamination" is a scandal in LLM benchmarks — if test questions leaked into training data, the scores mean nothing. Same principle, billion-dollar stakes.
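As a rough illustration of what a contamination check looks for, the sketch below flags test questions whose text appears verbatim inside a training document. The documents, questions, and function names are made up for the example, and exact substring matching is a simplifying assumption; real audits tend to use looser matching such as n-gram overlap.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't hide a match."""
    return " ".join(text.lower().split())

def contaminated(test_items, training_docs):
    """Return the test items whose normalised text appears inside any training document."""
    corpus = [normalise(doc) for doc in training_docs]
    return [item for item in test_items
            if any(normalise(item) in doc for doc in corpus)]

training_docs = [
    "Forum post: What is the capital of France?  Paris, obviously.",
    "A recipe for sourdough bread.",
]
test_items = ["What is the capital of France?", "What is 17 times 23?"]

leaked = contaminated(test_items, training_docs)
print(leaked)
```

Any question that shows up in `leaked` can no longer tell you whether the model generalises, only whether it remembers.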

Overfitting: the central trap

The best way to feel this is to run it, so I built a tiny experiment on real NYC taxi trips: predict a ride's fare from its distance and passenger count, using decision trees that get steadily more complex (controlled by their max_depth).

Because we're predicting a number, we score it with R² — read it as "how much of the variation in fare the model explains," where 1.0 is perfect and 0.0 is no better than always guessing the average fare. We check it twice: on training rides the model studied, and on test rides it has never seen.
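R² is simple enough to compute by hand, which helps the "1.0 is perfect, 0.0 is guessing the average" reading stick. The fares below are made up for the illustration, and `r_squared` is a hand-rolled helper that follows the same definition scikit-learn's `.score()` uses for regressors.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - (squared error of the model) / (squared error of always guessing the mean)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

fares = [10.0, 20.0, 30.0, 40.0]        # made-up fares, not taken from the dataset

perfect = r_squared(fares, fares)                   # every prediction exactly right
baseline = r_squared(fares, [25.0] * 4)             # always guess the average fare
worse = r_squared(fares, [40.0, 30.0, 20.0, 10.0])  # predictions run the wrong way
print(perfect, baseline, worse)
```

The third case shows that R² has no floor at zero: a model whose errors are larger than the guess-the-average baseline scores below 0.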

Here's what I got running it on the real data:

depth | train R² | test R²
  1   |  0.586   |  0.601   underfit — too simple
  2   |  0.759   |  0.787
  3   |  0.840   |  0.836
  5   |  0.884   |  0.875   ← test peaks: the sweet spot
  8   |  0.913   |  0.826   overfitting begins
 12   |  0.928   |  0.773
 none |  0.932   |  0.769   train high, test sagging

Don't take my word for it. Run it yourself: the random seed is fixed, so you should get the same numbers (small differences are possible if your library versions differ):

▶ Run this yourself in Colab

No setup — it opens in your browser, press play on each cell.

import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Real NYC taxi trips: a ~6,400-ride sample that seaborn downloads on first use
df = sns.load_dataset("taxis").dropna(subset=["distance", "passengers", "fare"])

X = df[["distance", "passengers"]]   # the trip
y = df["fare"]                       # what we predict (a number)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

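# Fit one tree per depth; None puts no limit on how deep the tree may grow
for depth in [1, 2, 3, 5, 8, 12, None]:
    m = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    # .score() returns R²: first on the rides it trained on, then on the unseen test rides
    print(depth, round(m.score(X_train, y_train), 3), round(m.score(X_test, y_test), 3))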

Run it and watch the two scores pull apart — that gap is overfitting:

Shallow tree (depth 1–2): too simple. It hasn't carved out enough detail to capture the pattern yet — this is underfitting.

Middle depth (around 5): the test score peaks at 0.875. The model has learned the real "longer ride → higher fare" pattern. This is the sweet spot.

Deep tree (depth 8 and beyond): training R² keeps climbing toward 0.93, but test R² drops — all the way down to 0.77. The tree has started memorising individual rides, including their random noise, instead of the general pattern. It does better and better on the trips it studied and worse on new ones.

The best model isn't the most complex one — it's the one at the peak of the test score. Past that point, extra complexity makes a model look smarter on paper and perform worse in reality.

That train-vs-test gap is what experienced practitioners watch obsessively. In one sentence: too simple is high bias, too complex is high variance, and the job is balancing the two.
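The same reading can be done mechanically. This sketch takes the numbers from the results table above, picks the depth with the highest test R², and prints the train-minus-test gap for each depth.

```python
# (depth, train R², test R²) copied from the results table above
results = [
    (1, 0.586, 0.601),
    (2, 0.759, 0.787),
    (3, 0.840, 0.836),
    (5, 0.884, 0.875),
    (8, 0.913, 0.826),
    (12, 0.928, 0.773),
    (None, 0.932, 0.769),
]

# The model to keep is the one with the highest test score, not the highest train score
best_depth, _, best_test = max(results, key=lambda row: row[2])

# The train-minus-test gap widens as the tree starts memorising
gaps = {depth: round(train - test, 3) for depth, train, test in results}

print(best_depth, best_test)
print(gaps)
```

One caveat worth knowing: once you use the test score to choose the depth, that score is slightly flattering. A common remedy is to choose on a separate validation split (or with cross-validation) and keep the test set for one final check.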

Want to see it get dramatic? In the notebook, change y = df["fare"] to y = df["tip"]. Tips depend on unpredictable human generosity, so a deep tree's test score can go negative — a model literally worse than guessing the average.

Interview questions

Difference between supervised and unsupervised learning?

Supervised uses labelled data to learn input→output mappings; unsupervised finds structure in unlabelled data. Strong answer: LLM pretraining is self-supervised — the data labels itself using the next token as the target.

Your model gets 99% on training but 80% on test. What's happening?

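Overfitting: it memorised noise. Fixes: simplify the model, add regularisation, or get more data. Cross-validation doesn't cure overfitting by itself, but it lets you detect it and pick the right complexity without relying on a single lucky split. Naming several fixes is what separates a strong answer.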
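If the interviewer follows up on cross-validation, it helps to be able to sketch the mechanics: every row takes one turn in the held-out fold. The helper below is a hand-rolled illustration without shuffling; scikit-learn's `KFold` and `cross_val_score` are what you would normally reach for.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra row so every row is used
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop

folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print(validation, "held out;", len(train), "rows to train on")
```

You train and score the model once per fold and average the validation scores, which gives a steadier estimate than one train/test split.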

Why do we need a test set at all?

To estimate performance on data the model has never seen — the only thing that matters in production. Scoring on training data is dishonestly optimistic.

What's the bias-variance tradeoff?

High bias = too simple, underfits. High variance = too complex, overfits. Total error ≈ bias² + variance + irreducible noise; you tune complexity to minimise the sum.
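Written out for squared-error loss, the decomposition looks like this. It relies on the standard textbook setup, which is an assumption added here rather than something stated above: the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more complex usually shrinks the first term and grows the second, and no model can touch the third.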
