AI from scratch

Module 02 · Classic Algorithms

The workhorses — and how close simple gets

Regression, decision trees, random forests and boosting: the models that still dominate real interviews, and a live experiment showing the dead-simple baseline land within a whisker of the fanciest model.

Why a future GenAI engineer needs these: structured data is everywhere, interviews test them relentlessly, and every idea here — loss functions, the softmax output — gets inherited by neural networks. Master these and deep learning becomes "the same ideas, scaled up."

The two foundations

Linear regression predicts a number. It fits the straight line that best matches your data: "given square footage, predict house price." What it's really teaching you (fit a function to data by minimising error) is exactly what neural networks scale to billions of parameters.
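
A minimal sketch of "fit a line by minimising error", using NumPy's least-squares solver. The square-footage and price numbers are invented for illustration and deliberately noise-free, so you can check the answer by eye (price is exactly 0.2 × square footage).

```python
import numpy as np

# Hypothetical toy data: square footage and price in $1000s (not real listings).
sqft = np.array([800.0, 1000.0, 1200.0, 1500.0, 2000.0])
price = np.array([160.0, 200.0, 240.0, 300.0, 400.0])

# Design matrix: one column for the feature, one column of ones for the intercept.
A = np.column_stack([sqft, np.ones_like(sqft)])

# Least squares picks the slope and intercept that minimise the sum of squared errors.
(slope, intercept), *_ = np.linalg.lstsq(A, price, rcond=None)

# Use the fitted line to predict the price of an unseen 1800 sq ft house.
predicted = slope * 1800.0 + intercept
print(f"slope={slope:.3f}  intercept={intercept:.3f}  price(1800 sqft)={predicted:.1f}")
```

Real data is noisy, so the line won't pass through every point; least squares then returns the line with the smallest total squared miss.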

Logistic regression, despite the name, predicts a category. It takes the linear output and squashes it through an S-curve (the sigmoid) into a probability between 0 and 1, which you then threshold to pick a class.
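
Here is the squashing step on its own, as a small sketch. The weight `w` and bias `b` are made-up values, not fitted to anything; the point is only to show a linear score turning into a probability and then into a label.

```python
import numpy as np

def sigmoid(z):
    """The S-curve: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature model: z = w * x + b (values chosen for illustration).
w, b = 1.5, -3.0
x = np.array([0.0, 2.0, 4.0])

z = w * x + b                    # linear output, can be any real number
p = sigmoid(z)                   # probability of the positive class
label = (p >= 0.5).astype(int)   # threshold the probability to get a category

print("scores:", z)
print("probabilities:", p.round(3))
print("labels:", label)
```

A score of exactly 0 lands on probability 0.5, which is why the decision boundary of logistic regression is the place where the linear part equals zero.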

This matters enormously: the output layer of nearly every neural-network classifier, including an LLM predicting the next token, is essentially a massive logistic/softmax regression. Learn it small here, recognise it everywhere later.
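
To make the logistic/softmax link concrete, here is a sketch of softmax over a tiny made-up "vocabulary" of four scores, plus a check that the two-class case collapses to the sigmoid. The scores are hypothetical; a real model would produce them from its final layer.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical scores for a 4-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)
next_token = int(np.argmax(probs))      # greedy choice: the most probable token

# With two classes scored (z, 0), softmax gives the same number as sigmoid(z).
z = 1.2
two_class = softmax(np.array([z, 0.0]))
sigmoid_z = 1.0 / (1.0 + np.exp(-z))

print("probabilities:", probs.round(3), "-> picks token", next_token)
print("two-class softmax:", round(float(two_class[0]), 6), "sigmoid:", round(float(sigmoid_z), 6))
```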

The tree family

Decision tree: a flowchart of yes/no questions learned from data. Beautifully interpretable: you can read exactly why it decided. But grown without limits, a single tree memorises its training data instead of learning the pattern.
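
Both halves of that claim can be seen in a few lines. This sketch uses synthetic data (a noisy straight line, generated here, with a feature I've simply named "distance") and scikit-learn's `export_text` to print a shallow tree as readable rules, then grows an unrestricted tree to show it fitting the training set almost perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: a linear signal plus noise (illustrative, not the taxi data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

# A shallow tree is a short flowchart you can read top to bottom.
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(shallow, feature_names=["distance"]))

# With no depth limit the tree keeps splitting until it has memorised the noise.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
train_r2_shallow = shallow.score(X, y)
train_r2_deep = deep.score(X, y)
print("train R², depth 2:  ", round(train_r2_shallow, 3))
print("train R², unlimited:", round(train_r2_deep, 3), "with", deep.get_n_leaves(), "leaves")
```

A near-perfect training score from the unlimited tree is the warning sign, not the win: it says nothing about rides the tree hasn't seen.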

Random forest: train hundreds of slightly different trees and let them vote (for regression, average their predictions). This is bagging: individual trees overfit in different directions, so averaging cancels much of the noise. The crowd is wiser than any single tree.
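
Bagging is simple enough to write by hand. The sketch below uses synthetic data (a noisy sine wave) and plain bootstrap resampling; it is not scikit-learn's RandomForestRegressor, which additionally randomises which features each split may consider. With a single feature, as here, that extra step wouldn't change anything.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic train and test sets drawn from the same noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)
X_test = rng.uniform(0, 10, size=(300, 1))
y_test = np.sin(X_test[:, 0]) + rng.normal(0, 0.3, size=300)

def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return float(np.mean((pred - target) ** 2))

# One fully grown tree: low bias, high variance.
single = DecisionTreeRegressor(random_state=0).fit(X, y)
single_mse = mse(single.predict(X_test), y_test)

# Bagging: fit each tree on a bootstrap sample, then average the predictions.
preds = []
for i in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
    tree = DecisionTreeRegressor(random_state=i).fit(X[idx], y[idx])
    preds.append(tree.predict(X_test))
bagged_mse = mse(np.mean(preds, axis=0), y_test)

print("single tree test MSE:", round(single_mse, 3))
print("bagged trees test MSE:", round(bagged_mse, 3))
```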

Gradient boosting: build trees sequentially, each one fitted to the errors the ensemble so far still makes. This is what XGBoost and LightGBM implement, and well tuned, they are frequent winners of tabular-data competitions.
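
The "each tree fixes the previous errors" loop, written out as a bare-bones sketch for squared-error loss. The data is synthetic, and the learning rate, tree depth and number of rounds are arbitrary illustrative choices. Libraries like XGBoost and LightGBM add regularisation, smarter split finding and much more on top of this core idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

learning_rate = 0.1                       # how much of each correction to apply
prediction = np.full_like(y, y.mean())    # round 0: predict the average everywhere
train_mse = [float(np.mean((y - prediction) ** 2))]

for _ in range(50):
    residual = y - prediction             # what the ensemble still gets wrong
    # Fit a small tree to the residuals (for squared error, the negative gradient).
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * small_tree.predict(X)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print("train MSE after 0 trees: ", round(train_mse[0], 3))
print("train MSE after 50 trees:", round(train_mse[-1], 3))
```

Training error only ever goes down in this loop, which is exactly why boosting needs a held-out set to decide when to stop.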

The result that teaches the most

I ran four of these models (linear regression, a decision tree, a random forest and gradient boosting) on the same real NYC taxi data from Module 1, predicting the fare from a ride's distance and passenger count. Here's the R² each scored on its training rides and on rides it had never seen (1.0 = perfect, 0.0 = no better than guessing the average fare):

model               train R²   test R²
Linear Regression    0.846     0.848
Decision Tree        0.932     0.769   ← overfit — worst on new data
Random Forest        0.922     0.849
Gradient Boosting    0.903     0.883   ← winner

Real data, fixed seed — you'll get the same four numbers.
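
If R² is new to you, here is the definition computed by hand and checked against scikit-learn's `r2_score`. The five fares are invented for illustration; they are not from the taxi data.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R² = 1 - (squared error of the model) / (squared error of always guessing the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical fares in dollars.
fares = np.array([5.0, 8.0, 12.0, 20.0, 30.0])
near_miss = np.array([6.0, 7.0, 13.0, 19.0, 31.0])   # each guess is off by $1

perfect = r_squared(fares, fares)                                  # every fare exactly right
mean_guess = r_squared(fares, np.full_like(fares, fares.mean()))   # always guess the average
close = r_squared(fares, near_miss)

print("perfect:", perfect, " mean guess:", mean_guess, " off by $1:", round(close, 4))
```

Note that R² can go below 0 on unseen data: that means the model is doing worse than the "always guess the average" strategy.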

Here's the lesson, and it's more useful than "fancy always wins" or "simple always wins." Gradient boosting did win — but look how close it was: plain linear regression, the simplest model by far, landed at 0.848 versus the winner's 0.883. A ~3.5-point gap. The simple baseline tied the random forest and came within a whisker of the best model on the board, for a tiny fraction of the complexity.

Meanwhile the lone decision tree overfit: the highest training score on the board and the worst test score, exactly the Module 1 trap. So the rule a senior engineer actually follows: always run the simple baseline first. It's usually right there, it's cheaper, faster and explainable, and it tells you whether a fancy model's extra few points are even worth paying for. Sometimes they are; often they aren't.

The map: when to use what

Structured / tabular data (rows and columns, spreadsheets, business data) → gradient boosting usually wins. Classic ML is not dead here; it dominates. (Our taxi run is a live example — boosting on top.)

Unstructured data (images, text, audio) → deep learning.

So how much classic ML does a GenAI engineer use? You live mostly in deep-learning land — but you'll be tested on these, you'll use them for the countless tabular problems around any AI product, and they're the foundation of everything else. Not gatekeeping — scaffolding.

Try it yourself

import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Load seaborn's sample taxi dataset and drop rows missing any column we use.
df = sns.load_dataset("taxis").dropna(subset=["distance", "passengers", "fare"])
X, y = df[["distance", "passengers"]], df["fare"]
# Hold out 30% of the rides as the test set; the fixed seed keeps the split repeatable.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit each model, then print its name, train R² and test R².
for name, m in [
    ("linear", LinearRegression()),
    ("tree",   DecisionTreeRegressor(random_state=42)),
    ("forest", RandomForestRegressor(n_estimators=200, random_state=42)),
    ("boost",  GradientBoostingRegressor(random_state=42)),
]:
    m.fit(Xtr, ytr)
    # For scikit-learn regressors, .score() returns R².
    print(name, round(m.score(Xtr, ytr), 3), round(m.score(Xte, yte), 3))

Interview questions

Difference between bagging and boosting?

Bagging (random forest) trains many models independently, so they can run in parallel, each on a random bootstrap sample of the data, and averages them: this reduces variance. Boosting (XGBoost) trains models sequentially, each correcting the errors of the ones before it: this reduces bias and usually gives higher accuracy, but needs careful tuning.

Why is it called logistic regression if it classifies?

Because it models the log-odds of the positive class as a linear function of the inputs, then maps that to a probability with the sigmoid. The machinery is regression; thresholding the probability makes it a classifier, and it's the ancestor of neural-net softmax.
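
You can verify the log-odds claim directly. In this sketch (synthetic two-feature data, scikit-learn's LogisticRegression with default settings), the fitted linear score `X @ coef + intercept` matches the log-odds recovered from the predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary labels driven by a noisy linear rule (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# The "regression" part: a plain linear function of the inputs.
linear_score = X @ clf.coef_[0] + clf.intercept_[0]

# The probability the model reports, converted back to log-odds.
p = clf.predict_proba(X)[:, 1]
log_odds = np.log(p / (1.0 - p))

print("max |linear score - log-odds|:", float(np.max(np.abs(linear_score - log_odds))))
```

The printed gap is floating-point noise: the linear score is the log-odds, and the sigmoid is just the function that converts it back into a probability.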

When would you pick a simple model over a deep neural net?

Tabular data, small datasets, when interpretability is required (finance, healthcare), tight compute/latency budgets, or when the baseline already performs well. The experiment above is a live example.

What's feature importance and why does it matter?

A score of how much each input drove predictions. It gives interpretability — you can explain and debug decisions — which neural networks largely sacrifice. Crucial anywhere "why did the model decide this?" is a legal or trust requirement.
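
A small sketch of two common ways to get that score in scikit-learn: the impurity-based `feature_importances_` that tree ensembles expose, and model-agnostic `permutation_importance`. The data is synthetic and built so that the fare depends only on distance; the column names just echo the taxi example and are not the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic rides: fare depends on distance only; passengers is pure distraction.
rng = np.random.default_rng(0)
n = 500
distance = rng.uniform(0, 10, size=n)
passengers = rng.integers(1, 5, size=n).astype(float)
fare = 2.5 * distance + 3.0 + rng.normal(0, 1.0, size=n)

X = np.column_stack([distance, passengers])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, fare)

# Impurity-based importance: share of the variance reduction each feature produced.
impurity = model.feature_importances_

# Permutation importance: how much the score drops when one column is shuffled.
perm = permutation_importance(model, X, fare, n_repeats=5, random_state=0)

for name, imp, pi in zip(["distance", "passengers"], impurity, perm.importances_mean):
    print(f"{name:<11} impurity={imp:.3f}  permutation={pi:.3f}")
```

Both scores should point overwhelmingly at distance here. On real data, prefer computing permutation importance on a held-out set, since importance measured on training data can reward features the model merely memorised.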