Why a future GenAI engineer needs these: structured data is everywhere, interviews test these models relentlessly, and the core ideas here (loss functions, fitting by minimising error, the sigmoid output that softmax generalises) are inherited directly by neural networks. Master these and deep learning becomes "the same ideas, scaled up."
The two foundations
Linear regression predicts a number. It fits the straight line of best fit through your data: "given square footage, predict house price." What it's really teaching you — fit a function to data by minimising error — is exactly what neural networks scale to billions of parameters.
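To make "fit a function to data by minimising error" concrete, here is a minimal sketch using NumPy's least-squares solver. The square-footage and price figures are invented purely for illustration; they are not from the taxi data or any real housing dataset.

```python
import numpy as np

# Hypothetical toy data: square footage and price in $1000s (made up).
sqft = np.array([800.0, 1000.0, 1200.0, 1500.0, 2000.0])
price = np.array([160.0, 195.0, 250.0, 290.0, 410.0])

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([sqft, np.ones_like(sqft)])

# Least squares: pick the slope and intercept that minimise the sum of
# squared differences between predicted and actual prices.
(slope, intercept), *_ = np.linalg.lstsq(A, price, rcond=None)


def predict(x):
    """Price predicted by the fitted line for square footage x."""
    return slope * x + intercept


mse = np.mean((predict(sqft) - price) ** 2)
print(f"slope={slope:.4f} intercept={intercept:.2f} mse={mse:.2f}")
print("predicted price for 1800 sq ft:", round(float(predict(1800.0)), 1))
```

Any other slope or intercept gives a larger mean squared error on these points. A neural network does the same kind of error minimisation, only with far more parameters and an iterative optimiser instead of a closed-form solve.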
Logistic regression, despite the name, predicts a category. It takes the linear output and squashes it through an S-curve (the sigmoid) into a probability between 0 and 1; thresholding that probability, typically at 0.5, gives the predicted class.
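The squashing step can be sketched in a few lines. The weight `w` and bias `b` below are hypothetical values chosen by hand for illustration; in a real model they are learned from data.

```python
import numpy as np


def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters for a one-feature model (normally learned).
w, b = 1.5, -3.0

x = np.array([0.0, 2.0, 4.0])
z = w * x + b                       # the linear output: -3, 0, 3
p = sigmoid(z)                      # probability of the positive class
labels = (p >= 0.5).astype(int)     # threshold at 0.5 to get a category

print("probabilities:", p.round(3))
print("labels:       ", labels)
```

A linear output of exactly 0 maps to a probability of 0.5, which is why the decision boundary of logistic regression is a straight line (or a flat plane) in the input space.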
The tree family
Decision tree: a flowchart of yes/no questions learned from data. Beautifully interpretable: you can read exactly why it decided. But a single tree grown without limits tends to memorise its training data.
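Both halves of that description can be seen in a short sketch with scikit-learn. The data here is synthetic (a noisy straight line), and the feature name `distance` is only a label chosen for the printout, not the real taxi column.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: a noisy linear relationship (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 2.0, size=200)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# A shallow tree is small enough to read as a flowchart.
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, y_train)
print(export_text(shallow, feature_names=["distance"]))

# A tree with no depth limit keeps splitting until it reproduces the
# training targets, noise included.
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("deep tree train R²:", round(deep.score(X_train, y_train), 3))
print("deep tree test  R²:", round(deep.score(X_test, y_test), 3))
```

The gap between the deep tree's train and test scores is the memorisation in action. Limiting `max_depth` or `min_samples_leaf` is the usual way to rein it in.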
Random forest: train hundreds of slightly different trees, each on its own bootstrap sample of the data, and let them vote (for a numeric target, average their predictions). This is bagging: individual trees overfit in different directions, so averaging cancels the noise. The crowd is wiser than any single tree.
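The "crowd" is literal. As a small check on synthetic data (an assumption for the demo), the sketch below averages the individual trees of a fitted scikit-learn `RandomForestRegressor` by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: a noisy sine wave (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

X_new = np.array([[2.5], [7.0]])

# Ask every tree separately, then average: this is the bagging step.
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("spread across trees:", per_tree.std(axis=0).round(3))
print("manual average:     ", manual_average.round(3))
print("forest.predict:     ", forest.predict(X_new).round(3))
```

The spread line shows how much single trees disagree with one another; the average is steadier than any one of them.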
Gradient boosting: build trees sequentially, each one fitted to the errors the ensemble so far still makes. This is the idea behind XGBoost and LightGBM, and well-tuned versions are frequent winners of tabular-data competitions.
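The sequential error-fixing loop can be hand-rolled in a few lines. This is a simplified sketch for squared-error loss on synthetic data, with a learning rate and tree depth picked arbitrarily; it is not how XGBoost or LightGBM are implemented internally, since those libraries add regularisation and many engineering optimisations on top.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine wave (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

learning_rate = 0.1                       # arbitrary choice for the demo
prediction = np.full_like(y, y.mean())    # start by predicting the mean
trees, mse_history = [], []

for _ in range(50):
    residual = y - prediction             # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)   # nudge towards the target
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 1 tree:   {mse_history[0]:.4f}")
print(f"training MSE after 50 trees: {mse_history[-1]:.4f}")
```

Each small tree is weak on its own, but every round removes a slice of the remaining error. The same loop will eventually chase noise if it runs for too long, which is why boosted models need tuning of the number of trees and the learning rate.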
The result that teaches the most
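I ran the four regressors (linear regression plus the three tree models) on the same real NYC taxi data from Module 1, predicting the fare from a ride's distance and passenger count. Logistic regression sits this one out because the target is a number, not a category. Here's the R² each scored on rides it had never seen (1.0 = perfect, 0.0 = no better than guessing the average fare):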
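The R² scale described above follows directly from its definition: one minus the model's squared error divided by the squared error of always guessing the mean. A minimal sketch, using a handful of invented fares rather than the real taxi data:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R² = 1 - (model's squared error) / (squared error of guessing the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical fares, made up for illustration.
fares = np.array([6.0, 9.5, 12.0, 20.0, 52.0])

print(r_squared(fares, fares))                               # 1.0: perfect
print(r_squared(fares, np.full_like(fares, fares.mean())))   # 0.0: mean guess
print(r_squared(fares, fares[::-1]) < 0)                     # worse than the mean
```

The last line is a reminder that R² has no floor at zero: a model that does worse than guessing the average scores below 0.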
Model               Train R²   Test R²
Linear Regression   0.846      0.848
Decision Tree       0.932      0.769   ← overfit, worst on new data
Random Forest       0.922      0.849
Gradient Boosting   0.903      0.883   ← winner
Real data, fixed seed — you'll get the same four numbers.
Here's the lesson, and it's more useful than "fancy always wins" or "simple always wins." Gradient boosting did win, but look how close it was: plain linear regression, the simplest model by far, landed at 0.848 versus the winner's 0.883. A gap of about 3.5 points. The simple baseline effectively tied the random forest (0.848 vs 0.849) and came within a whisker of the best model on the board, for a tiny fraction of the complexity.
Meanwhile the lone decision tree overfit: the highest training score on the board (0.932) and the worst test score (0.769), exactly the Module 1 trap. So the rule a senior engineer actually follows: always run the simple baseline first. It's usually right there, it's cheaper, faster and explainable, and it tells you whether a fancy model's extra few points are even worth paying for. Sometimes they are; often they aren't.
The map: when to use what
Structured / tabular data (rows and columns, spreadsheets, business data) → gradient boosting usually wins. Classic ML is not dead here; it dominates. (Our taxi run is a live example — boosting on top.)
Unstructured data (images, text, audio) → deep learning.
Try it yourself
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Load the taxi rides and drop rows missing any column we use.
df = sns.load_dataset("taxis").dropna(subset=["distance", "passengers", "fare"])
X, y = df[["distance", "passengers"]], df["fare"]

# Hold out 30% of the rides; the fixed seed keeps the split reproducible.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit each model, then print: name, train R², test R².
for name, m in [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=42)),
    ("forest", RandomForestRegressor(n_estimators=200, random_state=42)),
    ("boost", GradientBoostingRegressor(random_state=42)),
]:
    m.fit(Xtr, ytr)
    print(name, round(m.score(Xtr, ytr), 3), round(m.score(Xte, yte), 3))
Interview questions
Difference between bagging and boosting?
Bagging (random forest) trains many models independently, each on a bootstrap sample of the training rows, and averages them, which reduces variance. Boosting (XGBoost, LightGBM) trains models sequentially, each correcting the errors left by the ones before it, which reduces bias. Boosting usually reaches higher accuracy but needs careful tuning.
Why is it called logistic regression if it classifies?
Because it models the log-odds of the positive class as a linear function of the inputs, then maps that to a probability with the sigmoid. The machinery is regression; thresholding the output makes it a classifier. The sigmoid is also the two-class special case of the softmax used in neural-network output layers.
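For readers who want the algebra behind that answer, here is a short derivation. The symbols are assumptions introduced for this sketch: x is the input vector, w and b are the learned weights and bias, p is the probability of the positive class, and z_1, z_0 are the two class scores fed to a softmax.

```latex
\begin{aligned}
\log\frac{p}{1-p} &= w^\top x + b
  && \text{(linear model of the log-odds)} \\
p &= \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
  && \text{(solve for } p\text{)} \\
\operatorname{softmax}(z_1, z_0)_1 &= \frac{e^{z_1}}{e^{z_1} + e^{z_0}} = \sigma(z_1 - z_0)
  && \text{(two-class softmax is a sigmoid)}
\end{aligned}
```

The last line is the link to neural networks: with only two classes, the softmax output depends on the difference between the two scores and collapses to the same sigmoid.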
When would you pick a simple model over a deep neural net?
Tabular data, small datasets, when interpretability is required (finance, healthcare), tight compute/latency budgets, or when the baseline already performs well. The experiment above is a live example.
What's feature importance and why does it matter?
A score of how much each input drove predictions. It gives interpretability — you can explain and debug decisions — which neural networks largely sacrifice. Crucial anywhere "why did the model decide this?" is a legal or trust requirement.
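As a hedged illustration, scikit-learn's tree ensembles expose impurity-based scores through the `feature_importances_` attribute. The data below is synthetic and the feature names `signal` and `noise` are hypothetical: one column drives the target and the other is unrelated to it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: "signal" drives the target, "noise" does not.
rng = np.random.default_rng(0)
n = 500
signal = rng.uniform(0, 10, size=n)
noise = rng.uniform(0, 10, size=n)
X = np.column_stack([signal, noise])
y = 3.0 * signal + rng.normal(0, 1.0, size=n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one score per input column, summing to 1.
for name, score in zip(["signal", "noise"], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```

Impurity-based scores are only one way to measure importance. Permutation importance (`sklearn.inspection.permutation_importance`) is a common alternative that is evaluated on held-out data.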