🤖 Course Curriculum · Premium Edition

Introduction to Machine Learning

From Spotify Recommendations to Medical Diagnosis — Build the Algorithms That Power the World

AI & Machine Learning · Intermediate · Ages 14–17 · 32 Hours

Course At a Glance

Why This Course Exists

You Already Use Machine Learning 200 Times a Day.

Every recommendation you scroll past, every spam email that never reaches your inbox, every photo that gets auto-tagged with your name — that's a trained ML model making a decision. This course teaches you how to build those systems from scratch.

By the time you finish, you'll have shipped 4 real ML projects, worked with industry-standard tools (scikit-learn, pandas, seaborn), and developed the analytical thinking that top universities and tech companies are actively looking for.

Four Transformative Outcomes

1. Understand core machine learning concepts and workflows: the same pipeline used at Google, Netflix, and NHS Digital.
2. Implement supervised and unsupervised learning algorithms in Python using scikit-learn, the industry-standard ML toolkit.
3. Evaluate and interpret models using professional-grade metrics (ROC/AUC, F1, R², cross-validation), not just 'accuracy'.
4. Deliver a complete ML project: problem framing → EDA → preprocessing pipeline → model training → evaluation → presentation.

What Makes This A Premium Program

🏭

Industry-Authentic Workflow

Every lesson mirrors how professional data scientists actually work — same tools, same pipeline, same rigour.

🌍

Real Product Examples

Netflix, Spotify, Uber, NHS, Tesla — students see ML in the products they already use every day.

🔬

No Shortcuts on Depth

We explain the maths behind every algorithm intuitively. Students who graduate understand HOW, not just HOW TO.

Module 1

Introduction to Machine Learning Concepts

Students discover how ML powers the products they already love, learn the complete ML workflow, master data cleaning and feature engineering, and build their first real predictive model. Zero fluff — every concept is immediately applied to real data.

Approx. 8 hrs

#	Lesson Title	What Students Learn	Build / Activity	Tools & Methods
1.1	How Netflix Knows What You Want to Watch	ML is how computers learn from data instead of following hand-coded rules, and it powers many of the products you use. Contrast traditional programming (rules + data → output) with ML (data + output → rules). Map the three paradigms: supervised, unsupervised, and reinforcement learning. Real examples: how Spotify builds Discover Weekly, how Gmail detects spam, how TikTok's FYP works.	🔍 Reverse-Engineer Reality: Pick 3 apps you use daily. For each: what data does the ML model consume? What is it predicting? Is it supervised or unsupervised? Sketch the data pipeline. Debate: when does personalisation become manipulation?	ML paradigms · supervised / unsupervised / reinforcement · training data · prediction
1.2	The Recipe for Every ML Project	Every ML project follows the same pipeline: Define Problem → Collect Data → EDA → Preprocess → Train → Evaluate → Tune → Deploy. Understand why skipping any step causes real-world failures (Amazon's biased hiring tool, Microsoft's Tay chatbot). The train-test split is the single most important concept for honest evaluation — without it, your model is useless in the real world.	🗺️ Pipeline Detective: Given the problem 'predict a student's final grade from week-3 data', map the full pipeline. What data would you collect? What would the label be? How would you measure success? What could go wrong at each step?	ML pipeline: problem → EDA → preprocess → train → evaluate → deploy
1.3	Features Are Everything: Garbage In, Garbage Out	Features (X) are what the model learns from; the label (y) is what it predicts. The quality of features determines model quality, and no algorithm can fix bad inputs. Learn the feature types: continuous, categorical, ordinal, and binary. Feature engineering: creating powerful new signals from raw data (e.g. 'days since last purchase' from a timestamp). Why the curse of dimensionality means more features are not always better.	⚗️ Feature Lab: Raw student dataset with DOB, attendance_days, module scores, absences. Engineer 2 new high-signal features (attendance_rate, avg_module_score). Which raw columns are useless for prediction? Use .corr() to find the 3 features most correlated with the pass/fail label.	features (X) · label (y) · feature engineering · .corr() · continuous / categorical
1.4	From Messy Real Data to ML-Ready Data	Real-world data is rarely clean: missing values, wrong types, inconsistent scales, and categorical text that models can't read. Build a production-style preprocessing pipeline: SimpleImputer → OneHotEncoder → StandardScaler, chained with ColumnTransformer and Pipeline. Understand WHY scaling matters: distance-based models such as KNN and models trained by gradient descent are distorted by unscaled features.	🛠️ Data Cleaning Sprint: Load a messy Airbnb or Netflix dataset with nulls, mixed types, and text categories. Build a ColumnTransformer pipeline. Print .head() before and after: the transformed data should have no NaNs and z-scored numerics. Challenge: identify one feature you'd drop and explain why.	SimpleImputer · OneHotEncoder · StandardScaler · ColumnTransformer · Pipeline
1.5	Why Your Model Might Be Lying to You	Evaluating a model on its own training data almost always produces inflated accuracy, because the model has already seen the answers. Use train_test_split() to hold out a genuine test set. Stratified splitting preserves class proportions. K-fold cross-validation: rotate through k splits for a more honest, lower-variance performance estimate. Real consequence: models that 'work' in notebooks but fail in production.	📊 Honest Evaluation: Load the Titanic dataset. Compare: (1) accuracy on training data, (2) accuracy on test split, (3) 5-fold cross-val mean ± std. The gap between (1) and (2)/(3) is your overfitting indicator. Which gives the most trustworthy estimate? Explain to a classmate why (1) is dishonest.	train_test_split(stratify=y) · cross_val_score(cv=5) · StratifiedKFold
1.6	scikit-learn: The Industry Standard ML Toolkit	scikit-learn's beauty is its consistent API: every model has .fit(), .predict(), and .score(), while transformers pair .fit() with .transform(). Learn one, learn them all. The same estimator conventions apply to classifiers, regressors, and transformers. This is the same toolkit used by data scientists at Airbnb, Spotify, and Meta. Build your first real model on California housing data.	🏠 First Model: Load California Housing data. Fit LinearRegression, compute R², MAE, RMSE. Plot actual vs predicted as a scatter chart. Ask: would you trust this model to value a real house? What's the biggest prediction error and why?	from sklearn.linear_model import LinearRegression · .fit() · .predict() · .score() · r2_score
1.7	EDA: The Detective Work Before Modelling	EDA (Exploratory Data Analysis) is how data scientists understand their data before touching an algorithm. ML-focused EDA: class balance with value_counts(), feature correlations via heatmap, distribution plots by class, and outlier detection. EDA findings directly determine which algorithm to use and how to handle preprocessing. Skip it and you'll build a model on a misunderstood dataset.	🕵️ Data Investigation: Load the Titanic or Student Performance dataset. Investigate: Are the outcome classes balanced? Which features correlate most with the outcome? Are there outliers? Write 5 actionable EDA findings ('Feature X is skewed, so we should log-transform it before modelling').	.value_counts() · .corr() · sns.heatmap() · sns.boxplot(hue=label) · .describe()
1.8	MODULE 1 PROJECT: The ML-Ready Dataset	Apply the full pre-modelling workflow to a real dataset, following the same process a junior data scientist would use on their first day at a company. Produce a clean, documented, ML-ready dataset with every preprocessing decision justified. This is the foundation every subsequent model will be built on.	🚀 Pipeline Project: Take a raw dataset (student performance, housing prices, or Titanic). Deliver: (1) EDA report with 5 key findings that justify modelling decisions, (2) engineered features with rationale, (3) leakage-free ColumnTransformer pipeline, (4) stratified 80/20 split with verified class distributions. Comment every code block.	Full Module 1: pandas EDA · ColumnTransformer · Pipeline · train_test_split · feature engineering
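
The preprocessing and honest-evaluation steps from Lessons 1.4 and 1.5 can be sketched together as follows. This is a minimal illustration, not the course's reference solution: the toy dataset, its column names (`attendance_rate`, `avg_module_score`, `study_mode`), and the choice of LogisticRegression as the final estimator are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: every column name and value here is invented.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "attendance_rate": rng.uniform(0.4, 1.0, n),
    "avg_module_score": rng.normal(60, 12, n),
    "study_mode": rng.choice(["full_time", "part_time"], n),
})
# The label depends on the features plus noise, so there is a signal to learn.
signal = 4 * df["attendance_rate"] + df["avg_module_score"] / 20 + rng.normal(0, 0.5, n)
df["passed"] = (signal > signal.median()).astype(int)
# Knock out some values to mimic messy real data.
df.loc[rng.choice(n, 20, replace=False), "avg_module_score"] = np.nan
df.loc[rng.choice(n, 10, replace=False), "study_mode"] = np.nan

numeric_cols = ["attendance_rate", "avg_module_score"]
categorical_cols = ["study_mode"]

# Numeric columns: fill gaps with the median, then z-score.
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

X = df.drop(columns="passed")
y = df["passed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit on the training split only; the test split stays unseen until scoring.
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
# Cross-validation refits the whole pipeline inside each fold.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Inspect what the classifier actually receives after preprocessing.
X_train_ready = model.named_steps["preprocess"].transform(X_train)
print("NaNs after preprocessing:", int(np.isnan(X_train_ready).sum()))
print(f"train acc {train_acc:.3f} | test acc {test_acc:.3f}")
print(f"5-fold CV {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

Because the imputers, encoder, and scaler live inside the Pipeline, they are fitted only on whichever rows are passed to `.fit()`, which is what keeps the test split and each CV fold honest.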

Module 2

Supervised Learning Algorithms

Students build 6 different ML algorithms from regression to ensemble methods — the same algorithms used in house price prediction, credit scoring, medical diagnosis, and music recommendation. Each lesson produces a working, evaluated model.

Approx. 8 hrs

#	Lesson Title	What Students Learn	Build / Activity	Tools & Methods
2.1	Teaching a Machine to Predict House Prices	Linear regression finds the line (or hyperplane) that minimises the sum of squared residuals, where a residual is the difference between what the model predicts and reality. The equation y = w₀ + w₁x₁ + ... underlies simple models of property values, interest rates, and insurance premiums. Evaluation metrics: MAE (average error in original units), RMSE (penalises large errors harder), and R² (proportion of variance explained; 1.0 = perfect).	🏡 House Price Predictor: Load a housing dataset (area, rooms, location_encoded). Fit LinearRegression. Compute MAE, RMSE, R². Scatter-plot actual vs predicted. Challenge question: if R² = 0.65, would you trust this model to price your family home? What information is it missing?	LinearRegression · mean_absolute_error · mean_squared_error · r2_score · np.sqrt()
2.2	When Straight Lines Aren't Enough: Polynomial Regression	Real relationships between variables are rarely linear: a car's fuel efficiency doesn't change linearly with speed, and performance doesn't fall off linearly with stress. PolynomialFeatures creates new features (x², x³, x₁×x₂) that let linear models fit curved data. The bias-variance tradeoff: degree-1 underfits (too simple), degree-10 overfits (memorises noise). Learning curves reveal how model performance scales with data quantity.	📈 Curve Fitting Race: Dataset with a clear curved trend. Fit degree-1, 2, and 3 polynomials. Plot all three curves over the data. Which wins on train data? Which wins on test data? Plot learning curves for degree-2. What happens as training set size grows?	PolynomialFeatures(degree=2) · Pipeline([poly, scaler, model]) · learning_curve()
2.3	Will This Patient Test Positive? Logistic Regression	Logistic regression outputs a probability (0–1) via the sigmoid function, which makes it well suited to binary classification. Decision threshold: 0.5 by default, but you can raise it (fewer false positives) or lower it (fewer false negatives) depending on the stakes. Metrics that actually matter: precision (when you say positive, how often are you correct?) and recall (how many actual positives did you catch?). In medical diagnosis, recall can be life-or-death: a missed cancer diagnosis is usually worse than a false alarm.	🩺 Medical Screening Model: Predict diabetes risk (Pima Indians dataset). Train LogisticRegression. Print accuracy, precision, recall, F1. Draw the confusion matrix and label each cell (TP, TN, FP, FN). Lower the decision threshold to 0.3. How does recall change? What is the trade-off?	LogisticRegression · confusion_matrix · classification_report · precision_score · recall_score
2.4	KNN: How Spotify Decides You Might Like That Song	KNN classifies new points by majority vote from their k nearest neighbours in feature space, the same idea behind Spotify's 'fans also like' and Amazon's 'customers also bought'. Euclidean distance is the default measure of similarity. Small k = sharp, noisy boundaries (high variance, overfits). Large k = smooth, blurry boundaries (high bias, underfits). Validation curves find the optimal k. Visualising decision boundaries makes overfitting tangible.	🎵 Music Genre Classifier: Iris dataset as a proxy. Train KNeighborsClassifier for k = 1, 3, 5, 7, 15. Plot accuracy vs k. Visualise 2D decision boundaries for the best k. Compare against LogisticRegression. At what k does the boundary look 'just right'?	KNeighborsClassifier(n_neighbors=k) · validation_curve() · decision boundary visualisation
2.5	Decision Trees: The Algorithm You Can Actually Explain	Decision trees build a flowchart of if-else splits, choosing at each node the split that reduces impurity the most (measured by Gini impurity or entropy). They are among the most interpretable ML models: a doctor, judge, or bank manager can read the tree and understand every decision. The problem: without a max_depth limit, trees can memorise the training data (near 100% train accuracy, much worse test accuracy). Limiting depth or pruning is essential.	💳 Credit Risk Tree: Load a loan approval dataset. Train unpruned (max_depth=None) vs pruned (max_depth=4) trees. Compare train/test accuracy to see overfitting in action. Export and plot the pruned tree with plot_tree(). Trace the path a specific loan applicant takes to approval/rejection.	DecisionTreeClassifier(max_depth=) · plot_tree() · export_text() · feature_importances_
2.6	Random Forest: Why 500 Imperfect Trees Beat One Perfect Tree	Ensemble learning: aggregating many imperfect models creates a stronger combined model. Random Forest builds hundreds of trees on random data subsets and random feature subsets, then averages their predictions for regression or takes a majority vote for classification (bagging). Why this works: individual trees make different errors, and when averaged those errors tend to cancel out. Used at Facebook for content ranking, at banks for fraud detection, by hospitals for readmission prediction.	🌳 Ensemble vs Single Tree: Train RandomForestRegressor on housing data. Compare RMSE against your LinearRegression from Lesson 2.1. Plot the top 8 feature importances. What does the model think matters most? Tune n_estimators (50, 100, 200) and plot RMSE against the number of trees. When do more trees stop helping?	RandomForestClassifier/Regressor · n_estimators · feature_importances_ · bagging
2.7	Hyperparameter Tuning: Squeezing Every Drop of Performance	Model parameters are learned from data (weights, split thresholds). Hyperparameters are set before training (tree depth, number of trees, learning rate). GridSearchCV exhaustively tests every combination in a grid using cross-validation, a standard technique in Kaggle competitions. RandomizedSearchCV samples the grid randomly, which is much faster for large spaces. Warning: tune on validation data, evaluate finally on test data, and never peek at test data during tuning.	🎛️ Tuning Lab: GridSearchCV on RandomForestClassifier with grid: n_estimators=[50,100,200], max_depth=[None,5,10], min_samples_split=[2,5]. Print best_params_ and the improvement over defaults. How much did tuning gain? Discuss: could you always tune to perfection given unlimited time?	GridSearchCV(param_grid, cv=5) · best_params_ · best_estimator_ · RandomizedSearchCV
2.8	MODULE 2 PROJECT: The Model Comparison Report	Real ML projects don't pick one algorithm and hope for the best: they compare multiple candidates systematically. This is a standard industry workflow: establish a baseline, test candidates, tune the winner, evaluate once on the held-out test set. Build it, defend your choices.	📋 Model Shootout: Load a real dataset. Test 3 algorithms (LogisticRegression / LinearRegression, DecisionTree, RandomForest). Apply GridSearchCV to the best performer. Produce a comparison table: Model | CV Score | Test Score | Train Score | Overfitting? Justify your final model selection in 3 sentences.	Full Module 2: regression · classification · GridSearchCV · confusion matrix · ROC/AUC
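
The threshold trade-off from Lesson 2.3 can be seen in a few lines. This is a sketch under stated assumptions: it uses a synthetic dataset from `make_classification` rather than the Pima Indians data, and the thresholds 0.5 and 0.3 simply mirror the lesson activity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a screening dataset: roughly 30% positives.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Probability of the positive class for each test sample.
proba = clf.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    # Turn probabilities into class labels using our own cut-off.
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
    }
    print(
        f"threshold={threshold}: "
        f"precision={results[threshold]['precision']:.3f}, "
        f"recall={results[threshold]['recall']:.3f}"
    )
```

Lowering the threshold can only add predicted positives, so recall never goes down; the price is usually lower precision, which is the trade-off students are asked to describe.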

Module 3

Unsupervised Learning & Model Evaluation

Students go beyond accuracy: clustering (K-Means, DBSCAN, Hierarchical), dimensionality reduction (PCA), professional evaluation metrics (ROC/AUC, F1), regularisation, and the most important lesson in ML — preventing data leakage.

Approx. 8 hrs

#	Lesson Title	What Students Learn	Build / Activity	Tools & Methods
3.1	How Amazon Segments Its 300 Million Customers	Unsupervised learning finds hidden structure in unlabelled data. K-Means clusters data by assigning points to their nearest centroid, recalculating centroids, and repeating until the assignments stop changing. Used by retail companies to segment customers, by hospitals to identify patient risk groups, by streaming platforms to discover taste communities. k-means++ initialisation reduces the sensitivity to random starting points.	🛒 Customer Segmentation: Retail dataset (annual income × spending score). Apply KMeans(n_clusters=4). Plot colour-coded scatter with centroids. Name each cluster: 'Cluster 0 = high earners, careful spenders'. Write a 1-paragraph marketing strategy recommendation for one cluster.	KMeans(n_clusters=k, init='k-means++') · .fit_predict(X) · cluster_centers_
3.2	Choosing K: Finding the Elbow That Changes Everything	K-Means requires you to specify k upfront, but the right k is rarely obvious. The Elbow Method: plot inertia (the within-cluster sum of squared distances) against k and find where adding more clusters stops improving much. The Silhouette Score (−1 to +1) measures how well each point fits its own cluster vs others. Using both methods together is the professional approach.	📐 K Selection Lab: Run KMeans for k=2 through k=10 on the customer dataset. Plot the Elbow curve AND Silhouette scores. Do they agree on k? Re-run with the optimal k and produce a final cluster interpretation. Debate: what k would a marketing team actually want?	inertia_ · silhouette_score() · elbow plot · [KMeans(n_clusters=k).fit(X).inertia_ for k in range(2, 11)]
3.3	DBSCAN: The Clustering Algorithm That Handles Weird Shapes	K-Means assumes roughly spherical clusters, so it fails on ring shapes, crescents, and clusters of unequal density. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups dense regions and labels sparse points as noise (label −1), with no need to specify k. Used for anomaly detection in financial fraud and GPS tracking. Agglomerative Hierarchical Clustering builds a dendrogram tree: cut it at different heights to get different numbers of clusters.	🌊 Shape Challenge: Generate blobs, crescents (make_moons), and anisotropic clusters. Apply KMeans, DBSCAN, and AgglomerativeClustering to all three. Plot a 3×3 results grid. Which algorithm wins on each shape? DBSCAN should clearly outperform KMeans on the crescents.	DBSCAN(eps, min_samples) · AgglomerativeClustering · dendrogram() · make_moons()
3.4	PCA: How Face ID Compresses Your Face to 128 Numbers	High-dimensional data (thousands of features) is sparse, slow, and impossible to visualise. PCA finds the directions of maximum variance and projects data onto fewer dimensions while preserving as much information as possible. Face-recognition systems such as Apple's Face ID pursue the same goal of a compact numeric representation of a face, although they rely on neural networks rather than PCA. The explained variance ratio tells you how much information each component retains.	🔢 Digit Vision: Load the Digits dataset (64 pixel features, 10 classes). Apply PCA(n_components=2) and plot the 2D projection colour-coded by digit. Which digits cluster tightly together? Which overlap? Compute cumulative explained variance. How many components are needed to retain 95% of the variance?	PCA(n_components=2) · .explained_variance_ratio_ · np.cumsum() · pca.transform(X)
3.5	Overfitting: The Silent Killer of Real-World ML	Overfitting is when a model learns the training data so closely that it fails to generalise, like memorising exam answers without understanding the concepts. Underfitting: the model is too simple to capture the pattern. The bias-variance tradeoff is the fundamental tension in ML. Regularisation is the standard defence against overfitting: L1/Lasso can push some coefficients to exactly zero (automatic feature selection), while L2/Ridge shrinks all coefficients towards zero (keeping any single feature from dominating).	📉 Overfit Detector: Train DecisionTreeClassifier with max_depth 1 to 20. Plot train and test accuracy on the same chart and find where they diverge. That's the overfit cliff. Apply Ridge and Lasso on housing data and note that Lasso zeros some coefficients. Which features does Lasso keep? What does this tell you about feature importance?	Ridge(alpha=) · Lasso(alpha=) · learning_curve() · train vs test accuracy · regularisation paths
3.6	Beyond Accuracy: Metrics That Actually Tell the Truth	A model that's 95% accurate on a fraud dataset with 5% fraud cases can be useless: it can achieve 95% by predicting 'not fraud' every time. ROC/AUC summarises model performance across every possible decision threshold: a random model has AUC 0.5, a perfect one has 1.0. The Precision-Recall curve is better for severely imbalanced data. For multi-class: macro F1 (treats all classes equally) vs weighted F1 (weights by class size).	🎯 Fraud Detection Challenge: Train LogisticRegression on an imbalanced dataset (5% positives). Plot ROC curve with AUC, Precision-Recall curve, and confusion matrix. Compute macro and weighted F1. If accuracy is 95% but AUC is 0.52, is this model deployable? What would you tell the business?	roc_auc_score · roc_curve · precision_recall_curve · f1_score(average='macro'/'weighted')
3.7	Data Leakage: The Mistake That Fools Everyone (Including Experts)	Data leakage is when information from the test set contaminates the training process, producing models that appear to work brilliantly but fail in production. The most common cause: fitting scalers and encoders on the full dataset before splitting. The fix: always use sklearn Pipeline so transformers are fitted inside CV folds. Nested cross-validation (inner CV for tuning, outer CV for unbiased evaluation) is the gold standard.	💧 Leakage Live Demo: Scale the full dataset first, then split, and record the inflated accuracy. Fix using Pipeline inside cross_val_score and see the score drop to reality. Apply StratifiedKFold(n_splits=10) with Pipeline on the student dataset. Report mean ± std. Explain to a classmate why the first approach was dishonest.	StratifiedKFold · TimeSeriesSplit · Pipeline inside cross_val_score · data leakage prevention
3.8	MODULE 3 PROJECT: The Full Evaluation Report	The best data scientists don't just train models: they evaluate them rigorously, identify failure modes, and communicate findings clearly. Combine clustering analysis, dimensionality reduction, evaluation metrics, regularisation, and leakage-free CV into a single comprehensive project report.	📊 Full Evaluation Report: Chosen dataset → (1) K-Means clustering with elbow/silhouette + cluster interpretation, (2) PCA 2D visualisation with explained variance, (3) 3-model comparison with ROC/AUC curves, (4) learning curve for the best model, (5) written recommendation: which model, which k, and why, in plain English a non-technical stakeholder could act on.	KMeans · DBSCAN · PCA · ROC/AUC · Ridge/Lasso · StratifiedKFold · leakage-free Pipeline
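
The accuracy trap described in Lesson 3.6 is easy to reproduce. The sketch below assumes a synthetic label vector with exactly 5% positives and uses scikit-learn's `DummyClassifier` as the "always predict not fraud" model; no real fraud data is involved.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Synthetic labels: 50 fraud cases (1) among 1000 transactions, i.e. 5% positives.
y = np.array([1] * 50 + [0] * 950)
# The baseline ignores its inputs, so placeholder features are enough.
X = np.zeros((len(y), 1))

# A "model" that always predicts the majority class (not fraud).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)
y_score = baseline.predict_proba(X)[:, 1]  # predicted probability of fraud

accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
auc = roc_auc_score(y, y_score)

print(f"accuracy = {accuracy:.2f}")  # 950 of 1000 correct
print(f"recall   = {recall:.2f}")    # no fraud case is ever caught
print(f"ROC AUC  = {auc:.2f}")       # constant scores cannot rank cases
```

The baseline scores 0.95 accuracy while catching none of the 50 fraud cases, and its AUC of 0.5 shows it ranks cases no better than chance. Any real model should be compared against a baseline like this before its accuracy is trusted.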

Module 4

Machine Learning Capstone Project

Students ship a complete ML product: real problem, real dataset, complete pipeline, evaluated model, professional visualisations, written report, and a live Demo Day presentation. This is the portfolio piece that gets students into university ML courses and tech internships.

Approx. 8 hrs

#	Lesson Title	What Students Learn	Build / Activity	Tools & Methods
4.1	Choosing Your ML Problem: What Would You Actually Build?	Great ML projects start with a well-framed problem, not a dataset. A Problem Definition Document forces you to think like a product manager and a data scientist: what question does the model answer, who will use it, how will success be measured, and what are the ethical implications? This is the same document data science teams produce before any code is written at Google, Netflix, or NHS Digital.	🎯 Problem Definition: Choose a project track. Write a Problem Definition Document: project title, the real-world question, dataset source, target variable type (regression/classification), 3 evaluation metrics and why they matter for this specific problem, one ethical concern, and anticipated challenges. Teacher sign-off before data work begins.	Problem framing · regression vs classification · metric selection · ethical consideration
4.2	Exploring Your Dataset Like a Data Detective	EDA is the most undervalued step in ML: it prevents expensive mistakes and reveals what's actually in the data versus what you assumed. ML-focused EDA: class distribution (are labels balanced?), missing data audit (what strategy for each column?), correlation heatmap (which features are most predictive?), distribution plots per class, and outlier investigation.	🔍 EDA Report: For your chosen dataset, produce: (1) missing data table with imputation strategy, (2) class balance chart with imbalance ratio, (3) correlation heatmap identifying the top 5 features, (4) distribution plots for 3 key features, (5) 5-point written EDA summary that directly shapes your preprocessing and modelling decisions. Each insight must be actionable.	value_counts() · .isnull().sum() · .corr() · sns.pairplot() · sns.heatmap() · .describe()
4.3	Building a Production-Grade Preprocessing Pipeline	A production data pipeline handles unseen data correctly: it doesn't re-learn on test data. Using ColumnTransformer + Pipeline ensures your imputers, encoders, and scalers are fitted only on training data and applied consistently to new inputs. Engineer at least 2 new features and document why they will improve model performance. This architecture is what separates toy notebooks from deployable systems.	⚙️ Pipeline Build Sprint: Construct the full ColumnTransformer pipeline, fit it only on X_train, then transform X_test. Verify: no NaNs remain, all features are numeric, scaled features have mean ~0 and std ~1. Document every transformer in a code comment block. Challenge: add a custom FunctionTransformer for your engineered features.	ColumnTransformer · Pipeline · .fit_transform(X_train) · .transform(X_test) · FunctionTransformer
4.4	Training Your Models: Let the Competition Begin	Test at least 3 algorithms appropriate to the problem type. Use 5-fold cross-validation for reliable comparison, because single train/test splits are too noisy. The model comparison table is the central deliverable: it forces systematic thinking rather than anchoring on the first model that 'works'. Feature importances reveal what the model actually learned, often the most surprising insight.	🏆 Model Shootout Sprint: Train 3+ models with defaults. Compute 5-fold CV mean ± std for each. Build the comparison table: Model | CV Score | Test Score | Train Score | Gap (overfit?). For the best model: plot feature importances or print coefficients. What does the model think matters most about your problem?	cross_val_score(cv=5) · model comparison table · feature_importances_ · .coef_
4.5	Tuning Your Champion Model	The best model from the comparison gets hyperparameter tuning, the final optimisation step before real-world evaluation. Wrapping GridSearchCV around a Pipeline is the correct approach: the preprocessing is refitted inside every CV fold, which prevents data leakage, and the search returns the best configuration in the grid. The final test set is evaluated EXACTLY ONCE. This is a sacred rule. Peeking at test data during tuning inflates your score in the same way overfitting does.	🎛️ Final Tuning Sprint: GridSearchCV or RandomizedSearchCV on the best model, with all preprocessing inside the Pipeline. Print best_params_ and the improvement over defaults. Retrain on the full training set with optimal params. Evaluate ONCE on the test set. Record this as your official final score. Plot the final confusion matrix (classification) or residual plot (regression).	GridSearchCV(Pipeline(steps), param_grid, cv=5) · best_params_ · one-shot test evaluation
4.6	Visualising Your Results: Telling the Data Story	Models that can't be communicated don't get used. Professional visualisations bridge the gap between model outputs and business decisions. Every chart must answer a question: the confusion matrix answers 'what types of errors are we making?', the ROC curve answers 'how does the model perform at different thresholds?', the feature importance chart answers 'what is driving the predictions?'	🎨 Visualisation Sprint: Produce all charts for the final model: confusion matrix with percentages, ROC curve with AUC annotation, feature importance bar chart, actual vs predicted scatter (regression). For EACH chart: 2 sentences below it explaining what it means for the real-world problem. Combine into a 3×2 dashboard subplot.	ConfusionMatrixDisplay · RocCurveDisplay · permutation_importance · plt.subplots(3, 2)
4.7	Your ML Report & Presentation: Communicate Like a Data Scientist	The ability to communicate findings clearly is what separates data scientists who get promoted from those who stay in Jupyter notebooks forever. Structure: Problem Statement → Data → Approach → Results → Insights → Limitations → Next Steps. The 6-minute presentation format mirrors how data science teams present at sprint reviews, all-hands, and board meetings.	📝 Report & Rehearsal: 2-page ML project report with every decision justified. Rehearse the 6-minute presentation narrating each chart: 'This confusion matrix shows X, which means Y, and the practical implication for [the business/patient/school] is Z'. Peer swap: can a classmate reproduce your results from your report alone?	ML report structure · chart narration · decision justification · limitations · 6-minute format
4.8	CAPSTONE DEMO DAY: Present Your ML System to the World	Demo Day is how ML projects are showcased at tech companies, startups, and research labs. Present your complete ML pipeline as a coherent product story: what problem did you solve, for whom, with what data, using which approach, achieving what results, and what would you do next with more time and resources?	🎤 Capstone Demo: 6-minute live presentation with your visualisation dashboard + 3-minute Q&A. Class votes for: Most Impactful Problem, Most Rigorous Methodology, Best Visualisation. Assessed on EDA depth, leakage-free pipeline, model rigour, metric interpretation, chart quality, report clarity, and presentation confidence. Certificate of completion awarded.	Full course: sklearn · EDA · Pipeline · supervised + unsupervised · evaluation · presentation
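
The tuning rule from Lesson 4.5 (search over a Pipeline, then score the test set once) might look like the sketch below. The dataset (scikit-learn's built-in breast cancer data), the LogisticRegression model, and the small grid over `C` are assumptions chosen to keep the example fast; a capstone project would substitute its own pipeline and grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out the test set first; nothing below touches it until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler lives inside the pipeline, so it is refitted on the
# training part of every CV fold and never sees validation rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: "clf__C" targets the C parameter of the "clf" step.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)  # by default, refits the best config on all of X_train

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")

# One-shot evaluation: this line should run exactly once per project.
final_test_score = search.score(X_test, y_test)
print(f"final test accuracy: {final_test_score:.3f}")
```

The search only guarantees the best setting among the values listed in `param_grid`, so the grid itself is a modelling decision worth justifying in the report.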

Capstone Project Tracks

Choose your challenge for the final module

Module 4

🎓

Student Performance Prediction

Build: Classify students as pass/fail from attendance, study habits, and demographics using LogisticRegression and RandomForest. Includes a bias audit: does the model perform equally for all demographic groups?

Real-World Impact: Used by universities to identify at-risk students before it's too late

🛒

Customer Segmentation Analysis

Build: Unsupervised K-Means and DBSCAN clustering on retail customer data. Profile each segment and write a marketing strategy recommendation per cluster.

Real-World Impact: Used by Amazon, ASOS, Zalando to personalise campaigns

⚽

Sports Performance Prediction

Build: Regression model predicting player ratings, goal output, or match win probability from match statistics. Feature importance reveals what actually determines performance.

Real-World Impact: Used by Premier League clubs and DraftKings for squad selection and betting odds

❤️

Health Risk Predictor

Build: Binary classification predicting diabetes or heart disease risk from clinical features (Pima Indians or Cleveland Heart Disease dataset). Includes recall-optimised threshold tuning.

Real-World Impact: Used by NHS and insurance companies for early intervention programmes

Teaching & Delivery Notes

🕐 Pacing

8 lessons per module at 55–65 min each (~32 hrs). Lessons 1.4 (preprocessing pipeline) and 3.7 (data leakage) are the two hardest — plan for live coding with deliberate mistakes. Module 4 = project sprints; mandatory teacher checkpoints after 4.2 (EDA), 4.4 (model comparison), and 4.6 (visualisations).

✅ Assessment Rubric

(1) EDA Quality — insights actionable, directly inform decisions. (2) Pipeline Correctness — no leakage, domain-appropriate imputation. (3) Model Rigour — ≥3 models, cross-validated. (4) Tuning & Final Eval — GridSearchCV applied, test set evaluated once. (5) Visualisation — labelled, interpreted in plain English. (6) Report & Presentation — every decision justified.

📚 Prior Knowledge Required

Confident with: Python functions, loops, conditionals, list/dict comprehensions; pandas (read_csv, filter, groupby); matplotlib/seaborn charts; NumPy arrays; basic statistics (mean, std, correlation). Students from Python Programming (Intermediate) or Introduction to Data Analysis are ideally prepared.

🚀 Stretch Activities

Advanced students: XGBoost/LightGBM, SHAP values for model explainability, SMOTE for class imbalance, SVM, MLPClassifier (intro neural net), or Flask model deployment as a REST API. These are genuine next-step industry tools.

💻 Tools & Environment

JupyterLab (Anaconda) or Google Colab (free GPU). Libraries: scikit-learn, pandas, numpy, matplotlib, seaborn. Optional: xgboost, shap (pip install). Python 3.9+. Datasets: Kaggle, UCI ML Repository, or sklearn built-in datasets. Students working in Colab need only a Google account — zero local setup.
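
A quick way to confirm that a notebook environment matches the list above is sketched here. The helper name `is_installed` is hypothetical, and the module names are the import names of the listed libraries (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Import names of the libraries listed above.
REQUIRED = ["sklearn", "pandas", "numpy", "matplotlib", "seaborn"]
OPTIONAL = ["xgboost", "shap"]


def is_installed(module_name: str) -> bool:
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


python_ok = sys.version_info >= (3, 9)
status = {name: is_installed(name) for name in REQUIRED + OPTIONAL}
missing_required = [name for name in REQUIRED if not status[name]]

print("Python 3.9+:", "yes" if python_ok else "no")
for name, found in status.items():
    kind = "required" if name in REQUIRED else "optional"
    print(f"{name:<12} {kind:<9} {'found' if found else 'missing'}")
if missing_required:
    print("Install before Lesson 1.1:", ", ".join(missing_required))
```

Running this in the first cell of a Colab or JupyterLab notebook gives students a checklist before any lesson code is attempted.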

🌐 University & Career Pathways

Graduates are prepared for: university ML modules (many CS degrees introduce ML fundamentals by Year 2), Kaggle competitions (beginner to intermediate), internships at companies using Python-based ML stacks, and the 'Introduction to ML' section of Google's Machine Learning Crash Course.