Byte-Sized ML Series: Tackling Imbalanced Classes with SMOTE
- Created By shambhvi
- Posted on May 1st, 2025
- Overview
- Prerequisites
- Audience
- Curriculum
Description:
Class imbalance is one of the most common and tricky problems in real-world classification tasks. In this hands-on 90-minute session, learners will explore why imbalanced classes degrade model performance and how to correct this using resampling techniques, especially SMOTE (Synthetic Minority Oversampling Technique). Learners will build a classification pipeline that includes oversampling, undersampling, and evaluation strategies to fairly assess model performance.
Duration: 90 mins
Course Code: BDT490
Learning Objectives:
After this course, you will be able to:
- Recognize when class imbalance is a problem in classification
- Understand the limitations of accuracy as a performance metric
- Use precision, recall, F1-score, and confusion matrix effectively
- Apply oversampling with SMOTE and undersampling techniques using `imblearn`
- Build and evaluate models with resampled data using scikit-learn
Must have some Python programming experience.
Beginner to intermediate ML learners and data practitioners working on classification problems where the target classes are imbalanced (e.g., fraud detection, medical diagnosis). Familiarity with classification models and basic model evaluation metrics is expected.
Course Outline:
- Understanding Class Imbalance
- What is class imbalance? Examples (fraud, churn, spam)
- The accuracy paradox: why accuracy can be misleading
- Intro to better metrics: precision, recall, F1, ROC-AUC
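The accuracy paradox above can be demonstrated in a few lines. This is a minimal sketch (the 95/5 class ratio is an illustrative assumption): a "model" that always predicts the majority class scores 95% accuracy while catching zero minority cases.

```python
# Sketch of the accuracy paradox: always predicting the majority class
# yields high accuracy but zero recall on the minority class.
# The 950/50 class split is an illustrative assumption.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = np.array([0] * 950 + [1] * 50)  # 95/5 imbalance
y_pred = np.zeros_like(y_true)           # "always predict the majority class"

print(accuracy_score(y_true, y_pred))                   # 0.95 -- looks great
print(recall_score(y_true, y_pred))                     # 0.0  -- misses every positive
print(f1_score(y_true, y_pred, zero_division=0))        # 0.0
```

Precision, recall, and F1 expose what accuracy hides, which is why the session moves to these metrics first.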
- Evaluating Models on Imbalanced Data
- Load and explore an imbalanced dataset (e.g., credit card fraud or synthetic data)
- Train a baseline classifier (e.g., Logistic Regression or Random Forest)
- Evaluate using confusion matrix, classification report, and ROC curve
- Hands-on: Diagnose imbalance with metrics and visualizations
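The baseline-evaluation step above might look like the following sketch. It uses a synthetic imbalanced dataset from `make_classification` as a stand-in (the 95/5 class weights and logistic regression baseline are assumptions, not the course dataset).

```python
# Train a baseline classifier on an imbalanced synthetic dataset and
# inspect the confusion matrix and classification report.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced dataset (~95% negatives)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
```

The per-class rows of the classification report typically show a large gap between majority- and minority-class recall, which is the imbalance diagnosis this module practices.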
- Resampling Techniques Overview
- What is resampling?
- Undersampling vs. oversampling
- Risks: overfitting, data loss
- Intro to the `imblearn` library
- Using SMOTE to Oversample
- How SMOTE works: synthetic sample generation
- Code walkthrough using `SMOTE` from `imblearn.over_sampling`
- Apply SMOTE to training data only (not test!)
- Retrain and re-evaluate the model
- Hands-on: Compare metrics before and after SMOTE
- Combining Over + Under Sampling
- Balanced approach: SMOTEENN, SMOTETomek
- Code walkthrough using `SMOTEENN` and `SMOTETomek` from `imblearn.combine`
- Apply pipeline with combined sampling
- Hands-on: Compare performance against standalone SMOTE
Training material provided: Yes (Digital format)
Hands-on Lab: Instructions will be provided to install Jupyter Notebook and other required Python libraries. Students can opt to use Google Colaboratory if they do not want to install these tools locally.