Description:
Unlock the full potential of the Apache Spark platform with our advanced course, a follow-up to the introductory course 'Introduction to Apache Spark version 3.x.' In this comprehensive program, you'll delve deep into Spark, mastering the art of building unified big data applications that blend batch and interactive analytics across all your data. Developers will gain the ability to create complex parallel applications that lead to quicker, better-informed decisions and real-time actions. The course also provides in-depth coverage of performance optimization and debugging techniques, as well as Spark Machine Learning. Elevate your data analytics skills with this essential training.
Course Code/Duration:
BDT168 / 2 Days, Lectures and Labs (Virtual)
Learning Objectives:
In this course, participants will:
- Master core Apache Spark 3.x APIs and fundamental platform mechanisms.
- Utilize PySpark, Python, and SQL to access and transform data effectively.
- Gain proficiency in working with essential tools such as Jupyter Notebooks and Anaconda.
- Learn to handle data in Apache Parquet format for comprehensive data analysis (a minimal sketch combining these pieces follows this list).
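As a rough illustration of how the PySpark DataFrame API, SQL, and Parquet fit together, the minimal sketch below reads a Parquet file and queries it both ways. The file path and the column names `region` and `amount` are invented placeholders, not course data.

```python
# Minimal PySpark sketch: read a Parquet file and query it with both the
# DataFrame API and SQL. Path and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sql-demo").getOrCreate()

# Load a Parquet dataset into a DataFrame (hypothetical path)
sales = spark.read.parquet("/data/sales.parquet")

# Register the DataFrame as a temporary view so it can be queried with SQL
sales.createOrReplaceTempView("sales")

# Same data, two access styles
sales.groupBy("region").count().show()
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()
```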
Prerequisites:
- Introduction to Apache Spark version 3.x.
Audience:
- This course is designed for developers and data analysts.
Course Outline:
Day 1: Spark Libraries
- Spark DataFrames (continued)
  - Lookup tables and joins with Spark DataFrames
  - Understanding data partitioning of DataFrames
  - Data transformations with pipelines
- Understanding Machine Learning Use Cases and Techniques
  - Understand what machine learning is
  - Machine learning development vs. traditional software development
  - The importance of data in machine learning
- Machine Learning Development
  - Learn about the steps involved in machine learning development
  - Understand how a machine learns
  - Machine learning algorithms and tools supported in the Spark ML library
- Building Classification and Regression Models
  - Building regression models and evaluating model performance
  - Building classification models and evaluating model performance
  - Understanding feature engineering
  - Model persistence
- Multiple Hands-on Labs (a minimal PySpark sketch follows this outline)
  - Using PySpark on Spark DataFrames
  - Data transformations on DataFrames
- Extra Credits
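To show how several Day 1 topics (joins, pipelines, classification, evaluation, and model persistence) connect in code, here is a minimal, hedged PySpark sketch. The toy data, column names, and save path are invented for illustration and are not the course lab material.

```python
# Minimal Day 1 sketch: a DataFrame join feeding a small Spark ML
# classification pipeline, with evaluation and model persistence.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("day1-demo").getOrCreate()

# Two small DataFrames joined on a key column (lookup-table style join)
customers = spark.createDataFrame(
    [(1, 34.0), (2, 51.0), (3, 27.0), (4, 45.0)], ["id", "age"])
labels = spark.createDataFrame(
    [(1, 0.0), (2, 1.0), (3, 0.0), (4, 1.0)], ["id", "label"])
data = customers.join(labels, on="id")

# Feature engineering and model training expressed as a single pipeline
assembler = VectorAssembler(inputCols=["age"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(data)
predictions = model.transform(data)

# Evaluate with area under the ROC curve
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("AUC:", evaluator.evaluate(predictions))

# Model persistence: save the fitted pipeline (hypothetical path)
model.write().overwrite().save("/tmp/day1_model")
```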
Day 2: Machine Learning Library and Streaming
- Data Clustering with the Spark Machine Learning Library
  - Perform data clustering using the Spark ML library
  - Understand how to find the optimal number of clusters for a dataset
- Streaming Data
  - Understand what streaming data is
  - Design challenges with streaming data
- Structured Streaming with Spark
  - History of Spark streaming libraries from Spark 1.x to Spark 3.x
  - Enhancements to the Structured Streaming library
  - Learn about streaming output modes
  - Perform aggregations on streaming data
- Spark ML Library and Streaming Library
  - Build a machine learning model and persist it
  - Load the model and use it to make predictions on streaming data
- Multiple Hands-on Labs
  - Multiple hands-on sessions on the above topics (a minimal streaming sketch follows this outline)
- Extra Credits Session
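To make the streaming topics above concrete, here is a minimal, hedged Structured Streaming sketch: a windowed aggregation over Spark's built-in rate source, written to the console sink. The source, window size, and run time are illustrative choices, not the course lab setup.

```python
# Minimal Day 2 sketch: a windowed aggregation on a streaming DataFrame,
# emitted in "complete" output mode to the console.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("day2-streaming-demo").getOrCreate()

# The built-in rate source generates (timestamp, value) rows, handy for demos
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Windowed aggregation on the streaming DataFrame
counts = (stream
          .groupBy(F.window("timestamp", "10 seconds"))
          .agg(F.count("value").alias("events")))

# "complete" output mode re-emits the full aggregated result on each trigger
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination(30)  # let the demo run for ~30 seconds
query.stop()
spark.stop()

# A persisted Spark ML model can be applied to a streaming DataFrame the same
# way as to a batch one, for example:
#   from pyspark.ml import PipelineModel
#   model = PipelineModel.load("/tmp/day1_model")          # hypothetical path
#   scored = model.transform(stream_with_matching_columns)  # hypothetical input
```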
Lab Environment:
- Each student will be provided with a virtual machine for performing the hands-on labs and is expected to use this machine in class
- These machines will be configured with a Spark 3.x release
- Instructions will be provided for students who want to set up the environment on their own machines (there will be no support for debugging such environments)
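For reference, a quick sanity check of a local PySpark 3.x installation might look like the sketch below. This is not part of the official setup instructions; the class virtual machines come pre-configured and need no setup.

```python
# Quick sanity check for a local PySpark 3.x install (illustrative only)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("env-check")
         .getOrCreate())

print("Spark version:", spark.version)  # expect a 3.x version string
spark.range(5).show()                   # tiny DataFrame to confirm jobs run
spark.stop()
```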