Session Description:
This course is an introduction to Apache Spark version 3.x. It covers the core APIs for using Spark and the fundamental mechanisms of the platform. Students will use PySpark, Python, and SQL to access and transform data. We also cover Jupyter Notebooks, Anaconda, and Apache Parquet.
Course Code/Duration:
BDT167 / 2 Days of Lectures/Labs (Virtual)
Pre-requisite:
- General knowledge of data stores and a working knowledge of the Python language.
Audience:
- This course is designed for developers and data analysts.
Learning Objectives:
In this course, participants will:
- Review Hadoop architecture
- Understand the fundamental architecture of Spark
- Use the core Spark APIs to operate on data
- Understand the Spark Ecosystem
- Use PySpark and Python
- Create RDDs
- Create DataFrames
- Use RDDs and DataFrames
- Use SQL on DataFrames (sketched after this list)
- Use Jupyter Notebooks
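A minimal sketch of the core APIs listed above, assuming a local Spark 3.x installation with PySpark available; the data, names, and app name are illustrative only, not course lab material:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (entry point for the DataFrame and SQL APIs)
spark = SparkSession.builder.appName("bdt167-sketch").getOrCreate()

# Create an RDD from a Python collection and operate on it
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
print(rdd.map(lambda pair: pair[1]).sum())  # 79

# Create a DataFrame from the same data
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age > 40).show()

# Use SQL on the DataFrame through a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```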
Course Outline:
Day 1 Foundations
- Overview
- Lab Environment
- PySpark
- Spark Data
Day 2 Programming Techniques
- Spark Programming
- RDDs
- DataFrames
- Jupyter Notebook
Detailed Course Topics
Day 1 Foundations
- Big Data and Its History
- Big Data & Hadoop Deployment
- Hadoop
- Architecture
- Understanding Storage and Processing with Hadoop
- MapReduce and its limitations
- The Hadoop ecosystem
- Spark
- Why Spark?
- Spark Ecosystem
- Spark Architecture
- Spark Cluster
- Understanding RDD and its value
- Spark DataFrames and Spark SQL
- Understanding PySpark
- Multiple Hands-on Labs
- Using Jupyter Notebook
- The Spark command line
- RDDs and DataFrames using PySpark (see the sketch after this outline)
- Assignment
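The Day 1 hands-on work pairs Jupyter and the Spark command line with RDDs and DataFrames. As one hedged illustration of this kind of exercise, the sketch below round-trips a small DataFrame through Apache Parquet, the columnar format named in the session description; the file path is a hypothetical placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

# Build a small DataFrame; the rows are illustrative only
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

# Write it out as Parquet, then read it back (path is a placeholder)
df.write.mode("overwrite").parquet("/tmp/people.parquet")
people = spark.read.parquet("/tmp/people.parquet")

people.printSchema()  # the schema survives the round-trip
people.show()

spark.stop()
```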
Day 2 Programming Techniques
- RDD and DataFrames programming
- Understanding Spark internals when using RDD and DataFrames
- Schema definition
- Partitioning
- User-Defined Functions
- Handling Corrupt Records
- Basic ETL task using PySpark (sketched after this outline)
- Handling data transformations for downstream processing
- Multiple Hands-on Labs
- Hands-on Jupyter Notebook exercises
- Assignment Review
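A hedged sketch tying the Day 2 topics together as a basic ETL task: an explicit schema definition, permissive handling of corrupt records, a user-defined function, and repartitioning for downstream processing. The CSV path, column names, and function are illustrative assumptions, not the course's lab code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Explicit schema definition, with a column to capture malformed rows
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True),
])

# PERMISSIVE mode keeps malformed rows and routes their raw text into
# the _corrupt_record column instead of failing the job
df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/tmp/people.csv"))  # hypothetical input path

# A simple user-defined function applied as a column expression
upper_name = udf(lambda s: s.upper() if s else None, StringType())
df = df.withColumn("name_upper", upper_name(df.name))

# Repartition before writing for downstream processing
df.repartition(4).write.mode("overwrite").parquet("/tmp/people_clean")

spark.stop()
```

PERMISSIVE is Spark's default read mode; note that the corrupt-record column must appear in the user-supplied schema for the captured raw text to show up.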
- Lab Environment
- Each student will be provided a virtual machine for performing the hands-on labs; students are expected to use these machines in class
- These machines will be configured with a Spark 3.x release
- Instructions will be provided to students for setting up the environment on their own machines (there will be no support for debugging their environments)
Training material provided: Yes (Digital format)