Description:
This course will teach participants how to use Apache Hadoop and Apache Spark to solve sophisticated data science problems, producing valuable insights in a wide range of scenarios.
Day one focuses on data science basics, including data acquisition, scrubbing and manipulation, along with a general overview of data science applications and the analytics and machine learning processes typically employed. A number of practical use cases are examined during class and lab sessions.
Day two focuses on Apache Hadoop and its ecosystem along with the types of data science applications typically handled by the Hadoop platform. The course outlines the statistical methods used to produce actionable business insights with MapReduce, Python, Hive and other tools.
Day three begins with an overview of the Apache Spark platform and its machine learning library, MLlib.
Participants will learn how to perform entity ranking, implement recommendation engines and carry out other common data science tasks using Spark's batch, streaming, graph and machine learning capabilities.
Course Code/Duration:
BDT62 / 3 Days
Learning Objectives:
In this course, participants will:
- Have a clear understanding of data science, its typical use cases and how data science is performed using a range of tools in the Apache open source ecosystem.
Prerequisites:
- Python programming basics. Each participant will need to be able to run a 64-bit virtual machine (provided with the course).
Audience:
- This course is designed for application developers, analysts and data scientists.
Course Outline:
Day 1
- Data Science
- Data Science Process Overview
- Structured and Unstructured Data
- Data Acquisition and Transformation
- Data Analysis and Machine Learning
- Machine Learning Concepts
Day 2
- Big Data overview
- A brief history of Big Data
- History and background of Big Data and Hadoop
- 5 V’s of Big Data
- Secret Sauce of Big Data Hadoop
- Big Data Distributions in Industry
- End-to-End Big Data Life cycle overview
- Demos and Labs
- Big Data Ecosystem before Spark
- Big Data Ecosystem before Apache Spark
- Storage options – HDFS and NoSQL
- Processing options – MapReduce, Hive etc.
- Administrative tools – ZooKeeper, Oozie etc.
- Ingestion tools – Sqoop, Flume
- Demos and Labs
Day 3
- Getting Started with Apache Spark
- Introduction to Spark RDD
- Spark RDD Transformation and Actions
- Spark Lifecycle
- Spark Caching
- Set Up an Account on Databricks Cloud for Apache Spark
- Databricks Notebooks overview
- Lab – Spark RDD Transformation & Actions
- Lab – Spark RDD Advanced Transformation & Actions
- Demos and Labs
- Apache Spark SQL, DataFrames, Datasets
- Introduction to Spark SQL
- SQL, DataFrames and Datasets Spark Library
- Compare the various APIs – RDD, DataFrames and Datasets
- Demos and Labs
- Machine Learning using Apache Spark
- Introduction to Machine Learning and Data Science
- Machine Learning Spark Library
- Spark Machine Learning examples
- Demos and Labs
- Streaming using Apache Spark
- Need for real-time processing
- Streaming Spark Library
- Spark Streaming examples
- Demos and Labs
Training material provided:
Yes (Digital format)