- Overview
- Prerequisites
- Audience
- Curriculum
Description:
Real-time data ingestion and analysis is becoming a vital need for enterprises, driving them to build data lakes that can handle high-velocity data. In this course, we explore the Big Data ecosystem and deep-dive into real-time processing. Open-source technologies like Apache Kafka and Apache Spark are used to build an end-to-end solution for real-time streaming applications.
Long Description:
Elevate your enterprise's data capabilities with our comprehensive training course on real-time data ingestion and analysis. As high-velocity data becomes increasingly vital, we'll guide you in building a robust data lake to meet these demands. Dive deep into the Big Data ecosystem and focus on real-time processing, leveraging open-source tools like Apache Kafka and Spark. This course offers a holistic view of the Big Data landscape and its applications in real-time streaming. We'll explore alternatives to Apache Kafka and Spark Streaming, empowering you with a distributed architecture to set up, ingest, and process real-time data. Join us to unlock the potential of real-time data solutions.
Course Code/Duration:
BDT31 / 3 Days
Learning Objectives:
After this course, you will be able to:
- Have a broad understanding of the Big Data ecosystem.
- Understand the differences between batch and real-time streaming scenarios.
- Understand how to use a distributed, clustered architecture to implement a real-time streaming system.
- Identify which technologies to apply for a specific use case.
- Explain the technical and business drivers behind adopting a streaming system.
- Understand the architecture and design of Apache Kafka.
- Compare Apache Kafka to alternatives such as Flume, Storm, and Amazon Kinesis.
- Understand the Big Data ecosystem before and after Apache Spark.
- Understand the Apache Spark processing framework and its distributed architecture.
- Install and set up a Big Data cluster.
- Perform hands-on activities using Twitter data.
Prerequisites:
- Familiarity with Java/Scala
- Familiarity with Big Data applications
- Working knowledge of Spark is a plus
Audience:
- Data Analysts, Software Engineers, Data Engineers, Data Professionals, Business Intelligence Developers, Data Architects
Course Outline:
Day 1
- Course Introduction
- History and background of Big Data
- Advantages of Distributed Architecture
- The Big Data ecosystem before Apache Spark
- The Big Data ecosystem after Apache Spark
- Spark data structures: RDDs, DataFrames, Datasets (see the first sketch after this day's outline)
- Primer on Spark libraries:
  - Spark SQL
  - Spark MLlib
  - Spark Streaming
  - Spark GraphX
  - Spark Deep Learning
- Writing Spark applications using the Spark APIs
- Spark Streaming
- Structured Streaming (see the second sketch after this day's outline)
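To ground the data-structure discussion, here is a minimal Scala sketch contrasting the three Spark abstractions covered on Day 1. The `Tweet` case class, app name, and local master are illustrative assumptions, not part of the course materials:

```scala
import org.apache.spark.sql.SparkSession

object SparkDataStructures {
  // Hypothetical case class for the Dataset example
  case class Tweet(user: String, text: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataStructuresDemo")
      .master("local[*]") // local mode for the classroom; use a cluster master in production
      .getOrCreate()
    import spark.implicits._

    // RDD: the low-level distributed collection, manipulated with functional operators
    val rdd = spark.sparkContext.parallelize(Seq(("alice", "hello spark"), ("bob", "hello kafka")))
    val wordCounts = rdd.flatMap { case (_, text) => text.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.collect().foreach(println)

    // DataFrame: schema-aware rows, optimized by the Catalyst query planner
    val df = rdd.toDF("user", "text")
    df.groupBy("user").count().show()

    // Dataset: typed API combining RDD-style type safety with DataFrame optimization
    val ds = df.as[Tweet]
    ds.filter(_.text.contains("spark")).show()

    spark.stop()
  }
}
```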
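And a minimal Structured Streaming sketch showing how the batch DataFrame API carries over to streams. It assumes a local socket source fed by `nc -lk 9999`; the host, port, and app name are placeholder choices:

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read a stream of lines from a local socket (run `nc -lk 9999` to feed it)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same DataFrame API as batch: split lines into words and count them
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously print the updated counts to the console
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```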
Day 2
- Data Ingestion systems for structured and unstructured data
- Kafka design & architecture
- Comparing Kafka to Flume, Storm, and Amazon Kinesis
- Getting Kafka up and running
- Using Kafka utilities
- Reading & writing to Kafka using the Java API (see the sketch after this day's outline)
- Labs: all of the above sections
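A minimal sketch of writing to and reading from Kafka with the Java client API, called here from Scala to match the rest of the course examples. The broker address `localhost:9092`, topic name `tweets`, and consumer group `course-demo` are placeholder assumptions:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaRoundTrip {
  val Topic = "tweets" // hypothetical topic name

  def main(args: Array[String]): Unit = {
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092") // placeholder broker
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Write a single record to the topic
    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(new ProducerRecord[String, String](Topic, "key-1", "hello kafka"))
    producer.close()

    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "course-demo") // placeholder consumer group
    consumerProps.put("auto.offset.reset", "earliest")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    // Read the record back and print it
    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(Collections.singletonList(Topic))
    val records = consumer.poll(Duration.ofSeconds(5))
    records.forEach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()
  }
}
```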
Day 3
- Implementing Spark and Kafka together
- Reading Kafka streams from Spark
- Saving streaming data from Spark into Cassandra (both shown in the sketch after this day's outline)
- Full end-to-end application
- Benchmarking
- Monitoring
- Tuning and Optimizing the system
- Labs: all of the above sections
- End-to-end Streaming project
- Next steps
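A minimal Scala sketch of the Day 3 pipeline: Spark Structured Streaming reads from Kafka and saves each micro-batch to Cassandra through the DataStax `spark-cassandra-connector` via `foreachBatch`. The broker address, topic, keyspace, and table names are placeholder assumptions, and the Kafka source and Cassandra connector packages are assumed to be on the classpath:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaToCassandra")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "localhost") // placeholder Cassandra host
      .getOrCreate()

    // Subscribe to a Kafka topic; Kafka rows arrive with binary key/value columns
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "tweets")                       // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS body")

    // Write each micro-batch to Cassandra using the connector's batch writer
    val saveToCassandra: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "demo", "table" -> "tweets")) // placeholder keyspace/table
        .mode("append")
        .save()

    stream.writeStream
      .foreachBatch(saveToCassandra)
      .start()
      .awaitTermination()
  }
}
```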
Structured Activity/Exercises/Case Studies:
Day 1
- Milestone 1 – Create account on Databricks Cloud
- Milestone 2 – Learn how to use Databricks Notebooks
- Milestone 3 – Spark RDD implementations
Day 2
- Milestone 4 – End-to-end project (Initiation)
- Milestone 5 – Kafka setup
- Milestone 6 – Kafka hands-on
Day 3
- Milestone 7 – Kafka with Structured Streaming
- Milestone 8 – End-to-end project (Completion)
Training material provided:
Yes (Digital format)