Description:
This hands-on training introduces developers and data engineers to Apache Beam, a unified programming model for building portable and scalable data processing pipelines, with deployment on Google Cloud Dataflow. Participants begin with the foundations of the Beam model, including its architecture, execution flow, and key abstractions, then move on to core and composite transforms such as ParDo, Map, Filter, and CoGroupByKey.
The course covers both batch and streaming use cases, showing how to connect Beam with Google Cloud Pub/Sub, set up streaming projects on GCP, and run real-time pipelines on Google Cloud Dataflow. Learners also explore how Beam handles type safety and data encoding, and work with advanced pipeline features such as side inputs and multiple outputs. A real-world case study on identifying defaulter customers reinforces the material through hands-on application.
Duration: 3 Days
Course Code: BDT 508
Learning Objectives:
By the end of this course, participants will be able to:
- Understand Apache Beam’s architecture and unified data processing model.
- Build batch and streaming pipelines using Beam's core and advanced transforms.
- Use Google Cloud Pub/Sub and Dataflow to run scalable real-time pipelines.
- Apply data encoding, type hints, and coders in Beam pipelines.
- Design modular, reusable pipelines using composite transforms and joins.
This course is ideal for:
- Data Engineers and Streaming Pipeline Developers
- Google Cloud Developers and Architects
- Engineers migrating from Spark, Flink, or Airflow
- Developers building real-time analytics and ETL workflows
Prerequisites:
- Basic programming knowledge (Python or Java)
- Familiarity with cloud concepts (preferably Google Cloud)
- Some experience with data processing frameworks (e.g., Spark, Hadoop) is helpful
Course Outline:
Module 1: Introduction to Apache Beam
- Evolution of Big Data Frameworks
- Overview and use cases of Apache Beam
- Apache Beam Architecture and SDKs
- Beam’s portable and unified programming model
Module 2: Beam Setup and Basic Concepts
- Key abstractions: PCollection, PTransform, Pipeline
- Installing Beam and setting up dev environment
- Building your first Beam pipeline (local runner)
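A minimal sketch of the kind of first pipeline built in this module, using the Beam Python SDK; with no runner specified it executes on the local DirectRunner, and the element values are purely illustrative.

    import apache_beam as beam

    # With no runner specified, the pipeline runs on the local DirectRunner.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateInput" >> beam.Create(["hello", "apache", "beam"])  # in-memory PCollection
            | "ToUpper" >> beam.Map(str.upper)                           # element-wise PTransform
            | "Print" >> beam.Map(print)                                 # inspect results locally
        )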
Module 3: Working with Beam Transforms
- Structure of a Beam pipeline
- Input transforms: Read, Create
- Output transforms: Write to files, databases
- Core transforms: Map, FlatMap, Filter
- Hands-on: Basic read-transform-write pipeline
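A minimal read-transform-write sketch along the lines of this hands-on exercise; the input and output paths are placeholders for the lab environment.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("input.txt")                 # one element per line
            | "SplitWords" >> beam.FlatMap(lambda line: line.split())     # 0..n outputs per element
            | "KeepLongWords" >> beam.Filter(lambda word: len(word) > 3)  # keep matching elements
            | "Write" >> beam.io.WriteToText("output", file_name_suffix=".txt")
        )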
Module 4: Pipeline Logic and Advanced Transforms
- ParDo, DoFn and branching logic
- Composite transforms: abstraction and reuse
- Aggregations: Combine, CombinePerKey
- Joins with CoGroupByKey
- Hands-on: Branching + CoGroupByKey joins
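A minimal sketch of a CoGroupByKey join as practiced in this module; the keyed customer and order data are invented for illustration.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        names = pipeline | "Names" >> beam.Create([("c1", "Asha"), ("c2", "Ravi")])
        orders = pipeline | "Orders" >> beam.Create([("c1", 250), ("c1", 90), ("c2", 40)])

        # CoGroupByKey joins the two PCollections by key into a dict of iterables per key.
        (
            {"names": names, "orders": orders}
            | "Join" >> beam.CoGroupByKey()
            | "Format" >> beam.Map(
                lambda kv: f"{kv[0]}: {list(kv[1]['names'])} spent {sum(kv[1]['orders'])}")
            | "Print" >> beam.Map(print)
        )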
Module 5: Side Inputs and Outputs
- Working with side inputs for auxiliary data
- Creating multiple outputs from one transform
- Hands-on: Multi-output transformation and filtering
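A minimal sketch combining a side input with tagged multiple outputs, roughly what this hands-on exercise covers; the threshold and amounts are invented values.

    import apache_beam as beam
    from apache_beam import pvalue

    class SplitByThreshold(beam.DoFn):
        def process(self, amount, threshold):
            # Route each element to one of two tagged outputs.
            if amount >= threshold:
                yield pvalue.TaggedOutput("high", amount)
            else:
                yield pvalue.TaggedOutput("low", amount)

    with beam.Pipeline() as pipeline:
        amounts = pipeline | "Amounts" >> beam.Create([10, 75, 120, 5])
        threshold = pipeline | "Threshold" >> beam.Create([50])

        results = amounts | "Split" >> beam.ParDo(
            SplitByThreshold(),
            threshold=pvalue.AsSingleton(threshold),  # auxiliary data passed as a side input
        ).with_outputs("high", "low")

        results.high | "PrintHigh" >> beam.Map(lambda x: print("high:", x))
        results.low | "PrintLow" >> beam.Map(lambda x: print("low:", x))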
Module 6: Case Study – Identifying Bank Defaulters
- Understanding credit card and loan data
- Creating modular pipelines for different defaulter types
- Hands-on: Implement pipeline to flag defaulters
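A highly simplified sketch of one branch of the case-study pipeline, assuming a hypothetical customer_id,missed_payments CSV layout; the real datasets and defaulter rules are supplied in class.

    import apache_beam as beam

    def parse(line):
        # Hypothetical layout: customer_id,missed_payments
        customer_id, missed = line.split(",")
        return customer_id, int(missed)

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("loan_payments.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse)
            | "FlagDefaulters" >> beam.Filter(lambda kv: kv[1] >= 3)  # e.g. three or more missed payments
            | "Write" >> beam.io.WriteToText("loan_defaulters", file_name_suffix=".txt")
        )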
Module 7: Type Hints and Coders in Beam
- What is type safety in data pipelines?
- Using Coder class for serialization
- Type hints in Beam and how Beam ensures type safety
- Hands-on: Using type hints and coders in a custom pipeline
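A sketch of declaring type hints and registering a custom coder, using an invented Customer type; Beam's built-in coders already handle most common types.

    import apache_beam as beam

    class Customer:
        def __init__(self, customer_id, name):
            self.customer_id = customer_id
            self.name = name

    class CustomerCoder(beam.coders.Coder):
        def encode(self, customer):
            return f"{customer.customer_id},{customer.name}".encode("utf-8")

        def decode(self, encoded):
            customer_id, name = encoded.decode("utf-8").split(",", 1)
            return Customer(customer_id, name)

        def is_deterministic(self):
            return True

    # Tell Beam which coder to use for PCollections of Customer objects.
    beam.coders.registry.register_coder(Customer, CustomerCoder)

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create([("c1", "Asha"), ("c2", "Ravi")])
            # Type hints let Beam check element types and choose the right coder.
            | "MakeCustomers" >> beam.Map(lambda kv: Customer(*kv)).with_output_types(Customer)
            | "Names" >> beam.Map(lambda c: c.name).with_input_types(Customer).with_output_types(str)
            | "Print" >> beam.Map(print)
        )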
Module 8: Introduction to Streaming in Beam
- Event-driven processing vs. batch workflows
- Pub/Sub architecture and flow
- Windowing and watermarking (intro only, optional deep dive)
- Hands-on: Pub/Sub demo with sample topic/stream
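A streaming sketch in the spirit of this demo, assuming a placeholder Pub/Sub topic and a GCP project with credentials configured; the 60-second fixed window is only an illustration of the windowing intro.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.transforms.window import FixedWindows

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # unbounded (streaming) mode

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Topic path is a placeholder for the course project.
            | "ReadMessages" >> beam.io.ReadFromPubSub(topic="projects/YOUR_PROJECT/topics/demo-topic")
            | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
            | "Window" >> beam.WindowInto(FixedWindows(60))        # 60-second fixed windows
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "CountPerWindow" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )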
Module 9: Apache Beam with Google Cloud Dataflow
- Connecting Beam pipelines to GCP
- Creating and configuring Pub/Sub topics
- Running batch and streaming jobs on Dataflow
- Hands-on: Run a streaming pipeline from Pub/Sub to GCS or BigQuery
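A sketch of submitting a streaming Pub/Sub-to-BigQuery job to Dataflow; every project, region, bucket, subscription, and table name here is a placeholder, and the single-column schema is purely illustrative.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # All resource names below are placeholders.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="YOUR_PROJECT",
        region="us-central1",
        temp_location="gs://YOUR_BUCKET/temp",
        streaming=True,
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/YOUR_PROJECT/subscriptions/demo-sub")
            | "ToRow" >> beam.Map(lambda msg: {"raw": msg.decode("utf-8")})
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "YOUR_PROJECT:demo_dataset.demo_table",
                schema="raw:STRING",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )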
Training Material Provided:
- Course slides and reference guides



