- Overview
- Prerequisites
- Audience
- Curriculum
Description:
"Join our 10-week Data Engineer Bootcamp to master essential data engineering skills. Learn to build robust Data Pipelines using SQL, Python, Spark SQL, and Pyspark, and efficiently execute them on various clusters. Our Agile Scrum Methodology-driven workshop covers SQL Fundamentals, data engineering principles, Linux, Python, and more. Dive into Big Data Technologies with Apache Spark, explore databases, including MongoDB, and gain cloud expertise with GCP. Acquire hands-on experience in DevOps, Docker, Kubernetes, and CI/CD, preparing you for a rewarding career in data engineering. Unlock a world of data possibilities and become a skilled data engineer ready to tackle complex projects in this comprehensive Data Engineer Bootcamp."
Course Code/Duration:
BDT245 / 10 Weeks
Learning Objectives:
- Understand Agile Scrum methodology for efficient project management.
- Master SQL, covering query writing, table operations, and functions.
- Develop proficiency in Python programming, focusing on data manipulation.
- Gain expertise in data engineering, including data preparation and pipeline building.
- Use PyTorch for creating and training artificial neural networks for classification.
- Set up a Python environment with Jupyter notebooks.
- Explore Python data structures, functions, modules, and OOP.
- Grasp big data concepts, Hadoop, and Apache Spark's role in data processing.
- Apply Apache Spark for data analysis, machine learning, and real-time streaming.
Prerequisites:
- Understanding of how computers work
- One or more years of technical experience
- Programming experience with Python and SQL is a plus.
Audience:
- Candidates with a Computer Science degree or equivalent experience, pursuing their first IT role with a focus on Data Engineering and Data Science.
Course Outline:
Agile Scrum Methodology
- Scrum Introduction
- Scrum Team
- Scrum Artifacts
- Sprint Increment
- Sprint planning
- Backlog
- Retrospective
- Project description and Case Study
- Practice exam and Knowledge check
- Certification (optional)
SQL
- SQL Fundamentals
- Writing SQL Queries
- Working with Tables and Indexes
- Predefined SQL functions
- Connecting Python to SQL
- Certification (optional)
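To ground these topics, here is a minimal sketch of querying from Python, using the standard-library sqlite3 module as a stand-in for whichever database the labs actually use; the table and column names are illustrative assumptions.

```python
# A minimal sketch: creating a table, inserting rows, and querying with
# predefined SQL functions. SQLite stands in for the course's database.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Data", 95000.0), ("Grace", "Data", 105000.0), ("Linus", "Ops", 88000.0)],
)

# Predefined SQL functions: UPPER, COUNT, AVG
cur.execute(
    "SELECT UPPER(department), COUNT(*), AVG(salary) "
    "FROM employees GROUP BY department"
)
for row in cur.fetchall():
    print(row)  # e.g. ('DATA', 2, 100000.0)

conn.close()
```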
Data Engineering Principles
- Data engineering to prepare data for downstream needs
- Build pipelines for batch processing and streaming processing
- Understanding different types of data
- Using PyTorch for Artificial Neural Networks – Classification
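As a taste of the PyTorch topic above, the following hedged sketch trains a tiny feed-forward classifier on random stand-in data; the layer sizes and data are illustrative assumptions, not the course's lab material.

```python
# A minimal PyTorch classification sketch: a small feed-forward network
# trained with cross-entropy loss on random stand-in data.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(100, 4)            # 100 samples, 4 features
y = torch.randint(0, 3, (100,))    # 3 classes

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

accuracy = (model(X).argmax(dim=1) == y).float().mean()
print(f"training accuracy: {accuracy:.2f}")
```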
Python Programming – Fundamentals
- Set Up
  - Set up the development environment – Jupyter notebooks
  - Using the Python shell
  - Executing Python scripts
- Understanding Python strings
- Print statements in Python
- Data Structures in Python
  - Integers
  - Lists
  - Dictionaries
  - Tuples
  - Sets
  - Files
  - Mutable and immutable structures
- Selection and Looping Constructs
  - If/else/elif statements
  - Boolean type
  - “in” membership
  - For loops
  - While loops
  - List and dictionary comprehensions
- Functions
  - Defining functions
  - Variable scope – local and global
  - Arguments
  - Polymorphism
- Modules
  - Creating modules
  - Importing modules
  - Different types of imports
  - Dir and help
  - Examining some built-in modules
- Classes & Exceptions
  - Object-Oriented Programming introduction
  - Classes and objects
  - Polymorphism – function and operator overloading
  - Inheritance
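A short sketch tying several of these topics together (the names and values are illustrative):

```python
# Data structures, a comprehension, a function, and a class hierarchy.
def describe(records):
    """Return a dict mapping each name to its squared score."""
    return {name: score ** 2 for name, score in records}  # dict comprehension

class Shape:
    def area(self):
        raise NotImplementedError

class Square(Shape):          # inheritance
    def __init__(self, side):
        self.side = side

    def area(self):           # polymorphism: overriding the parent method
        return self.side ** 2

records = [("a", 2), ("b", 3)]   # a list of tuples
print(describe(records))         # {'a': 4, 'b': 9}
print(Square(4).area())          # 16
```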
Big Data Overview
- History and background of Big Data and Hadoop
- 5 V’s of Big Data
- Big Data Distributions in Industry
- Big Data Ecosystem before Apache Spark
- Big Data Ecosystem after Apache Spark
- Comparison of MapReduce Vs Apache Spark
- Spark Clusters
Getting started with Apache Spark
- Understanding Apache Spark Components and Libraries
- Introduction to PySpark
- Explore using PySpark in the Databricks cloud environment
- PySpark code examples
- Working with Jupyter Notebook
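A minimal starting point, assuming a local PySpark installation rather than the Databricks environment used in class:

```python
# Create a SparkSession (the entry point to all Spark libraries)
# and a tiny DataFrame to confirm the installation works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("getting-started").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()

spark.stop()
```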
Working with Spark SQL
- Getting started with Spark SQL
- Spark Context and Spark Session
- Performing basic data transformations with Spark SQL CLI
- Managing Tables with Spark SQL
- Spark SQL functions
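A hedged sketch of these topics: registering a temporary view, querying it with plain SQL, and the equivalent DataFrame-API call; the table and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

sales = spark.createDataFrame(
    [("east", 100.0), ("east", 150.0), ("west", 90.0)], ["region", "amount"]
)
sales.createOrReplaceTempView("sales")

# Plain SQL against the temp view...
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

# ...or the equivalent expression with Spark SQL functions.
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

spark.stop()
```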
Apache Spark Data Structures – RDD
- Understanding fundamental data structure in Spark – RDD
- Understanding Lineage and Lazy Evaluation with RDDs
- Performing RDD transformations
- Performing RDD actions
- RDD persistence and caching
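A minimal illustration of these ideas on toy data: transformations are lazy (they only record lineage), and nothing runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # lazy

evens_squared.cache()             # persistence: keep results in memory
print(evens_squared.collect())    # action triggers execution: [0, 4, 16, 36, 64]
print(evens_squared.count())      # reuses the cached partitions: 5

spark.stop()
```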
Apache Spark Data Structures – DataFrames
- Understanding another data structure in Spark – DataFrames
- Reading different file formats into a DataFrame
- Creating and inferring DataFrame schemas
- Basic transformations on DataFrames
- Basic actions on DataFrames
- Applying functions such as filtering, group by, etc. on DataFrames
- Aggregations such as sum and mean on DataFrames
- Preparing data transformation pipelines
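A hedged sketch of a small DataFrame pipeline covering several of the bullets above; the CSV path and column names are placeholder assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

orders = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)      # let Spark infer column types
    .csv("orders.csv")                # placeholder path
)
orders.printSchema()

summary = (
    orders
    .filter(F.col("amount") > 0)                       # basic transformation
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total"), F.mean("amount").alias("mean"))
)
summary.show()                                         # action

spark.stop()
```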
Apache Spark Data Structures – DataFrames (Advanced)
- Handling corrupt records
- Working with DataFrame joins
- Understanding DataFrame do’s and don’ts
- Spark Lifecycle and the Spark UI
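A sketch of two of the advanced topics, tolerant reads and joins; the paths, schemas, and join keys are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-advanced").getOrCreate()

# PERMISSIVE mode keeps malformed rows instead of failing the whole read;
# bad input lands in the _corrupt_record column.
events = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("events.json")              # placeholder path
)

customers = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["id", "name"])
orders = spark.createDataFrame([(1, 250.0), (1, 80.0), (2, 40.0)], ["id", "amount"])

# Inner join on the shared key; an explicit join condition is a "do".
orders.join(customers, on="id", how="inner").show()

spark.stop()
```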
Machine Learning Overview
- History and Background of AI and ML
- Compare AI vs ML vs DL
- Describe Supervised and Unsupervised learning techniques and usages
- Machine Learning patterns
  - Classification
  - Clustering
  - Regression
- Gartner Hype Cycle for Emerging Technologies
- Machine Learning offerings in Industry
- Discuss Machine Learning use cases in different domains
- Understand the Data Science process to apply to ML use cases
- Understand the relation between Data Analysis and Data Science
- Identify the different roles needed for successful ML project
Spark Machine Learning Library
- Prepare machine learning data using pipelines – Data manipulation
- Building Classification models with Spark Machine Learning Library
- Building Regression models with Spark Machine Learning Library
- Clustering with Spark ML library
- Understanding model performance and metrics
- Data pipeline and model persistence
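A hedged sketch of a Spark ML classification pipeline on toy data: assemble features, fit a model, evaluate it, and persist it. The column names and the choice of logistic regression are illustrative, not the course's prescribed lab.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("spark-ml-demo").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.4, 0), (0.9, 0.1, 1)],
    ["f1", "f2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),  # data manipulation
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

predictions = model.transform(df)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"AUC on the toy data: {auc:.2f}")

model.write().overwrite().save("lr_pipeline_model")  # model persistence

spark.stop()
```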
Spark Streaming Library
- Streaming data and its challenges
- Understanding Spark Structured Streaming
- Working with Spark Streaming output modes
- Aggregations on streaming data
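A minimal Structured Streaming sketch, using the built-in "rate" source so it runs without external infrastructure: an aggregation over the stream written to the console in "complete" output mode.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Aggregation on streaming data: count rows by even/odd value.
counts = stream.groupBy((stream.value % 2).alias("parity")).count()

query = (
    counts.writeStream
    .outputMode("complete")    # one of the output modes covered in this module
    .format("console")
    .start()
)
query.awaitTermination(10)     # run briefly for demonstration
query.stop()
spark.stop()
```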
NoSQL
- Relational vs. Non-Relational Databases
- What are NoSQL databases?
- Types of NoSQL databases
Document Datastore: MongoDB
- MongoDB Introduction
- Understanding Basics and CRUD operations
- Structuring Documents
- Create Operations
- Read Operations on Collections
- Updating Documents
- Deleting Documents
- Working with Indexes
- Working with different data types
- Using MongoDB Compass to explore data visually
- Integrating Apache Spark with MongoDB
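A hedged CRUD sketch with PyMongo, assuming a MongoDB server on the default localhost port; the database, collection, and document fields are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
books = client["bootcamp"]["books"]

books.insert_one({"title": "Spark Basics", "year": 2020})              # create
print(books.find_one({"title": "Spark Basics"}))                       # read
books.update_one({"title": "Spark Basics"}, {"$set": {"year": 2021}})  # update
books.create_index("title")                                            # index
books.delete_one({"title": "Spark Basics"})                            # delete

client.close()
```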
DevOps Toolkit
- DevOps Overview
- Containers with Docker
- Orchestrating containers with Kubernetes
- Understanding Continuous Integration
- Understanding Continuous Delivery and Deployment
Cloud Computing Foundations (AWS or GCP)
- Cloud Computing Overview
- Security with Google’s Cloud Infrastructure
- Understanding resource hierarchy
- IAM – Identity and Access Management
- Different IAM Roles
- Connecting to Google Cloud Platform
- Understanding different compute options
- Working with different Relational and NoSQL databases on GCP
- GCP Data Warehouse: BigQuery
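A hedged sketch of querying BigQuery from Python with the google-cloud-bigquery client, assuming credentials are already configured (e.g. via gcloud auth); the public dataset below is only an example.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():   # run the query and wait for results
    print(row.name, row.total)
```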
Project & Use Case
- Project Overview
- Complete projects to get experience and practice
- Industry Use Case Studies
Certification
- Certification Overview
- Identify the right certification for you
- Tips to prepare for certification
Training material provided:
Yes (Digital format)
Hands-on Lab: Instructions will be provided to install Anaconda and PyTorch on students’ machines.