Big Data Hadoop Course Details
Data is an essential part of any organization. Every company generates huge amounts of real-time or batch data, which is why big data plays an essential role regardless of domain. This comprehensive course is designed to meet that need so that you will be able to work with extremely large amounts of data.
Takeaways
- 30+ Big Data Technologies
- Big Data Engine Creation
- Streaming and Batch Processing of Data
- Various SQL Databases
- Various NoSQL Databases
- Real-Time Implementation
- Spark
- Hive
- Talend
- Informatica
- Hadoop Distributions
- Deployment
- Databricks Implementation
Syllabus
Introduction to Distributed Systems – Hadoop and MapReduce
- Why Is Data So Important?
- Pre-Requisite – Data Scale
- What Is Big Data?
- Big Bank: Big Challenge
- Common Problems
- 3 Vs Of Big Data
- Defining Big Data
- Sources Of Data Flood
- Exploding Data Problem
- Redefining The Challenges Of Big Data
- Possible Solutions: Scaling Up Vs. Scaling Out
- Challenges Of Scaling Out
- Solution For Data Explosion – Hadoop
- Hadoop: Introduction
- Hadoop In Layman’s Terms
- Hadoop Ecosystem
- Evolutionary Features Of Hadoop
- Hadoop Timeline
- Why Learn Big Data Technologies?
- Who Is Using Big Data?
- HDFS: Introduction
- Design Of HDFS
- Why Hadoop Cluster?
- HDFS Blocks
- Components Of Hadoop 3
- NameNode And Hadoop Cluster
- Arrangement Of Racks
- Arrangement Of Machines And Racks
- Local FS And HDFS
- NameNode
- Checkpointing
- Replica Placement
- Benefits – Replica Placement And Rack Awareness
- URI
- URL And URN
- HDFS Commands
- Problems With HDFS In Hadoop 1.X
- HDFS Federation
- High Availability
- Anatomy Of File Read From HDFS
- Data Read Steps
- Important Java Classes To Write Data To HDFS
- Anatomy Of File Write To HDFS
- Writing File To HDFS: Steps
- Building Principles
- InputSplit
- InputSplit And Data Blocks – Difference
- Why Is The Block Size 128 MB?
- RecordReader
- InputFormat
- Default InputFormat: TextInputFormat
- OutputFormat
- Using A Different OutputFormat
- Important Points
- Partitioner
- Using Partitioner
- Map Only Job
- Flow Of Operations In MapReduce
- Serialization In MapReduce
- Schedulers In YARN
- FIFO Scheduler
- Capacity Scheduler
- Fair Scheduler
- Differences Between Hadoop 1.X, Hadoop 2.X And Hadoop 3.X
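As a rough illustration of the "Flow Of Operations In MapReduce" and "Partitioner" topics above, the map, partition, shuffle and reduce stages can be sketched in plain Python. This is only a minimal single-process sketch, not Hadoop code: the function names, the sample records, the two-reducer setup and the byte-sum stand-in for a hash partitioner are all assumptions made for the example.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reduce tasks, chosen only for illustration


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def partition(key, num_reducers):
    """Partitioner: route a key to a reducer (deterministic stand-in for a hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    """Shuffle: group the values by key inside each reducer's partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        partitions[partition(key, num_reducers)][key].append(value)
    return partitions


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines, num_reducers=NUM_REDUCERS):
    """Run map, shuffle and reduce in sequence and merge the reducer outputs."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    result = {}
    for part in shuffle(mapped, num_reducers):
        for key in sorted(part):  # each reducer sees its keys in sorted order
            word, total = reduce_phase(key, part[key])
            result[word] = total
    return result


# Hypothetical input records, one per line of a text file.
records = ["big data hadoop", "hadoop hive spark", "big data spark spark"]
counts = word_count(records)
print(counts)
```

Every occurrence of the same word lands in the same partition, which is why each reducer can compute a final count without talking to the others.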
Hive
- Introduction
- Hive DDL
- Demo: Databases.Ddl
- Demo: Tables.Ddl
- Hive Views
- Demo: Views.Ddl
- Architecture
- Primary Data Types
- Data Load
- Demo: ImportExport.Dml
- Demo: HiveQueries.Dml
- Demo: Explain.Hql
- Table Types
- Demo: ExternalTable.Ddl
- Complex Data Types
- Demo: Working With Complex Datatypes
- Hive Variables
- Demo: Working With Hive Variables
- Hive Variables And Execution Customisation
Advanced Hive
- Working With Arrays
- Sort By And Order By
- Distribute By And Cluster By
- Partitioning
- Static And Dynamic Partitioning
- Bucketing Vs Partitioning
- Joins And Types
- Bucket-Map Join
- Sort-Merge-Bucket-Map Join
- Left Semi Join
- Demo: Join Optimisations
- Input Formats In Hive
- Sequence Files In Hive
- RC File In Hive
- File Formats In Hive
- ORC Files In Hive
- Inline Index In ORC Files
- ORC File Configurations In Hive
- SerDe In Hive
- Demo: CSVSerDe
- JSONSerDe
- RegexSerDe
- Analytic And Windowing In Hive
- Demo: Analytics.Hql
- HCatalog In Hive
- Demo: Using_HCatalog
- Accessing Hive With JDBC
- Demo: HiveQueries.Java
- HiveServer2 And Beeline
- Demo: Beeline
- UDF In Hive
- Demo: ToUpper.Java And Working_with_UDF
- Optimizations In Hive
- Demo: Optimizations
NoSQL and HBase
- Challenges With Traditional RDBMS
- Features Of NoSQL Databases
- NoSQL Database Types
- CAP Theorem
- What Is HBase
- Regions
- HBase HMaster
- ZooKeeper
- HBase First Read
- HBase Meta Table
- Region Split
- Apache HBase Architecture Benefits
- HBase Vs. RDBMS
- Shell Commands
Sqoop
- Sqoop Architecture
- Sqoop Features
- Sqoop Hands On
Python
- Python Core
- Introduction to Python and comparison with other programming languages
- Installation of Anaconda Distribution and other Python IDEs
- Python Objects, Numbers & Booleans, Strings
- Container objects, Mutability of objects
- Operators: Arithmetic, Bitwise, Comparison and Assignment operators, Operator precedence and associativity
- Conditions (if else, if elif else), Loops (while, for)
- Break and continue statements and the range function
- String Objects And Collections
- String object basics
- String methods
- Splitting and Joining Strings
- String format functions
- List object basics
- List as stack and Queues
- List comprehensions
- Tuples, Sets, Dictionaries, Functions
- Tuple, Set and Dictionary object basics, Dictionary object methods, Dictionary view objects
- Function basics, Parameter passing, Iterators, Generator functions
- Lambda functions
- Map, Reduce, Filter functions
- OOPS Concepts, Working With Files
- OOPS basic concepts
- Creating classes and objects, Inheritance
- Multiple Inheritance
- Working with files
- Reading and writing files
- Buffered read and write
- Other File methods
- Exception Handling, Database Programming
- Using Standard Module
- Creating new modules
- Exception handling with try/except
- Creating tables, inserting and retrieving data
- Updating and deleting the data
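To give a flavour of the list comprehension, lambda, map/reduce/filter and generator topics listed above, here is a small sketch. The sample data and the names `record_counts` and `running_totals` are hypothetical and exist only for this illustration.

```python
from functools import reduce

# Hypothetical sample data: record counts produced by five batch runs.
record_counts = [120, 45, 300, 8, 76]

# List comprehension: keep only the counts above a threshold.
large_batches = [n for n in record_counts if n > 50]

# filter with a lambda: the same selection as the comprehension above.
filtered = list(filter(lambda n: n > 50, record_counts))

# map with a lambda: double every count.
doubled = list(map(lambda n: n * 2, record_counts))

# reduce: fold the whole list into a single total.
total = reduce(lambda acc, n: acc + n, record_counts, 0)


def running_totals(values):
    """Generator function: yield the cumulative sum one value at a time."""
    acc = 0
    for value in values:
        acc += value
        yield acc


cumulative = list(running_totals(record_counts))

print(large_batches, doubled, total, cumulative)
```

The comprehension and the `filter` call produce the same list; which one reads better is largely a matter of style.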
SQL
- Installing and configuring MySQL
- Install and Configure MySQL Client
- DDL – Create database/table, Drop, Alter, etc
- DML – INSERT, DELETE, UPDATE, MERGE etc
- DQL – SELECT, etc
- JOINS – One-to-Many, Many-to-Many
- DISTINCT
- ORDER BY
- LIMIT
- WILD CARDS
- LOGICAL OPERATORS – LIKE, EQUAL, AND, OR etc
- STRING Functions
- DATE Functions
- MATH Functions
- COUNT, MIN and MAX
- SUM
- AVG
- LAG and LEAD function Examples
- Top N Analysis
- ROW_NUMBER
- RANK AND DENSE_RANK
- CASE WHEN
- PIVOT
- LISTAGG
- UNION
- Sub-Queries
- EXISTS
- NOT EXISTS
- WITH CLAUSE
- Recursive WITH & CTE
- Regular Expressions in SQL
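A few of the topics above (DDL, DML, joins with SUM, EXISTS, NOT EXISTS and a recursive WITH clause) can be tried out with a short script. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the MySQL setup named in this section, purely so that it runs without a database server; the table names, columns and rows are invented for the example, and MySQL syntax may differ in places.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create two hypothetical tables.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

# DML: insert sample rows.
cur.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ben"), (3, "Chen")],
)
cur.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 250.0), (1, 100.0), (3, 75.0)],
)

# EXISTS: customers that have at least one order.
with_orders = [row[0] for row in cur.execute(
    "SELECT name FROM customers c "
    "WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) "
    "ORDER BY name"
)]

# NOT EXISTS: customers that have no orders.
without_orders = [row[0] for row in cur.execute(
    "SELECT name FROM customers c "
    "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) "
    "ORDER BY name"
)]

# JOIN with SUM: total order amount per customer (one-to-many relationship).
totals = cur.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()

# Recursive WITH clause: generate the numbers 1 to 5.
numbers = [row[0] for row in cur.execute(
    "WITH RECURSIVE nums(n) AS ("
    "SELECT 1 UNION ALL SELECT n + 1 FROM nums WHERE n < 5"
    ") SELECT n FROM nums"
)]

conn.close()
print(with_orders, without_orders, totals, numbers)
```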
Cassandra
- Cassandra Introduction
- Cassandra Installation in local system
- DATASTAX Cassandra setup
- Cassandra Architecture
- Cassandra Queries
MongoDB
- MongoDB Introduction
- MongoDB Compass Setup
- MongoDB Atlas Setup
- MongoDB Architecture
- MongoDB Queries
Spark
- Introduction To Apache Spark
- MapReduce Limitations
- RDDs
- SparkContext, SQLContext And HiveContext
- Programming With RDDs
- Creating RDDs From Text Files
- Transformations And Actions
- How Does Spark Execution Work
- RDD APIs – Filter
- FlatMap
- Fold
- Foreach
- Glom
- GroupBy
- Map
- ReduceByKey
- Zip
- Persist
- Unpersist
- Read/Write From Storage
- RDD Examples
- RDD APIs – Aggregate
- Cartesian
- Checkpoint
- Coalesce
- Repartition
- Cogroup
- CollectAsMap
- CombineByKey
- Count And CountApprox Functions
- More RDD Examples
- Schema – StructType
- StructFields
- DataType
- DataFrame APIs And Examples
- Create Temporary Tables
- SparkSQL
- Spark Dataset
- Parquet Vs Avro
- Examples And Problem Solving On Real Data Using RDD And Converting The Same To DataFrame
- Create A Spark Project
- SBT / Maven
- How Do Maven Repos Work
- Accumulators
- Broadcast Variables
- Query Execution Plan
- Internal Workings Of Spark
Databricks
- Databricks Introduction
- Databricks Setup
- Databricks Integration with cloud
- Databricks OPS Pipeline
- Databricks in Production
Kafka
- Introduction To Kafka
- Kafka Architecture
- Kafka Key Concepts/Fundamentals
- Overview Of ZooKeeper And Its Role In Kafka Cluster
- Cluster, Nodes, Brokers, Topics, Consumers, Producers, Logs, Partitions
- Concept Of Consumer Groups
- Leader & Follower Partition
- Installing One Node Kafka Cluster On Local
- Installing Multinode Kafka Cluster On Local
- Command Line Producer And Consumer
- Replication Concept For Fault Tolerance
- How Data Is Stored In Brokers
- Log Segments, Message Offsets, Message Index
- ISR List / Minimum ISR
- Committed Vs Uncommitted Messages
- Writing A Kafka Producer In Java
- Writing A Kafka Consumer In Java
- Scaling Up The Kafka Cluster
- Achieving Exactly Once Semantics
- Integrating Kafka With Spark Structured Streaming
Apache Airflow – Workflow Management Platform
- Introduction To Airflow And Its Usage
- What Is Workflow
- Cron-Job Creation Example
- Airflow Additional Features
- Airflow Architecture And Components
- Airflow Installation Demo
- DAGs – Creating A Simple HelloWorld DAG
- Introduction To Tasks And Operators
- Viewing The DAG In UI – Graph View, Tree View, Logs Viewing
- Example Showcasing Bash Operators Usage
- Setting Precedence Among Various Tasks
- Lifecycle Of A Task – Understanding Various Stages
- About Trigger_rules & Understanding With Example
- Airflow Artifact – More On Operators
- Writing Our Own Custom Operators
- Walkthrough Of Airflow UI
- Connections To Various Datastores & Variables
- Working With Connections, Understanding Sensors – Demo
- Building an end-to-end customer-360 pipeline using Airflow: collecting data from various sources, processing it in Spark, loading the processed data into Hive, uploading it to HBase, and notifying downstream applications of the pipeline's success
Spark Streaming
- Kinds of Processing
- What is Real-time Processing
- The Importance of Real-time Processing
- Batch Processing vs Real-time Stream Processing
- Spark Streaming Data
- Spark Discretized Stream or DStream
- Batch & Batch Interval
- Is Spark a Real-time Streaming Engine?
- Stream Processing in Spark
- Transformed DStream
- Understanding Producer & Consumer
- Practical on Real-time Processing
- Stream Transformations
- Stateless Transformations
- Stateful Transformations
- Window Operations
- Batch Interval, Window Size, Sliding Interval
- Practical on Stateless Transformation
- Practical on Stateful Transformation
- reduceByKey vs updateStateByKey
- Working With Sliding Window
- reduceByKeyAndWindow Transformation
- reduceByWindow Transformation
- countByWindow Transformation
- What Is Structured Streaming
- Requirement Of Structured Streaming
- Limitations Of Spark Streaming
- Benefits Of Spark Structured Streaming
- Practical: Wordcount Example On Structured Streaming
- Dynamically Setting The Shuffle Partitions
- Data Stream Writer Output Modes
- Datastream Output Modes – append, update & complete
- Spark Streaming Graceful Shutdown
- How Does Spark Streaming Code Execute Internally
- How a Job Is Converted to Micro Batches
- Trigger Point For Micro Batches
- Types of Triggers: unspecified, time interval, one time, continuous
- Types of Data Sources: Socket Source, Rate Source, File Source, Kafka Source
- Limitations of Socket Source
- Practical on File Data Source
- Types of Spark Streaming Output Data Options
- Fault Tolerance and Exactly Once Guarantee
- Understanding Checkpoint Location
- Stateful vs Stateless Transformations
- Managed Stateful Operations vs UnManaged Stateful Operations
- Types of Aggregations – Continuous Aggregations vs Time Bound Aggregations
- Window Transformations
- UpdateStateByKey, reduceByKeyAndWindow, reduceByWindow, countByWindow
- Types of windows – Tumbling Time Window, Sliding Time Window
- Dealing With Late Coming Records Using Watermark
- State Store Cleanup
- Calculating the Watermark Boundary
- Streaming Joins
- Streaming DataFrame to Static DataFrame
- Streaming DataFrame With Another Streaming DataFrame
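The window and watermark topics above rest on simple arithmetic, which can be sketched without Spark. The plain-Python functions below are only an illustration under stated assumptions: event times are whole seconds, the function names are hypothetical rather than Spark APIs, and the lateness check is a simplified per-event rule, whereas a real engine decides per window when to discard state.

```python
def tumbling_window(event_time, size):
    """Return the (start, end) of the one tumbling window containing event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Return every (start, end) sliding window that contains event_time."""
    windows = []
    # Window starts are multiples of the slide; begin with the latest one
    # at or before the event and walk backwards while the event still fits.
    start = event_time - (event_time % slide)
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def watermark(max_event_time_seen, delay_threshold):
    """Watermark boundary: latest event time seen minus the allowed delay."""
    return max_event_time_seen - delay_threshold


def is_too_late(event_time, current_watermark):
    """Simplified rule: an event older than the watermark may be dropped."""
    return event_time < current_watermark


# Hypothetical event at t=125s with 60s windows sliding every 30s.
print(tumbling_window(125, 60))
print(sliding_windows(125, 60, 30))

# Latest event seen at t=200s with a 30s delay threshold.
wm = watermark(200, 30)
print(wm, is_too_late(125, wm), is_too_late(180, wm))
```

A tumbling window is the special case of a sliding window whose slide equals its size, so each event belongs to exactly one window.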
Big Data on Cloud
- AWS EMR (Elastic MapReduce):
- What is a VM (Virtual Machine)
- On-Premise vs Cloud Setup
- Major Vendors of Hadoop Distribution
- Why Cloud & Big Data
- Hadoop on Cloud
- Major Cloud Providers of Big Data
- What is EMR
- HDFS vs S3
- What Is S3
- Important Instances in AWS
- Kinds of Nodes in Cluster
- Transient vs Long Running Cluster
- Running Spark Code on EMR
- How to Track Your Job
- Copy File From S3 to Local
- Zeppelin Notebook
- Types of EC2 Instances
- How to Create a VM
- What is a Keypair
- Elastic IP
- AWS Storage, Networking & CLI
- Instance Store
- S3 & EBS
- Public IP vs Private IP
- Network Switches
- Security Group
- AWS Command Line Interface
- Launch an EMR Cluster Using Advanced Options
- AWS Athena
- What is Athena?
- When Do We Require Athena
- What Problem Does Athena Solve
- How Athena Works
- Athena Pricing
- Athena Practical Demonstration
- How to create a normal table manually on CSV data residing in S3
- How to minimize data scanning in Athena
- How to create a partitioned table on Parquet files
- Inferring schema automatically using AWS Glue
- AWS Glue
- What is AWS Glue?
- Introduction To Glue
- Features of Glue
- AWS Glue Benefits
- AWS Glue Terminology
- Pointing to Specific Data Stores and Endpoints
- Glue Data Catalogue
- Crawlers
- Connecting to Your Data Store
- Using Crawlers for Catalogue Tables
- Overview and Working of Glue Jobs
- Adding New Jobs in Glue
- Triggering Jobs and Their Scheduling
- AWS Redshift
- Database vs Data Warehouse vs Data Lake
- Introduction to Amazon Redshift
- Benefits of Amazon Redshift
- Use Cases of Amazon Redshift
- Redshift Master Slave Architecture
- Types of Nodes
- Redshift Spectrum
- Redshift Fault Tolerance
- Redshift Sort Keys
- Redshift Distribution Styles
- Practical Demonstration
Spark ML
- Basic statistics
- Data sources
- Pipelines
- Extracting, transforming and selecting features
- Classification and Regression
- Clustering
- Collaborative filtering
- Frequent Pattern Mining
- Model selection and tuning
- Advanced topics
Enterprise Big Data ETL Tools
- Introduction to ETL with Talend Studio – Integration with HDFS, Hive, Sqoop, Spark etc
- Introduction to ETL with Informatica BDM – Integration with HDFS, Hive, Sqoop, Spark etc
Project and Interview Preparation
- End-to-end Big Data Pipeline Engine Project
- Involving all major components like Sqoop, HDFS, Hive, HBase, Spark… etc.
- Interview Preparation Tips
- Sample Resume
- 300+ Mock Interview Recordings
- Mock Interview QA
- Interview Questions
- How to Handle Questions in Various Interview Rounds
- Career Guidance
- One to One Resume Discussion
- Certification