About Hadoop
Hadoop is an open-source distributed framework used to store data for very large applications running on clustered systems and to manage the data processing of those applications. It is regarded as the core of the big data ecosystem, whose technologies are widely used in machine learning, data mining, advanced analytics and predictive analytics. Hadoop can handle both structured and unstructured data, giving users more flexibility for collecting, managing and analyzing data than relational database and data warehouse systems.
The Hadoop Training course is intended to give you the essential knowledge and skills to become an effective Hadoop architect, big data engineer or Hadoop administrator. It starts with tutorials on the basic concepts of Apache Hadoop and the Hadoop Cluster. It enables you to design, deploy, manage, monitor and secure a Hadoop Cluster. The course also gives a brief introduction to Hive and HBase administration and includes numerous challenging, practical exercises. By the end of the course, you will be able to understand and solve industry-relevant problems that you will encounter while working on a Hadoop Cluster.
Content
Hadoop Basic Concepts
• What is Hadoop?
• The Hadoop Distributed File System
• How Hadoop MapReduce Works
• Anatomy of a Hadoop Cluster
Setting up a Hadoop Cluster
• Make a fully distributed Hadoop Cluster
• Network Topology
• Cluster Specification and installation
• Hadoop Configuration
Hadoop Daemons
• Master Daemons
• NameNode
• JobTracker
• Secondary NameNode
• Slave Daemons
• DataNode
• TaskTracker
Writing a MapReduce Program
• Examining a sample MapReduce program with several examples
• Basic API Concepts
• The Driver Code
• The Mapper
• The Reducer
• The configure and close methods
• Sequence Files
• Record Reader
• Record Writer
• Role of Reporter
• Output Collector
• Processing XML Files
• Counters
• Directly Accessing HDFS
• ToolRunner
• Using the Distributed Cache
Common MapReduce Algorithms
• Sorting, Searching and Indexing
• Word Co-occurrence
• Identity Mapper
• Identity Reducer
• Exploring well-known problems using MapReduce applications
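To give a feel for the algorithms listed above, here is a minimal sketch in plain Python, not Hadoop code. It assumes a simplified, single-process model of a job: a hypothetical `run_job` helper applies a mapper to each input record, groups the emitted values by key in sorted key order (standing in for the shuffle and sort phase), and then applies a reducer to each group. All function names are illustrative and are not part of the Hadoop API.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, shuffle/sort, reduce."""
    groups = defaultdict(list)
    # Map phase: every input (key, value) record may emit any number of pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Shuffle/sort phase (simplified): keys are visited in sorted order.
    results = []
    for key in sorted(groups):
        # Reduce phase: one call per distinct key with all of its values.
        results.extend(reducer(key, groups[key]))
    return results


def word_count_mapper(offset, line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    """Add up the counts collected for one word."""
    yield word, sum(counts)


def identity_mapper(key, value):
    """Pass each record through unchanged."""
    yield key, value


def identity_reducer(key, values):
    """Emit every value for a key unchanged."""
    for value in values:
        yield key, value


# Word count: keys are line offsets, values are lines of text.
lines = [(0, "big data big cluster"), (1, "data node")]
word_counts = run_job(lines, word_count_mapper, sum_reducer)

# Sorting: identity mapper and reducer leave the work to the sort phase.
sorted_pairs = run_job([("b", 2), ("a", 1)], identity_mapper, identity_reducer)

print(word_counts)
print(sorted_pairs)
```

In this sketch the identity mapper and identity reducer do no work of their own, so the sorted output comes entirely from the grouping step. On a real cluster the records would be spread across many nodes and the framework would handle partitioning, which this sketch leaves out.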
Overview of Spark
• What is Spark?
• Hadoop & Spark
• Features of Spark
• Spark Ecosystems
• Spark Streaming
• Spark SQL
• Spark MLlib
• Spark Architecture
• Resilient Distributed Datasets
• How to install Spark
• How to run Spark
• How to interact with Spark
• Spark Web Console
• Shared Variables
• Spark Applications
• Word Count Application
Hive
• Hive Concepts
• Hive architecture
• Creating a database and accessing it from a Java client
• Buckets
• Partition
• Joins in Hive
• Inner Joins
• Outer Joins
• Hive UDF
Sqoop
• Getting Sqoop
• A sample import
• Database Imports
• Controlling the Import
• Imports and Consistency
• Direct-mode Imports
• Performing an export
HDFS (Hadoop Distributed File System)
• Blocks and Splits
• Input Splits
• HDFS Splits
• Methods of accessing HDFS
• Java Approach
• CLI Approach
• Cluster Architecture and Block Placement
• Data Replication
• Hadoop Rack Awareness
• High data availability
• Data Integrity
• Programming Practices
• Developing MapReduce Programs in:
• Local Mode
• Running without HDFS and MapReduce
• Pseudo-distributed mode
• Running all daemons in a single node
• Fully distributed mode
• Running daemons in dedicated nodes
Debugging MapReduce Programs
• Testing with MRUnit
• Logging
• Other Debugging Strategies
Advanced MapReduce Programming
• A recap of the MapReduce Flow
• The Secondary Sort
• Customized Input formats and Output formats
Introduction to YARN
• What is YARN?
• Why YARN?
• Advantages of YARN
• YARN Daemons
• Resource Manager
• Node Manager
• Application Master
• Classic MapReduce vs. YARN
• Anatomy of a YARN application run
• Scheduling in YARN
• Fair Scheduler
• Capacity Scheduler
• YARN as a platform for multiple applications
• Supported YARN applications
Impala
• Introducing Cloudera Impala
• Impala Benefits
• How Cloudera Impala works with CDH
• Primary Impala Features
• Impala Concepts and Architecture
• Components of the Impala Server
• The Impala Daemon
• The Impala Statestore
• The Impala Catalog Service
• Overview of the Impala SQL Dialect
• How Impala fits into the Hadoop Ecosystem
• How Impala works with Hive
• Overview of Impala Metadata and Metastore
• How Impala uses HDFS
Pig
• Pig basics
• Pig vs. MapReduce and SQL
• Pig vs. Hive
• Write sample Pig Latin Scripts
• Modes of running Pig
• Running in Grunt shell
• Pig UDFs
• Pig Macros
Flume
• Flume Concepts
• Create a sample application to capture logs from Apache using Flume
CDH Enhancements
• NameNode High Availability
• NameNode Federation
• Fencing
Certification
After successful completion of the training and project, each participant will be awarded a training certificate/certificate of completion.
Placement Preparation
Along with this course, you will also get complimentary (free of cost) access to the Gradient Infotech placement preparation module, a package to help you ace your placement or internship hunt.
You will learn how to write your resume and cover letter and how to prepare for your interviews.