Basics of Hadoop Ecosystem – 1
Developed back in 2005 by Doug Cutting and Mike Cafarella, Hadoop is an open-source framework written in Java to support the storage and processing of humongous volumes of data in a distributed computing environment.
Built on commodity hardware, Hadoop works on the basic assumption that hardware failures are common; the Hadoop framework itself takes care of these failures.
What is Hadoop Ecosystem?
The Hadoop Ecosystem refers to the various components of the Apache Hadoop software library. It is a set of tools and accessories, each addressing a particular need in processing Big Data.
In other words, a set of different modules interacting together forms the Hadoop Ecosystem.
I have given an overview of the applications, tools, and interfaces currently available in the Hadoop Ecosystem. Discussed below are the different components of Hadoop.
Let us start with the core components of the Hadoop framework:
DISTRIBUTED STORAGE:
HDFS
- It stands for Hadoop Distributed File System.
- It is a distributed file system for redundant storage.
- Designed to store data reliably on commodity hardware.
- Built to expect hardware failures.
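To make this concrete, here is a minimal sketch of writing to and reading from HDFS through the Java FileSystem API. The file path is hypothetical, and the connection details are assumed to come from the cluster's own core-site.xml/hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // connects to the configured HDFS
            Path file = new Path("/tmp/hello.txt");     // hypothetical path

            // Write a small file; HDFS replicates its blocks across the cluster.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS!");
            }

            // Read it back.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }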
HBase (NoSQL Database)
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• Storage of large data volumes (billions of rows) atop clusters of commodity hardware.
• Bulk storage of logs, documents, real-time activity feeds and raw imported data.
• Consistent performance of reads/writes to data used by Hadoop applications.
• Allows stored data to be aggregated or processed using MapReduce functionality.
• Data platform for Analytics and Machine Learning.
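To illustrate the point-query style, here is a minimal sketch using the HBase Java client (the Connection/Table API). The table name, column family, and values are hypothetical, and the table is assumed to already exist.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Connects using hbase-site.xml found on the classpath.
            try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = connection.getTable(TableName.valueOf("activity_feed"))) {  // hypothetical table

                // Write one cell: row key, column family, qualifier, value.
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("events"), Bytes.toBytes("last_login"), Bytes.toBytes("2023-01-01"));
                table.put(put);

                // Random read (point query) of the same cell.
                Result result = table.get(new Get(Bytes.toBytes("user42")));
                byte[] value = result.getValue(Bytes.toBytes("events"), Bytes.toBytes("last_login"));
                System.out.println(Bytes.toString(value));
            }
        }
    }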
HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and write data in tabular form rather than working with raw files.
• Centralized location of storage for data used by Hadoop applications.
• Reusable data store for sequenced and iterated Hadoop processes.
• Storage of data in a relational abstraction.
• Metadata Management.
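As a rough sketch of the idea, a MapReduce job can be pointed at a Hive/HCatalog table through HCatInputFormat instead of at raw HDFS files; the database and table names below are hypothetical, and the exact setup call may vary by HCatalog version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class HCatalogJobSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read-from-hcatalog");

            // Read records through the table's schema instead of parsing raw files.
            HCatInputFormat.setInput(job, "default", "web_logs");   // hypothetical db/table
            job.setInputFormatClass(HCatInputFormat.class);

            // The mapper would then receive HCatRecord values carrying typed, named fields.
            // (Mapper/reducer classes and output configuration omitted in this sketch.)
        }
    }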
Once data is stored, we want to examine it and create insights from it.
DISTRIBUTED PROCESSING:
MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce paradigm, which breaks operations down into Map and Reduce functions, as sketched after the list below.
• Aggregation (counting, sorting, and filtering) on large and disparate data sets.
• Scalable parallelism of Map or Reduce tasks.
• Distributed task execution.
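The canonical example is word count: the Map function emits a (word, 1) pair for every word it sees, and the Reduce function sums the counts per word. Here is a minimal sketch of the two functions using the standard Hadoop MapReduce API (job-driver boilerplate omitted).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: split each input line into words and emit (word, 1).
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }
    }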
YARN
Yet Another Resource Negotiator (YARN) is the cluster and resource management layer of the Apache Hadoop ecosystem. It is one of the main features of the second-generation Hadoop framework.
• YARN schedules applications in order to prioritize tasks and maintain big data analytics systems.
• As one part of a greater architecture, YARN allocates and arbitrates the cluster resources used by the engines that actually run queries and data-retrieval jobs.
• It helps allocate resources to particular applications and manages other resource-monitoring tasks.
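As a small illustration of the monitoring side, the YarnClient API can be used to inspect the cluster's nodes and running applications. This is a minimal sketch assuming an already-configured cluster reachable via yarn-site.xml.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnMonitor {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml
            yarnClient.start();

            // List the healthy nodes the ResourceManager can schedule on.
            for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
            }

            // List applications currently known to the cluster.
            for (ApplicationReport app : yarnClient.getApplications()) {
                System.out.println(app.getApplicationId() + " -> " + app.getYarnApplicationState());
            }

            yarnClient.stop();
        }
    }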
MACHINE LEARNING
Mahout
Apache Mahout is an open-source project primarily used for creating scalable machine-learning algorithms. It is a data-mining framework that normally runs on top of the Hadoop infrastructure to manage huge volumes of data.
• Mahout offers the coder a ready-to-use framework for doing data-mining tasks on large volumes of data.
• Written on top of Hadoop, Mahout's algorithms work well in a distributed environment.
• Mahout lets applications analyse large data sets effectively and quickly.
• Comes with the distributed fitness function capabilities for evolutionary programming. Includes matrix and vector libraries.
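For instance, here is a minimal sketch of a user-based recommender built with Mahout's Taste API. The CSV file of (user, item, rating) triples and the neighborhood size are hypothetical.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MahoutRecommenderExample {
        public static void main(String[] args) throws Exception {
            // Each line of ratings.csv: userID,itemID,rating (hypothetical file).
            DataModel model = new FileDataModel(new File("ratings.csv"));

            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 item recommendations for user 1.
            List<RecommendedItem> recommendations = recommender.recommend(1, 3);
            for (RecommendedItem item : recommendations) {
                System.out.println(item.getItemID() + " scored " + item.getValue());
            }
        }
    }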
WORKFLOW MONITORING & SCHEDULING
Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It runs workflows of dependent jobs, allowing users to create Directed Acyclic Graphs (DAGs) of actions that run in parallel or sequentially in Hadoop.
• Oozie is also very flexible. One can easily start, stop, suspend and rerun jobs.
• It makes it very easy to rerun failed workflows.
• Oozie is scalable and can manage timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.
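As a sketch of driving Oozie programmatically, a workflow definition stored in HDFS can be submitted through the OozieClient Java API. The server URL, application path, and cluster addresses below are all hypothetical.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical Oozie server URL.
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/my-workflow"); // workflow.xml lives here
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "resourcemanager:8032");

            // Submit and start the workflow, then poll its status.
            String jobId = client.run(conf);
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println(jobId + " is " + job.getStatus());
        }
    }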
SCRIPTING:
Pig
We can use Apache Pig for scripting in Hadoop. Pig provides a SQL-like scripting language, Pig Latin, and an execution environment for creating complex MapReduce transformations. Scripts are first written in Pig Latin and then translated into executable MapReduce jobs (a short sketch follows the list below).
Pig also allows the user to create user-defined functions (UDFs) in Java.
• Scripting environment to execute ETL tasks/procedures on raw data in HDFS.
• SQL-like language for creating and running complex MapReduce functions.
• Data processing, stitching, and schematizing on large and disparate data sets.
• It’s a high-level data flow language.
• It abstracts you from the specific details and allows you to focus on data processing.
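To tie this together, here is a minimal sketch that embeds Pig Latin in Java through the PigServer API, loading a hypothetical tab-separated file, filtering it, and storing the result back to HDFS.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // Run the script on the cluster; use ExecType.LOCAL to test on one machine.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            // Hypothetical input: tab-separated (name, age) records in HDFS.
            pig.registerQuery("users = LOAD '/data/users.tsv' AS (name:chararray, age:int);");
            pig.registerQuery("adults = FILTER users BY age >= 18;");

            // Each STORE is compiled into one or more MapReduce jobs.
            pig.store("adults", "/data/adults_out");
        }
    }

When this runs, the STORE statement is planned and submitted to the cluster as MapReduce jobs, which is exactly the translation from Pig Latin described above.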