
Basics of Hadoop Ecosystem – 1


Click here to read Part 2...

Developed back in 2005 by Doug Cutting and Mike Cafarella, Hadoop is an open-source framework written in Java to support and process humongous data in a distributed computing environment.
Built on commodity hardware, Hadoop works on the basic assumption that hardware failures are common; these failures are handled by the Hadoop framework.

What is the Hadoop Ecosystem?
The Hadoop Ecosystem refers to the various components of the Apache Hadoop software library. It is a set of tools and accessories that address particular needs in processing Big Data.
In other words, a set of different modules interacting together forms the Hadoop Ecosystem.
Below is an overview of the applications, tools and modules or interfaces currently available in the Hadoop Ecosystem, followed by a discussion of its different components.
Let us start with the core components of the Hadoop framework:

DISTRIBUTED STORAGE:

HDFS
  • It stands for Hadoop Distributed File System.
  • It is a distributed file system for redundant storage.
  • Designed to store data reliably on commodity hardware.
  • Built to expect hardware failures.
  • Intended for large files and batch inserts (write once, read many times).
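
As a minimal sketch of the write-once, read-many pattern, the code below uses Hadoop's FileSystem Java API to create a file on HDFS and read it back; the NameNode URI and file path are hypothetical placeholders.

// Minimal sketch: writing and reading a file on HDFS with the Hadoop FileSystem API.
// The NameNode URI and path below are placeholders; normally they come from core-site.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/events.log");

        // Write once...
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("first event\n".getBytes(StandardCharsets.UTF_8));
        }

        // ...read many times.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}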

[Image: Hadoop environment support. Source: http://www.tdprojecthope.com]


HBase (NoSQL Database)

A distributed, column-oriented NoSQL database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
• Storage of large data volumes (billions of rows) atop clusters of commodity hardware.
• Bulk storage of logs, documents, real-time activity feeds and raw imported data.
• Consistent read/write performance for data used by Hadoop applications.
• Allows stored data to be aggregated or processed using MapReduce functionality.
• Data platform for analytics and machine learning.
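
A minimal sketch of a point write and a point query (random read) using the standard HBase Java client is shown below; the table name "activity_feed", column family "f" and row key are hypothetical, and the table is assumed to already exist.

// Minimal sketch: one put and one get against an existing HBase table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("activity_feed"))) {

            // Point write: one row keyed by user id + timestamp (hypothetical schema).
            Put put = new Put(Bytes.toBytes("user42#2023-01-01T00:00:00"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("event"), Bytes.toBytes("login"));
            table.put(put);

            // Point query (random read) on the same row key.
            Get get = new Get(Bytes.toBytes("user42#2023-01-01T00:00:00"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("f"), Bytes.toBytes("event"))));
        }
    }
}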

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce and Hive) to read and write data in tabular form rather than as raw files; a small read example follows the list below.

• Centralized storage location for data used by Hadoop applications.
• Reusable data store for sequenced and iterated Hadoop processes.
• Storage of data in a relational abstraction.
• Metadata management.
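
As a sketch of the "tabular form" idea, the following hypothetical map-only job reads a Hive table through HCatalog's HCatInputFormat instead of reading raw HDFS files. The database name "default", table name "page_views" and output path are assumptions, and the org.apache.hive.hcatalog client jars are assumed to be on the classpath.

// Minimal sketch: a map-only MapReduce job that reads rows (HCatRecord) from a Hive
// table via HCatalog and emits the first column of each row.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatTableRead {

    // Emits the first column of every row in the table.
    public static class FirstColumnMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord row, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(String.valueOf(row.get(0))), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcat-table-read");
        job.setJarByClass(HCatTableRead.class);

        // Read the Hive table "page_views" in the "default" database via HCatalog.
        HCatInputFormat.setInput(job, "default", "page_views");
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(FirstColumnMapper.class);
        job.setNumReduceTasks(0); // map-only
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hcat-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}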
Once data is stored, we want to examine it and create insights from it.

DISTRIBUTED PROCESSING:

MapReduce

A distributed data processing model and execution environment that runs on large clusters of commodity machines. It breaks operations down into Map and Reduce functions (see the word-count sketch after this list).
• Aggregation (counting, sorting and filtering) on large and disparate data sets.
• Scalable parallelism of Map and Reduce tasks.
• Distributed task execution.
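
The classic word-count job is a compact illustration of the Map and Reduce split described above; input and output paths are supplied on the command line, and the rest is the standard Hadoop MapReduce API.

// Minimal sketch: the classic word-count MapReduce job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word (the aggregation step from the list above).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}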
YARN
Yet Another Resource Negotiator (YARN) is the cluster and resource management layer of the Apache Hadoop ecosystem. It is one of the main features of the second generation of the Hadoop framework.
• YARN schedules applications in order to prioritize tasks across the cluster.
• As one part of the wider architecture, YARN decouples resource management from data processing, so multiple processing engines can share the same cluster and data.
• It allocates resources (memory and CPU) to particular applications and handles other resource-monitoring tasks.
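
As a small illustration of resource monitoring, the sketch below uses the YarnClient API to list the applications and node reports the ResourceManager knows about; it assumes a reachable cluster configured via yarn-site.xml on the classpath.

// Minimal sketch: inspecting running applications and node resources via YarnClient.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInspector {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration(new Configuration()));
        yarnClient.start();

        // Applications currently known to the ResourceManager.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }

        // Per-node resource reports (memory/vcores) for running NodeManagers.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.printf("%s  used=%s  capability=%s%n",
                    node.getNodeId(), node.getUsed(), node.getCapability());
        }

        yarnClient.stop();
    }
}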

MACHINE LEARNING:
Mahout
Apache Mahout is an open-source project primarily used for creating scalable machine learning algorithms. Mahout is a data-mining framework that normally runs on top of the Hadoop infrastructure to manage huge volumes of data.
• Mahout offers the coder a ready-to-use framework for doing data-mining tasks on large volumes of data.
• Written on top of Hadoop, Mahout's algorithms work well in a distributed environment.
• Mahout lets applications analyse large data sets effectively and quickly.
• Comes with distributed fitness-function capabilities for evolutionary programming, and includes matrix and vector libraries.
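
As a tiny illustration of the matrix and vector libraries mentioned in the last bullet, the sketch below uses Mahout's DenseVector (from mahout-math) to compute a cosine similarity; the vector values are made-up sample data.

// Minimal sketch: cosine similarity between two dense vectors with mahout-math.
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class MahoutVectorDemo {
    public static void main(String[] args) {
        Vector a = new DenseVector(new double[]{1.0, 0.0, 2.0, 3.0});
        Vector b = new DenseVector(new double[]{0.5, 1.0, 2.0, 1.0});

        // Cosine similarity = (a . b) / (|a| * |b|)
        double cosine = a.dot(b) / (a.norm(2) * b.norm(2));
        System.out.println("cosine similarity = " + cosine);
    }
}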

WORKFLOW MONITORING & SCHEDULING:
Oozie
Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It runs workflows of dependent jobs, allowing users to create Directed Acyclic Graphs (DAGs) of workflows whose actions run in parallel and sequentially in Hadoop.

• Oozie is also very flexible: one can easily start, stop, suspend and rerun jobs.
• It makes it very easy to rerun failed workflows.
• Oozie is scalable and can manage the timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.
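
A minimal sketch of submitting and checking a workflow through the Oozie Java client is shown below; the Oozie URL, HDFS application path and user name are hypothetical placeholders, and the workflow.xml defining the DAG is assumed to already exist in HDFS.

// Minimal sketch: submitting a workflow and polling its status with the Oozie client.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie"); // placeholder URL

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/app"); // placeholder path
        conf.setProperty("user.name", "demo");

        // Start the workflow (the DAG defined in workflow.xml under APP_PATH).
        String jobId = client.run(conf);
        System.out.println("Submitted workflow: " + jobId);

        // Check the job status; failed workflows can simply be rerun later.
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}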

SCRIPTING:
Pig
We can use Apache Pig for scripting in Hadoop. Pig Latin, Pig's high-level data flow language, together with its execution environment, is used for creating complex MapReduce transformations: scripts written in Pig Latin are translated into executable MapReduce jobs.
Pig also allows the user to create extended functions (UDFs) using Java, as sketched below.
• Scripting environment to execute ETL tasks/procedures on raw data in HDFS.
• SQL-like, high-level language for creating and running complex MapReduce functions.
• Data processing, stitching and schematizing on large and disparate data sets.
• It is a high-level data flow language.
• It abstracts you from the specific details and allows you to focus on data processing.
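
As mentioned above, UDFs can be written in Java by extending Pig's EvalFunc class. Below is a minimal sketch; the class name UpperCase is a hypothetical example.

// Minimal sketch: a Pig UDF that upper-cases a chararray field.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

In a Pig Latin script, the jar containing this class would be registered with REGISTER and the function then invoked like any built-in, for example UpperCase(name).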


