Basics Of Hadoop Ecosystem - 1

Developed back in 2005 by Doug Cutting and Mike Cafarella, Hadoop is an open source framework written in Java for storing and processing very large data sets in a distributed computing environment.
Hadoop is designed to run on commodity hardware and works on the basic assumption that hardware failures are common. The framework detects and handles these failures itself, so applications do not have to.

What is Hadoop Ecosystem?
The Hadoop Ecosystem refers to the various components of the Apache Hadoop software library. It is a set of tools and accessories, each addressing a particular need in processing Big Data.
In other words, a set of different modules interacting together forms the Hadoop Ecosystem.
This post gives an overview of the applications, tools, modules and interfaces currently available in the Hadoop Ecosystem.
Let us start with the core components of the Hadoop framework:

DISTRIBUTED STORAGE:

HDFS
  • It stands for Hadoop Distributed File System.
  • It is a distributed file system that stores data redundantly.
  • Designed to store data reliably on commodity hardware.
  • Built to expect hardware failures.
  • Intended for large files and batch inserts (write once, read many times).
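
To make the idea of redundant storage concrete, here is a minimal sketch in Python of how a large file can be split into blocks and each block copied to several machines. It is not the HDFS implementation: the 128 MB block size and the replication factor of 3 match the usual HDFS defaults, but the node names and the round-robin placement are assumptions made purely for illustration (real HDFS placement is rack-aware).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the usual HDFS default block size
REPLICATION = 3                  # the usual HDFS default replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} for a file of file_size bytes.

    Placement is plain round-robin (an illustrative assumption);
    real HDFS uses a rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block still has a replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]         # hypothetical machine names
placement = place_blocks(300 * 1024 * 1024, nodes)   # a 300 MB file needs 3 blocks
still_ok = readable_after_failure(placement, "node3")
```

Because every block lives on three different machines, losing any single machine still leaves two copies of each block, which is why the file system can treat hardware failure as routine.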

Figure: Hadoop environment support (source: http://www.tdprojecthope.com)


HBase (NoSQL Database)

A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• Storage of large data volumes (billions of rows) atop clusters of commodity hardware.
• Bulk storage of logs, documents, real-time activity feeds and raw imported data.
• Consistent performance of reads/writes to data used by Hadoop applications.
• Stored data can be aggregated or processed using MapReduce.
• Data platform for Analytics and Machine Learning.
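
The difference between a point query and a batch-style scan is easier to see with a toy model. The sketch below is plain Python, not the HBase client API: `TinyTable` is a hypothetical class that keeps rows sorted by row key and stores each cell under a "family:qualifier" column name. Cell versions and timestamps are left out.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy stand-in for an HBase-style table (hypothetical, for illustration)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; new row keys are inserted in sorted position."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point query (random read) by row key."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Range scan over row keys in [start, stop), in sorted order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


table = TinyTable()
table.put("user#003", "info:name", "Cy")
table.put("user#001", "info:name", "Al")
table.put("user#002", "info:name", "Bo")
table.put("user#002", "stats:logins", 7)

one_row = table.get("user#002")               # random read of a single row
some_rows = table.scan("user#001", "user#003")  # ordered scan of a key range
```

Keeping rows ordered by key is what makes both access patterns cheap: a single row is found directly, and a contiguous key range can be read in one pass.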

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and write data as tables rather than as raw files.

• Centralized location of storage for data used by Hadoop applications.
• Reusable data store for sequenced and iterated Hadoop processes.
• Storage of data in a relational abstraction.
• Metadata Management.
Once the data is stored, we want to examine it and derive insights from it.

DISTRIBUTED PROCESSING:

MapReduce

A distributed data processing model and execution environment that runs on large clusters of commodity machines. It follows the MapReduce programming model, which expresses every operation as a Map function or a Reduce function.
• Aggregation (counting, sorting, and filtering) on large and disparate data sets.
• Scalable parallelism of Map or Reduce tasks.
• Distributed task execution.
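
A word count is the classic way to see the Map and Reduce steps. The following is a single-process Python sketch of the model, not Hadoop's Java API; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here for illustration. On a real cluster the map and reduce calls run as separate tasks on different machines, and the framework performs the shuffle between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
```

Each line can be mapped independently and each key can be reduced independently, which is what makes the model easy to parallelize.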
YARN
Yet Another Resource Negotiator (YARN) is the cluster and resource management layer for the Apache Hadoop ecosystem. It is one of the main features of the second generation of the Hadoop framework.
• YARN schedules applications, deciding which tasks run first when cluster resources are limited.
• As one part of a larger architecture, YARN separates resource management from data processing, so that engines such as MapReduce run as applications on top of it.
• It allocates resources to particular applications and handles resource monitoring.
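
Resource allocation can be pictured as matching requests for memory against the free capacity of the cluster machines. The Python sketch below is a deliberately simple first-fit, first-come-first-served allocator and is not how YARN's real schedulers work; the application names, node names and memory figures are hypothetical.

```python
def allocate(requests, node_capacity_mb):
    """Grant each (app, memory_mb) request on the first node with enough free memory.

    Returns (grants, pending): grants is a list of (app, node, memory_mb);
    pending holds the requests that no node could fit.
    """
    free = dict(node_capacity_mb)   # copy so the caller's dict is not modified
    grants, pending = [], []
    for app, memory_mb in requests:
        node = next((n for n, mb in free.items() if mb >= memory_mb), None)
        if node is None:
            pending.append((app, memory_mb))
        else:
            free[node] -= memory_mb
            grants.append((app, node, memory_mb))
    return grants, pending


capacity = {"nm1": 4096, "nm2": 2048}   # hypothetical worker nodes, free memory in MB
requests = [("etl", 2048), ("report", 3072), ("etl", 2048), ("ml", 1024)]
grants, pending = allocate(requests, capacity)
```

In this run the "report" request stays pending because no single node has 3072 MB free, while the smaller requests are placed. A production scheduler would also weigh queues, priorities and fairness between applications.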

MACHINE LEARNING
Mahout
Apache Mahout is an open source project used primarily for creating scalable machine learning algorithms. Mahout is a data-mining framework that normally runs with the Hadoop infrastructure in the background to manage huge volumes of data.
• Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
• Because its algorithms are written on top of Hadoop, Mahout works well in a distributed environment.
• Mahout lets applications analyze large data sets effectively and quickly.
• Comes with distributed fitness function capabilities for evolutionary programming, and includes matrix and vector libraries.

WORKFLOW MONITORING & SCHEDULING
Oozie
Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It runs workflows of dependent jobs, and it allows users to define workflows as Directed Acyclic Graphs (DAGs) whose jobs run in parallel or in sequence in Hadoop.

• Oozie is also very flexible. One can easily start, stop, suspend and rerun jobs.
• It makes it very easy to rerun failed workflows.
• Oozie is scalable and can manage timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.
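
The DAG idea can be illustrated in a few lines. Oozie itself defines workflows in XML; the Python sketch below only shows how a dependency graph determines which jobs must wait and which can run side by side. The job names are hypothetical, and `execution_waves` is a helper written for this example using the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical job names).
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}


def execution_waves(dag):
    """Group jobs into waves: jobs in one wave have no unmet dependencies,
    so they could run in parallel; waves themselves run in sequence."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()   # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(workflow)
```

Here "aggregate" and "train_model" land in the same wave because both depend only on "clean", while "report" has to wait for both of them.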

SCRIPTING:
Pig
We can use Apache Pig for scripting in Hadoop. Pig provides a SQL-like scripting language, Pig Latin, and an execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin and then translated into executable MapReduce jobs.
Pig also allows the user to create user-defined functions (UDFs) in Java.
• Scripting environment to execute ETL tasks/procedures on raw data in HDFS.
• SQL-like language for creating and running complex MapReduce functions.
• Data processing, stitching, and schematizing on large and disparate data sets.
• It is a high-level data flow language.
• It hides the low-level details of MapReduce and lets you focus on data processing.
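
A data flow script is a chain of steps, each taking the output of the previous one. The Python sketch below mimics that style with steps named after the Pig Latin operators FILTER, GROUP and FOREACH ... GENERATE; it is not Pig Latin and does not run on Hadoop, and the log records are made up for the example.

```python
from collections import defaultdict

# Hypothetical raw log records: (user, action, bytes)
raw = [
    ("ann", "download", 120),
    ("bob", "view", 5),
    ("ann", "download", 80),
    ("bob", "download", 40),
]


def bytes_downloaded_per_user(records):
    # FILTER: keep only the download events
    downloads = (r for r in records if r[1] == "download")
    # GROUP: collect the byte counts for each user
    grouped = defaultdict(list)
    for user, _action, size in downloads:
        grouped[user].append(size)
    # FOREACH ... GENERATE: produce one output row per group
    return {user: sum(sizes) for user, sizes in grouped.items()}


totals = bytes_downloaded_per_user(raw)
```

The script says what should happen to the data at each step and nothing about how the work is split across machines, which is the kind of abstraction a high-level data flow language provides.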


