Hadoop Framework and Ecosystem

In the previous blog on Hadoop Tutorial, we discussed Hadoop, its features and its core components. The next step is to understand the Hadoop Ecosystem, an essential topic to grasp before you start working with Hadoop.
The Hadoop Ecosystem is neither a programming language nor a single service. It is a platform or framework that solves big data problems. You can think of it as a suite that encompasses a number of services for ingesting, storing, analyzing and maintaining data. Let us get a brief idea of how these services work individually and in collaboration.
Below are the components that together form the Hadoop ecosystem. I will be covering each of them in this blog:



  • HDFS -> Hadoop Distributed File System
  • YARN -> Yet Another Resource Negotiator
  • MapReduce -> Data processing using programming
  • Spark -> In-memory Data Processing
  • Pig, Hive -> Data Processing Services using Query (Pig Latin scripts, SQL-like HiveQL)
  • HBase -> NoSQL Database
  • Mahout, Spark MLlib -> Machine Learning
  • Apache Drill -> SQL on Hadoop
  • ZooKeeper -> Managing Cluster
  • Oozie -> Job Scheduling
  • Flume, Sqoop -> Data Ingesting Services
  • Solr & Lucene -> Searching & Indexing 
  • Ambari -> Provision, Monitor and Maintain cluster
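
To make the "data processing using programming" idea behind MapReduce a little more concrete, here is a minimal sketch of the map, shuffle and reduce steps as a word count. It is plain Python running in a single process, not the Hadoop API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration. On a real cluster, Hadoop would run the map and reduce steps in parallel across nodes and do the grouping for you.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Chain the three steps in one process (illustrative only)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["Hadoop stores data in HDFS", "YARN schedules Hadoop jobs"]
    print(word_count(sample))
```

The key point of the design is that the map and reduce functions only ever see one record or one key at a time, which is what lets a framework like Hadoop spread the work over many machines.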

I hope this blog was informative and added value for you. If you are interested in learning more, you can go through the blogs on how Big Data is used in industries and How Hadoop Is Revolutionizing Analytics.
