Basics of Hadoop Ecosystem – 2


Introduction:

In part 2 of this blog series, I will cover the remaining core components of the Hadoop framework: querying, external integration, data exchange, coordination, and the provisioning, management, and monitoring of Hadoop clusters. Please refer to Basics of Hadoop Ecosystem Part 1...

QUERYING:
Pig seems quite useful; however, I am more of an SQL person. For those of us who still like SQL, there is SQL for Hadoop.

Hive
Hive is a distributed data warehouse built atop HDFS to manage and organize large amounts of data. It provides a query language based on SQL semantics (HiveQL), which the runtime engine translates into MapReduce jobs for querying the data.
  • Schematized data store for housing large amounts of raw data.
  • SQL-like environment to execute analysis and querying tasks on raw data in the HDFS.
  • Integration with external RDBMS applications.
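
As a rough sketch of what this looks like in practice, the Java snippet below connects to a hypothetical HiveServer2 endpoint over JDBC, defines an external table over raw files already sitting in HDFS, and runs a HiveQL aggregation. The host, table, and paths are placeholders, and the hive-jdbc driver must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://hive-server:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Define a schema over raw files in HDFS (schema-on-read).
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs ("
                        + " ip STRING, ts STRING, url STRING, status INT)"
                        + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
                        + " LOCATION '/data/raw/web_logs'");

                // The HiveQL below is translated into MapReduce jobs by the runtime engine.
                ResultSet rs = stmt.executeQuery(
                        "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status");
                while (rs.next()) {
                    System.out.println(rs.getInt("status") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }

Because the table is declared EXTERNAL, dropping it removes only the metadata; the raw files remain in HDFS.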




EXTERNAL INTEGRATION:

Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS. Flume transports large quantities of event data using a streaming data-flow architecture that is fault tolerant and supports failover and recovery.
  • Transports large amounts of event data (network traffic, logs, email messages).
  • Streams data from multiple sources into HDFS.
  • Guarantees reliable, real-time data streaming to Hadoop applications.
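
To make the source-channel-sink flow concrete, here is a minimal agent configuration of the kind Flume reads from a properties file; the agent name, log path, and HDFS directory are assumptions for illustration.

    # One agent: an exec source tailing an application log,
    # an in-memory channel, and an HDFS sink.
    agent1.sources  = tail-src
    agent1.channels = mem-ch
    agent1.sinks    = hdfs-sink

    agent1.sources.tail-src.type     = exec
    agent1.sources.tail-src.command  = tail -F /var/log/myapp/app.log
    agent1.sources.tail-src.channels = mem-ch

    agent1.channels.mem-ch.type     = memory
    agent1.channels.mem-ch.capacity = 10000

    agent1.sinks.hdfs-sink.type          = hdfs
    agent1.sinks.hdfs-sink.channel       = mem-ch
    agent1.sinks.hdfs-sink.hdfs.path     = /flume/events/%Y-%m-%d
    agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

The agent is then started with the flume-ng command, naming the agent (agent1 here) and pointing it at this configuration file.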

DATA EXCHANGE:

Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external data stores such as relational databases and enterprise data warehouses. Sqoop is widely used in most Big Data companies to move data between relational databases and Hadoop, and it works with relational databases such as Teradata, Netezza, Oracle, MySQL, PostgreSQL, etc.
  • Sqoop automates most of the process, relying on the database to describe the schema of the data to be imported.
  • Sqoop uses the MapReduce framework to import and export the data, which provides parallelism as well as fault tolerance.
  • It provides connectors for all major RDBMS databases.
  • It supports full and incremental loads, parallel export/import of data, and data compression.
  • It supports Kerberos security integration.
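
As an illustration of the points above, a hypothetical incremental import from MySQL into HDFS might look like the command below; the connection string, credentials, table, and column names are placeholders.

    sqoop import \
      --connect jdbc:mysql://db-host:3306/sales \
      --username etl_user \
      --password-file /user/etl/.sqoop_pw \
      --table orders \
      --target-dir /data/warehouse/orders \
      --num-mappers 4 \
      --incremental append \
      --check-column order_id \
      --last-value 1000000 \
      --compress

Here --num-mappers controls how many parallel MapReduce map tasks carry out the transfer, and --incremental append with --check-column/--last-value pulls only rows added since the previous run.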

COORDINATION:

ZooKeeper
Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster. It is a centralized repository where distributed applications can write data and from which they can read it back.
  • ZooKeeper is a Hadoop administration tool used for managing jobs in a cluster.
  • ZooKeeper acts like a watch guard: when data changes on one node, it notifies the other nodes of the change.
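
A minimal sketch of that notification behaviour, using the ZooKeeper Java client against a hypothetical ensemble address and znode path:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkWatchExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);

            // Hypothetical ensemble address; replace with your ZooKeeper quorum.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();   // wait until the session is established

            String path = "/demo-config";   // top-level znode, so no parents are needed

            // Create the znode if it does not exist yet.
            if (zk.exists(path, false) == null) {
                zk.create(path, "on".getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Read the value and register a watch: ZooKeeper notifies this client
            // when another node changes the data -- the "watch guard" behaviour above.
            byte[] data = zk.getData(path,
                    changed -> System.out.println("Changed: " + changed.getPath()), null);
            System.out.println("Current value: " + new String(data));

            zk.close();
        }
    }

Watches in ZooKeeper are one-time triggers: when another client updates /demo-config, the watch fires once and the application re-registers it by reading the data again.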

PROVISIONING, MANAGING & MONITORING HADOOP CLUSTERS:

Ambari
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It has a simple yet highly interactive UI for installing various tools and performing management, configuration, and monitoring tasks. Ambari provides a dashboard for viewing cluster health (for example, heat maps) and for visually inspecting MapReduce, Pig, and Hive applications, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Maps master services to nodes.
  • We can always choose the services that we want to install.
  • Stack selection is made easy, and we can customize our services.
  • Ambari provides a simpler interface and saves a great deal of effort when installing, monitoring, and managing the many components, each with its own installation steps and monitoring controls.
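
Beyond the web UI, Ambari also exposes a REST API. As a rough sketch, the Java snippet below lists the clusters an Ambari server manages; the host, port, and admin credentials are assumptions and should be adjusted for your installation (Java 11+ for java.net.http).

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class AmbariClusterList {
        public static void main(String[] args) throws Exception {
            // Hypothetical Ambari server and credentials.
            String url = "http://ambari-host:8080/api/v1/clusters";
            String auth = Base64.getEncoder()
                    .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("Authorization", "Basic " + auth)
                    .header("X-Requested-By", "ambari")
                    .GET()
                    .build();

            // The JSON response enumerates the clusters Ambari manages.
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }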

Conclusion:
Hadoop has been a very effective solution for companies dealing with extremely large volumes of data. It is a much sought-after tool in the industry for data management in distributed systems, and because it is open source, it is readily available for companies to leverage.

These are some highlights of Apache's Hadoop ecosystem; documentation for each of these projects is available on the Apache Software Foundation website.
Hadoop and its ecosystem are expected to grow further and take on new roles, even as other systems fill important roles.
