Basics of Hadoop Ecosystem – 2



Introduction:

In part 2 of this blog series, I will cover the other core components of the Hadoop framework, including querying, external integration, data exchange, coordination, and the provisioning, management, and monitoring of Hadoop clusters. Please refer to Basics of Hadoop Ecosystem – Part 1...

QUERYING:
Pig seems quite useful; however, I am more of a SQL person. For those of us who still like SQL, there is SQL for Hadoop.

HIVE
Hive is a distributed data warehouse built atop HDFS to manage and organize large amounts of data. It provides a query language based on SQL semantics (HiveQL), which the runtime engine translates into MapReduce jobs for querying the data; a sample query follows the list below.
  • Schematized data store for housing large amounts of raw data.
  • SQL-like environment to execute analysis and querying tasks on raw data in HDFS.
  • Integration with external RDBMS applications.
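
As a minimal sketch (the table and column names here are hypothetical), a HiveQL session might first project a schema onto raw files already sitting in HDFS and then query them with familiar SQL:

  -- Define a schema over raw, tab-delimited log files already in HDFS
  CREATE EXTERNAL TABLE web_logs (
    ip     STRING,
    ts     STRING,
    url    STRING,
    status INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/raw/web_logs';

  -- A familiar SQL-style query; Hive compiles it into MapReduce jobs
  SELECT url, COUNT(*) AS hits
  FROM web_logs
  WHERE status = 200
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;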




EXTERNAL INTEGRATION:

Flume 
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS. Flume transports large quantities of event data using a streaming data-flow architecture that is fault tolerant and ready for failover recovery; a sample agent configuration follows the list below.
  • Transports large amounts of event data (network traffic, logs, email messages).
  • Streams data from multiple sources into HDFS.
  • Guarantees reliable, real-time data streaming to Hadoop applications.
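
Flume agents are wired together in a properties file. Below is a minimal, hypothetical configuration (the agent name and file paths are made up) for a single agent that tails an application log and delivers the events into HDFS:

  # Hypothetical Flume agent "a1": tail a log file and write events to HDFS
  a1.sources  = r1
  a1.channels = c1
  a1.sinks    = k1

  # Source: follow a log file as new lines arrive
  a1.sources.r1.type = exec
  a1.sources.r1.command = tail -F /var/log/app/app.log
  a1.sources.r1.channels = c1

  # Channel: buffer events in memory between source and sink
  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 10000

  # Sink: deliver events into HDFS, bucketed by day
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = /data/flume/app-logs/%Y-%m-%d
  a1.sinks.k1.hdfs.fileType = DataStream
  a1.sinks.k1.hdfs.useLocalTimeStamp = true
  a1.sinks.k1.channel = c1

Such an agent would typically be started with something like flume-ng agent --name a1 --conf-file <file>.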

DATA EXCHANGE:

Sqoop 
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external data stores such as relational databases and enterprise data warehouses. Sqoop is widely used in Big Data companies to transfer data between relational databases and Hadoop, and it works with relational databases such as Teradata, Netezza, Oracle, MySQL, and Postgres; a sample import command follows the list below.
  • Sqoop automates most of the process, relying on the database to describe the schema of the data to be imported.
  • Sqoop uses the MapReduce framework to import and export data, which provides parallelism as well as fault tolerance.
  • It provides connectors for all the major RDBMS databases.
  • It supports full/incremental loads, parallel export/import of data, and data compression.
  • It supports Kerberos security integration.
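
As a minimal sketch (the connection string, credentials, and table name are hypothetical), a parallel import from MySQL into HDFS might look like this:

  # Import a MySQL table into HDFS using four parallel map tasks
  sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username etl_user -P \
    --table orders \
    --target-dir /data/sqoop/orders \
    --num-mappers 4 \
    --compress

  # Incremental load: append only rows whose order_id exceeds the last value seen
  sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username etl_user -P \
    --table orders \
    --target-dir /data/sqoop/orders_delta \
    --incremental append \
    --check-column order_id \
    --last-value 100000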

COORDINATION:

Zookeeper
Apache Zookeeper is a coordination service for distributed applications that enables synchronization across a cluster. It is a centralized repository where distributed applications can store data and retrieve it (a small client sketch follows the list below).
  • Zookeeper is a Hadoop admin tool used for managing jobs in a cluster.
  • Zookeeper is akin to a watch guard: it notifies clients of any change of data in one node so that the others can react.
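
Here is a minimal sketch using the ZooKeeper Java client (the connection string and znode path are hypothetical): one client stores a value in a znode and registers a watch, and any update to that znode triggers a notification.

  import java.util.concurrent.CountDownLatch;
  import org.apache.zookeeper.*;

  public class ZkWatchDemo {
      public static void main(String[] args) throws Exception {
          CountDownLatch connected = new CountDownLatch(1);

          // Connect to the ensemble; the Watcher receives session and znode events
          ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
              System.out.println("Event: " + event);
              if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                  connected.countDown();
              }
          });
          connected.await();  // wait until the session is established

          // Store a piece of shared state in a znode
          String path = "/app/config";
          if (zk.exists(path, false) == null) {
              zk.create(path, "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
          }

          // Read it back and register a watch; the next update to this znode
          // fires a one-time notification to this client
          byte[] data = zk.getData(path, true, null);
          System.out.println("Current value: " + new String(data));

          // An update (from this or any other client) triggers the watch
          zk.setData(path, "v2".getBytes(), -1);

          Thread.sleep(1000);  // give the watch callback time to print
          zk.close();
      }
  }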

PROVISIONING, MANAGING & MONITORING HADOOP CLUSTERS:

Ambari
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It has a very simple yet highly interactive UI for installing various tools and performing management, configuration, and monitoring tasks. Ambari provides a dashboard for viewing cluster health, such as heat maps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner (a small example of querying cluster state follows the list below).
  • Maps master services to nodes.
  • We can always choose the services that we want to install.
  • Stack selection is made easy, and we can customize our services.
  • Apache Ambari provides a simpler interface and saves a lot of effort on installing, monitoring, and managing the many components, each with its own installation steps and monitoring controls.
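
As a small, hypothetical illustration (the host, credentials, and cluster name are made up), the cluster and service health shown on the dashboard can also be queried through Ambari's REST API:

  # List the clusters managed by this Ambari server
  curl -u admin:admin http://ambari-host:8080/api/v1/clusters

  # Check the state of the HDFS service in a cluster named "mycluster"
  curl -u admin:admin \
    http://ambari-host:8080/api/v1/clusters/mycluster/services/HDFS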

Conclusion:
Hadoop has been a very effective solution for companies dealing with extremely large volumes of data. It is a much sought-after tool in the industry for data management in distributed systems. As it is open source, it is readily available for companies to leverage for their use.

These are some highlights of Apache's Hadoop ecosystem; documentation for each of these projects is available on the Apache Software Foundation website.
Hadoop and its ecosystem are expected to grow further and take on new roles, even as other systems fill important niches.
