
Posts

Showing posts from August, 2017

Basics Of Hadoop Ecosystem - 1

Basics of Hadoop Ecosystem – 1 Developed back in 2005 by Doug Cutting and Mike Cafarella, Hadoop is an open-source framework written in Java to support and process humongous data in a distributed computing environment. Built on commodity hardware, Hadoop works on the basic assumption that hardware failures are common; these failures are handled by the Hadoop Framework.  Click here  to read Part 2... What is Hadoop Ecosystem? The Hadoop Ecosystem refers to the various components of the Apache Hadoop software library. It is a set of tools and accessories to address particular needs in pr
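The fault-tolerance assumption above can be sketched in a few lines of Python. This is a toy model of HDFS-style block replication, not the real HDFS API; all names here (`place_replicas`, `surviving_copies`, the node labels) are invented for illustration.

```python
# Toy sketch of HDFS-style block replication: each block is stored on
# several nodes, so losing a single node loses no data.
# Hypothetical names only; this is not the real HDFS API.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def surviving_copies(placement, failed_node):
    """Copies of each block that remain after one node fails."""
    return {b: [n for n in ns if n != failed_node] for b, ns in placement.items()}

placement = place_replicas(["blk_1", "blk_2"], ["node-a", "node-b", "node-c", "node-d"])
after_failure = surviving_copies(placement, "node-a")
# With 3 replicas, every block still has at least 2 live copies
# after any single node failure.
assert all(len(ns) >= 2 for ns in after_failure.values())
```

With the default replication factor of three, the cluster tolerates the loss of any one node without losing a block, which is exactly why commodity hardware is acceptable.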

Hadoop As Big Data

Introduction: In this blog, I will discuss Big Data, its characteristics, different sources of Big Data, and some key components of the Hadoop Framework. In the two-part blog series, I will cover the basics of the Hadoop Ecosystem. Let us start with Big Data and its importance in the Hadoop Framework. Ethics, privacy, and security measures are very important and must be taken care of while dealing with the challenges of Big Data. Big Data: when the data itself becomes part of the problem. Data is crucial for all organizations, and it has to be stored for future use. The term Big Data refers to data that is beyond the storage capacity and the processing power of an organization. What are the sources of this huge data? There are different sources, such as social networks, CCTV cameras, sensors, online shopping portals, hospitality data, GPS, the automobile industry, etc., that generate data massively. Big Data can be characterized as: The Volume of the Data Velocity of

Basics Of Hadoop Ecosystem - 2

Basics of Hadoop Ecosystem – 2 Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It has a very simple yet highly interactive UI to install various tools and perform various management, configuration, and monitoring tasks.  Introduction: In part 2 of the blog series I will cover the other core components of the Hadoop Framework, including querying, external integration, data exchange, coordination, and managing as well as monitoring of Hadoop clusters. Please refer to Basics of Hadoop Ecosystem  Part 1 ... QUERYING: Pig seems quite useful; however, I am more of an SQL person. For those of us who still like SQL, we have SQL for Hadoop. HIVE Hive is a distributed data warehouse built atop HDFS to manage and organize large amounts of data. It provides a query language based on SQL semantics  (HiveQL)  which is translated by the runtime engine to MapReduce jobs for querying the data. Schematized data store 
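The MapReduce pattern that Hive compiles HiveQL down to can be sketched in plain Python. This is a minimal word-count illustration of the map and reduce phases, not the actual Hive runtime or Hadoop API; no cluster is involved.

```python
# Minimal sketch of the MapReduce pattern that a HiveQL query such as
# "SELECT word, COUNT(*) ... GROUP BY word" compiles down to.
# Pure Python, illustrative only; no Hadoop required.
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (key, 1) pair for every word in every input line."""
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Reducer: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_phase(map_phase(["big data", "big clusters"]))
# counts == {"big": 2, "data": 1, "clusters": 1}
```

In real Hadoop the pairs emitted by the mappers are shuffled across the cluster so that all values for one key reach the same reducer; here a single dictionary plays that role.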

Tapping Into the “Long Tail” of Big Data

Variety, not volume or velocity, drives big-data investments. Gartner defines big data as the three Vs: high-volume, high-velocity, high-variety information assets. While all three Vs are growing, variety is becoming the single biggest driver of big-data investments, as seen in the results of a recent survey by New Vantage Partners. This trend will continue to grow as firms seek to integrate more sources and focus on the “long tail” of big data. From schema-free JSON, to nested types in other databases (relational and NoSQL), to non-flat data (Avro, Parquet, XML), data formats are multiplying and connectors are becoming crucial. In 2017, analytics platforms will be evaluated based on their ability to provide live, direct connectivity to these disparate sources. Tapping Into the “Long Tail” of Big Data When asked about drivers of Big Data success, 69% of corporate executives named greater data variety as the most important factor, followed by volume (25%), with velocity (6%)
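The “variety” point above — schema-free JSON with nested types that connectors must map to flat columns — can be illustrated with Python's standard library alone. This is a sketch of the flattening step; the record and its field names are invented for the example.

```python
# Sketch: flattening one nested, schema-free JSON record into flat,
# dotted column names -- the kind of work a format connector does.
# The record and field names here are invented for illustration.
import json

def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

raw = '{"user": {"id": 7, "geo": {"city": "Pune"}}, "amount": 42}'
row = flatten(json.loads(raw))
# row == {"user.id": 7, "user.geo.city": "Pune", "amount": 42}
```

Non-flat formats like Avro and Parquet pose the same problem with stronger schemas; the connector's job in every case is to present nested structure as queryable columns.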

What is Big Data?

What is Big Data? It is now time to answer an important question – What is Big Data? Big data, as defined by Wikipedia, is this: “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.” In simple terms, Big Data is data that has the 3 characteristics that we mentioned in the last section – • It is big – typically in terabytes or even petabytes • It is varied – it could be a traditional database, it could be video data, log data, text data or even voice data • It keeps increasing as new data keeps flowing in This kind of data is becoming commonplace in many fields including Science,