
Tapping Into the “Long Tail” of Big Data

Variety, not volume or velocity, drives big-data investments.


Gartner defines big data as the three Vs: high-volume, high-velocity, high-variety information assets. While all three Vs are growing, variety is becoming the single biggest driver of big-data investments, as seen in the results of a recent survey by New Vantage Partners. This trend will continue as firms seek to integrate more sources and focus on the “long tail” of big data. From schema-free JSON to nested types in other databases (relational and NoSQL), to non-flat data (Avro, Parquet, XML), data formats are multiplying and connectors are becoming crucial. In 2017, analytics platforms will be evaluated based on their ability to provide live, direct connectivity to these disparate sources.
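To make the “variety” point concrete, here is a minimal Python sketch of pulling a few such formats into one analysis-ready table. The file names and the shared customer_id key are hypothetical, and it assumes pandas with a Parquet engine (pyarrow) available; it is an illustration of the pattern, not a prescribed pipeline.

```python
import pandas as pd

# File names below are placeholders; each stands in for a different "long tail" source.
events  = pd.read_json("clickstream_events.json", lines=True)  # schema-free, newline-delimited JSON
metrics = pd.read_parquet("sensor_metrics.parquet")            # columnar Parquet (needs pyarrow or fastparquet)
claims  = pd.read_xml("insurance_claims.xml")                  # nested XML records (pandas >= 1.3)

# Assuming each source carries a shared customer_id key, join them for downstream analysis.
combined = (
    events
    .merge(metrics, on="customer_id", how="left")
    .merge(claims, on="customer_id", how="left")
)
print(combined.head())
```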


When asked about drivers of Big Data success, 69% of corporate executives named greater data variety as the most important factor, followed by volume (25%), with velocity (6%) trailing. In the corporate world, the big opportunity is to be found in integrating more sources of data, not bigger amounts. Variety, not volume, is king. MIT professor and 2014 Turing Award recipient Michael Stonebraker calls this the “long tail” of Big Data, as companies focus on integrating sources of data that have traditionally been ignored, as well as identifying new data sources. Stonebraker cites the example of life sciences firms with thousands of research scientists, each with their own research databases that have not been tied together for analysis in the past. Tapping into more data sources has emerged as the new data frontier within the corporate world.
How are corporations focusing their data management efforts to develop more robust data and analytics? There are 3 primary paths that firms are taking:



Capture Legacy Data Sources

It may come as a surprise, but many firms see the big opportunity in Big Data as coming from the capture of traditional legacy data sources that have gone untapped in the past. These are data sets that have typically sat outside the purview of traditional data marts or warehouses — the “long tail” data. A significant majority (57%) of firms identified this as their top data priority. One of the beauties of Big Data is that organizations can now go deeper into their own data before they turn to new sources.
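As a rough illustration of what capturing long-tail legacy sources can look like in practice, the sketch below sweeps a directory of old flat-file exports into a single consolidated table. The directory name and lineage column are assumptions for the example, not a recommended architecture.

```python
from pathlib import Path
import pandas as pd

# Hypothetical directory of legacy flat-file exports that never made it
# into the warehouse (departmental extracts, old system dumps, etc.).
legacy_dir = Path("legacy_exports")

frames = []
for csv_file in sorted(legacy_dir.glob("*.csv")):
    df = pd.read_csv(csv_file)
    df["source_file"] = csv_file.name  # keep lineage so analysts know where each row came from
    frames.append(df)

# Consolidate into one analysis-ready table (Parquet preserves types and compresses well).
if frames:
    consolidated = pd.concat(frames, ignore_index=True)
    consolidated.to_parquet("legacy_consolidated.parquet", index=False)
    print(f"Captured {len(consolidated)} rows from {len(frames)} legacy files")
```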

Integrate Unstructured Data

Businesses have been inhibited in their ability to mine and analyze the vast amounts of information residing in text and documents. Traditional data environments were designed to maintain and process structured data — numbers and variables — not words and pictures. A growing percentage of firms (29%) are now focusing on integrating this unstructured data, for purposes ranging from customer sentiment analysis to analysis of regulatory documents to insurance claim adjudication. The ability to integrate unstructured data is broadening traditional analytics to combine quantitative metrics with qualitative content.
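A toy-sized illustration of that combination: the keyword-based sentiment_score below is only a stand-in for a real text-analytics model, and the orders and comments tables are invented, but it shows the basic pattern of deriving a quantitative signal from free text and joining it onto structured records.

```python
import pandas as pd

# Toy stand-in for a real sentiment model: score free-text comments by
# counting hypothetical positive and negative keywords.
POSITIVE = {"great", "fast", "helpful", "easy"}
NEGATIVE = {"slow", "broken", "confusing", "refund"}

def sentiment_score(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical structured orders table and unstructured support comments.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "order_total": [120.0, 45.5, 310.0],
})
comments = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "comment": ["great service, very fast", "checkout was slow and confusing", "easy to use"],
})

comments["sentiment"] = comments["comment"].map(sentiment_score)

# Quantitative metrics joined with a qualitative signal in one view.
combined = orders.merge(comments[["customer_id", "sentiment"]], on="customer_id")
print(combined)
```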

Add Social Media and Behavioral Data Sources

While much of the early excitement around Big Data resulted from the capture of social media and behavioral activities by firms like eBay and Facebook, these applications have been relatively nascent among the Fortune 1,000, with just 14% citing this as a priority. As firms progress with their Big Data efforts, it is likely that they will turn their attention to untapped opportunities presented by social data in areas such as patient adherence and mobile device recommendations based on consumer purchasing behavior and preferences. Timely recommendations can yield immediate results.
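To sketch what a behavior-driven recommendation might look like at its simplest, the snippet below builds item co-occurrence counts from hypothetical purchase histories. Real systems use far richer behavioral and preference signals, so treat this purely as an illustration of the idea.

```python
from collections import Counter, defaultdict

# Hypothetical purchase histories keyed by customer.
purchases = {
    "alice": ["yoga mat", "water bottle", "resistance bands"],
    "bob": ["yoga mat", "water bottle"],
    "carol": ["water bottle", "running shoes"],
}

# Count how often items are bought together (simple co-occurrence signal).
co_occurrence = defaultdict(Counter)
for items in purchases.values():
    for item in items:
        for other in items:
            if other != item:
                co_occurrence[item][other] += 1

def recommend(item, k=2):
    """Return the k items most often bought alongside the given item."""
    return [other for other, _ in co_occurrence[item].most_common(k)]

print(recommend("yoga mat"))  # e.g. ['water bottle', 'resistance bands']
```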
As mainstream companies progress on their Big Data journey, we should expect that expanding the variety of data sources for analysis will continue to dominate their interests.
