
SPARK SQL - Know How !!!

What is Spark SQL?

Apache Spark is without doubt one of the most successful projects the Apache Software Foundation has ever hosted. Spark SQL was introduced to integrate relational processing with Spark's functional programming API.


Spark SQL lets you query data through SQL or the Hive Query Language (HiveQL). Anyone familiar with an RDBMS will find its syntax easy to relate to, and locating tables and metadata is straightforward. Spark SQL is designed for structured and semi-structured data. Structured data has a schema with a known set of fields; when the schema is embedded in the data itself rather than kept separately, the data is called semi-structured (JSON is a common example).
Spark SQL definition – Put simply, Spark SQL is the Spark module used for structured and semi-structured data processing.
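
As a minimal sketch of what that means in practice (the file path and column names below are invented for illustration), this is roughly what running a SQL query through Spark SQL looks like in Scala:

    import org.apache.spark.sql.SparkSession

    object SparkSqlHello {
      def main(args: Array[String]): Unit = {
        // Entry point for Spark SQL since Spark 2.x
        val spark = SparkSession.builder()
          .appName("SparkSqlHello")
          .master("local[*]")          // run locally for the example
          .getOrCreate()

        // Read semi-structured JSON; the schema is inferred from the data
        val people = spark.read.json("people.json")   // hypothetical file

        // Register the DataFrame as a temporary table and query it with SQL
        people.createOrReplaceTempView("people")
        val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

        adults.show()
        spark.stop()
      }
    }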

Hive limitations

Apache Hive was originally designed to run on top of Apache Hadoop MapReduce. It had considerable limitations, such as:
1) To run ad-hoc queries, Hive internally launches MapReduce jobs, and MapReduce lags in performance even on medium-sized data sets.
2) If processing fails in the middle of a workflow, Hive cannot resume from the point of failure once the system returns to normal.
3) When trash is enabled, dropping an encrypted database in cascade leads to an execution error.
Spark SQL was designed to overcome these inefficiencies.

Architecture of Spark SQL

Spark SQL consists of three main layers:
Language API – Spark SQL is accessible through several language APIs, including Python, Scala, and Java, as well as HiveQL.
SchemaRDD – The RDD (resilient distributed dataset) is the core data structure Spark is built on. Since Spark SQL works with schemas, tables, and records, a SchemaRDD (now called a DataFrame) can be registered and used as a temporary table.
Data sources – For Spark core, a data source is usually a text file, an Avro file, and so on. Spark SQL supports richer sources such as JSON documents, Parquet files, Hive tables, and Cassandra databases (see the sketch below).
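
As a rough sketch of how these layers come together (the paths and table names are made up for illustration), the same read API covers several of these sources, and any of them can be exposed as a temporary table:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DataSources").getOrCreate()

    // JSON document: schema is inferred from the records
    val eventsJson = spark.read.json("events.json")          // hypothetical path

    // Parquet file: schema is embedded in the file itself
    val eventsParquet = spark.read.parquet("events.parquet") // hypothetical path

    // Hive table (requires a SparkSession built with enableHiveSupport())
    // val hiveEvents = spark.table("warehouse.events")

    // Each can be registered as a temporary table and queried with SQL
    eventsJson.createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events").show()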

Components of Spark SQL

Spark SQL DataFrames – The DataFrame, introduced in Spark 1.3, overcame some shortcomings of RDDs: there was no built-in provision for handling structured data and no optimization engine to take advantage of its structure, so developers had to hand-optimize each RDD based on its attributes. A Spark DataFrame is a distributed collection of data organized into named columns, conceptually similar to a table in a relational database. A short sketch follows.
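
A minimal sketch of working with a DataFrame (the column names and rows are made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("DataFrameExample").getOrCreate()
    import spark.implicits._

    // A distributed collection of rows with named columns, like a relational table
    val employees = Seq(
      ("Asha", "Engineering", 95000),
      ("Ravi", "Marketing",   60000),
      ("Mei",  "Engineering", 88000)
    ).toDF("name", "department", "salary")

    employees.printSchema()

    // Column-based operations instead of opaque functions over raw objects
    employees
      .filter(col("salary") > 70000)
      .groupBy("department")
      .count()
      .show()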
Spark SQL Datasets – The Dataset interface was added in Spark 1.6. Its appeal is that it combines the benefits of RDDs with the benefits of Spark SQL's optimized execution engine. An encoder handles the conversion between JVM objects and Spark's internal tabular representation. A Dataset can be constructed from JVM objects and then modified with functional transformations such as map and filter. The Dataset API is available in Scala and Java but is not supported in Python. A short sketch follows.
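
A minimal Scala sketch of the Dataset API (the case class and values are made up):

    import org.apache.spark.sql.SparkSession

    // A plain JVM class; an Encoder for it is derived automatically in Scala
    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().appName("DatasetExample").getOrCreate()
    import spark.implicits._   // brings the implicit Encoders into scope

    // Build a typed Dataset from JVM objects
    val people = Seq(Person("Asha", 34), Person("Ravi", 17)).toDS()

    // Functional transformations keep static typing while still going
    // through Spark SQL's optimized execution engine
    val adults = people.filter(_.age >= 18).map(_.name)
    adults.show()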
Spark Catalyst Optimizer – Catalyst is the query optimizer used by Spark SQL; every query written in SQL or with the DataFrame DSL passes through it. Because Catalyst can analyze and rewrite query plans, these APIs typically perform better than hand-written RDD code, which Spark cannot optimize in the same way, and so overall system performance improves.
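
You can see Catalyst at work by asking Spark to print a query plan. A small sketch, reusing the hypothetical employees DataFrame from the earlier example:

    // explain(true) prints the parsed, analyzed, and optimized logical plans
    // and the final physical plan produced by Catalyst
    employees
      .filter(col("salary") > 70000)
      .select("name")
      .explain(true)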

Features of Spark SQL

Let’s take a stroll through the aspects that make Spark SQL so popular in data processing.
Integrated – SQL queries mix easily with Spark programs. Structured data can be queried inside a Spark program using either SQL or the DataFrame API, so running SQL queries alongside analytic algorithms is straightforward because of this tight integration.
Hive compatibility – Existing Hive queries can be run as they are, since Spark SQL supports HiveQL along with UDFs (user-defined functions) and Hive SerDes. This gives access to existing Hive warehouses (see the sketch after this list).
Unified data access – Data can be loaded and queried from a variety of sources. SchemaRDDs (DataFrames) provide a single interface for working with structured data.
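
As a hedged sketch of the Hive-compatible path (it assumes a working Hive metastore, and the warehouse table name is invented for illustration):

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport() connects the session to the Hive metastore,
    // so existing Hive tables, UDFs, and SerDes become available
    val spark = SparkSession.builder()
      .appName("HiveCompat")
      .enableHiveSupport()
      .getOrCreate()

    // Plain HiveQL against an existing warehouse table (hypothetical name)
    spark.sql(
      "SELECT department, AVG(salary) FROM warehouse.employees GROUP BY department"
    ).show()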


Standard connectivity – Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity (a connection sketch follows the next point).
Performance and scalability – To make queries fast while scaling to hundreds of nodes on the Spark engine, Spark SQL incorporates a code generator, a cost-based optimizer, and columnar storage, and it provides full mid-query fault tolerance. Recall from the Hive limitations above that this kind of tolerance was missing in Hive. The Spark SQL interfaces also give Spark detailed information about the structure of the data and the computation being performed, which enables additional internal optimization. Hive queries often execute faster as well, because Spark SQL can read directly from multiple sources such as HDFS, Hive tables, and existing RDDs.
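
As a rough, hedged sketch of the standard-connectivity path: once the Spark Thrift JDBC/ODBC server has been started (for example via sbin/start-thriftserver.sh in a Spark distribution), any JDBC client can connect using the Hive JDBC driver. The host, port, and table name below are assumptions for illustration:

    import java.sql.DriverManager

    // The Thrift server speaks the HiveServer2 protocol, so the Hive JDBC
    // driver (org.apache.hive.jdbc.HiveDriver) is used on the client side
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("SELECT count(*) FROM events")  // hypothetical table

    while (rs.next()) {
      println(rs.getLong(1))
    }
    conn.close()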

Use cases

There is a lot to learn about how Spark SQL is applied in industry scenarios, but the three use cases below give a fair idea:
Twitter sentiment analysis – All the data first arrives through Spark Streaming. Spark SQL is then used to analyze everything about a topic, say Narendra Modi: every tweet mentioning Modi is collected, and Spark SQL classifies the tweets as very positive, positive, neutral, negative, or very negative. This is just one way sentiment analysis is done, and it is useful for target marketing, crisis management, and service adjustment.


Stock market analysis – Once data is being streamed in real time, it can also be processed in real time. Stock and market movements generate enormous amounts of data, and traders need an edge: an analytics framework that can process all of it in real time and surface the most rewarding stock or contract in the nick of time. As noted earlier, when a real-time analytics framework is needed, Spark and its components are the technology to consider.
Banking – Credit card fraud detection requires real-time processing. Suppose a purchase of 4,000 rupees is made by swiping a credit card in Bangalore, and within 5 minutes another purchase of 10,000 rupees is made in Kolkata with the same card. Banks can use the real-time analytics that Spark SQL enables to detect such fraud.

Conclusion


The Apache foundation has delivered a carefully thought-out component for real-time analytics. As the analytics world sees the shortcomings of Hadoop for real-time analytics, migrating to Spark becomes the obvious outcome; likewise, as the limitations of Hive become more and more apparent, users naturally shift to Spark SQL. It is worth noting that processing that takes 10 minutes via Hive can often be done in less than a minute with Spark SQL, and migration is easy because Spark SQL provides Hive support. Here lies a great opportunity for those who want to learn Spark SQL and DataFrames: there are still relatively few professionals with hands-on Spark experience, demand for the skill remains high, and those who learn it will be in great demand as the technology sees wider use in the future.
