
Apache Spark vs. Hadoop MapReduce

The term big data has already generated a lot of hype in the business world. Hadoop and Spark are both big data frameworks, and they provide some of the most popular tools for common big data tasks. In this article, we will cover the differences between Spark and Hadoop MapReduce.

Introduction

Spark: It is an open-source big data framework that provides a faster and more general-purpose data processing engine than MapReduce. Spark is designed for fast computation and covers a wide range of workloads, for example batch, interactive, iterative, and streaming.
Hadoop MapReduce: It is also an open-source framework, used for writing applications that process structured and unstructured data stored in HDFS. Hadoop MapReduce is designed to process large volumes of data on a cluster of commodity hardware, and it processes data in batch mode only.

Data Processing

Spark: Apache Spark is a good fit for both batch processing and stream processing, meaning it’s a hybrid processing framework. Spark speeds up batch processing via in-memory computation and processing optimization. It’s a nice alternative for streaming workloads, interactive queries, and machine learning. Spark can also work with Hadoop and its modules. Its real-time data processing capability makes Spark a top choice for big data analytics.
Its resilient distributed dataset (RDD) abstraction allows Spark to keep data in memory transparently and to go to disk only when needed, for example when the data does not fit in memory. As a result, much of the time that would otherwise be spent on disk reads and writes is saved.
Hadoop: Apache Hadoop provides batch processing. The Hadoop project has invested heavily in new algorithms and a component stack that make large-scale batch processing more accessible.
MapReduce is Hadoop’s native batch processing engine. Several components or layers (like YARN, HDFS, etc.) in modern versions of Hadoop make batch data easy to process. Because MapReduce reads its input from disk and writes its results back to disk, it is not limited by available memory and can handle very large datasets. MapReduce is scalable and has proved that it can run on clusters of tens of thousands of nodes. However, Hadoop’s data processing is slow because MapReduce runs as a series of sequential steps, each of which reads from and writes to disk.
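The sequential steps of a MapReduce job can be sketched in plain Python. This is only a toy, single-process model of the map, shuffle, and reduce phases, not the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    # The phases run strictly one after another. In this toy everything stays
    # in memory; on a real cluster intermediate results go to disk instead.
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical sample input, standing in for lines of a file stored in HDFS.
lines = ["spark and hadoop", "hadoop mapreduce", "spark streaming and spark sql"]
counts = word_count(lines)
print(counts)
```

An iterative algorithm would have to repeat this whole pipeline once per iteration, which is where the disk round trips between steps become expensive.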

Real-Time Analysis

Spark: It can process real-time data, i.e. data coming from real-time event streams at the rate of millions of events per second, such as Twitter and Facebook data. Spark’s strength lies in its ability to process live streams efficiently.
Hadoop MapReduce: MapReduce fails when it comes to real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.
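One way to picture stream processing on top of a batch engine is micro-batching: the incoming stream is cut into small batches, and each batch is processed as a tiny batch job. The sketch below is a plain-Python illustration of that idea, not the Spark Streaming API. It cuts batches by event count for simplicity, whereas a real engine would typically cut them by time interval; the function names and the hashtag data are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Split a (possibly unbounded) event iterator into small fixed-size batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_hashtag_counts(events, batch_size=3):
    """Process each micro-batch as a small batch job and update running totals."""
    totals = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)
        # Emit a snapshot after every batch, so results arrive continuously.
        yield dict(totals)


# Hypothetical event stream, standing in for a live social media feed.
stream = ["#spark", "#hadoop", "#spark", "#kafka", "#spark", "#hadoop", "#hive"]
snapshots = list(running_hashtag_counts(stream, batch_size=3))
for snapshot in snapshots:
    print(snapshot)
```

A classic MapReduce job, by contrast, would only produce a result once the entire input had been collected and processed.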

Ease of Use

Spark: Spark is easier to use than Hadoop, as it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL. Since Spark provides a way to perform streaming, batch processing, and machine learning in the same cluster, users find it easy to simplify their infrastructure for data processing. An interactive REPL (Read-Eval-Print Loop) allows Spark users to get instant feedback for commands.
Hadoop: Hadoop, on the other hand, is written in Java, is difficult to program, and requires abstractions. Although there is no interactive mode available with Hadoop MapReduce, tools like Pig and Hive make it easier for adopters to work with it.

Graph Processing

Spark: Spark comes with a graph computation library called GraphX to make things simple. In-memory computation coupled with built-in graph support allows graph algorithms to perform much better than equivalent MapReduce programs. Netty (and, in Spark versions before 2.0, Akka) makes it possible for Spark to distribute messages among the executors.
Hadoop: Most graph processing algorithms, like PageRank, perform multiple iterations over the same data. MapReduce reads data from disk and, after each iteration, writes the results to HDFS, then reads the data from HDFS again for the next iteration. This round trip increases latency and makes graph processing slow.
To evaluate the score of a particular node, the computation needs the scores of its neighboring nodes, which means neighbors have to pass messages to each other. MapReduce doesn’t have any mechanism for that. Although there are fast and scalable tools like Pregel and GraphLab for efficient graph processing algorithms, they aren't suitable for complex multi-stage algorithms.
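To make the iteration and message-passing pattern concrete, here is a minimal PageRank sketch in plain Python. It is not GraphX or MapReduce code, and it assumes a small graph held in a dictionary in which every node has at least one outgoing link and every neighbor is itself a key. The damping factor of 0.85 and the three-node graph are illustrative choices.

```python
def pagerank(links, iterations=20, damping=0.85):
    """Iterative PageRank over an adjacency dict {node: [outgoing neighbors]}."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Message passing: each node sends an equal share of its current
        # score to every node it links to.
        incoming = {node: 0.0 for node in nodes}
        for node, neighbors in links.items():
            share = ranks[node] / len(neighbors)
            for neighbor in neighbors:
                incoming[neighbor] += share
        # A node's new score depends only on the messages it received.
        ranks = {
            node: (1 - damping) / n + damping * incoming[node] for node in nodes
        }
    return ranks


# Hypothetical three-page graph: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

Here the ranks dictionary stays in memory across all iterations. A chain of MapReduce jobs would instead write it to HDFS at the end of each loop pass and read it back at the start of the next one.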

Fault Tolerance

Spark: Spark relies on RDDs for fault tolerance, which keeps network I/O to a minimum. Each RDD records its lineage, that is, the sequence of transformations used to build it. If a partition of an RDD is lost, Spark rebuilds that partition by replaying the lineage on the original data. So, Spark does not need to replicate intermediate data for fault tolerance.
Hadoop: Hadoop achieves fault tolerance through replication of data blocks in HDFS. In the first version of MapReduce, the JobTracker and TaskTrackers monitor tasks and reschedule the ones that fail. In the second version (YARN), TaskTracker and JobTracker have been replaced by NodeManager and ResourceManager/ApplicationMaster, respectively.
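The idea of rebuilding a lost partition from recorded transformations can be sketched with a toy class. This is not Spark's RDD implementation; ToyRDD, its methods, and the sample data are hypothetical. It assumes the source partitions remain available, for example as replicated files on HDFS.

```python
class ToyRDD:
    """A toy dataset that remembers how each partition was derived (its lineage)."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # stable input data
        self.transforms = tuple(transforms)  # ordered per-element functions
        self.cache = {}  # partitions currently held in memory

    def map(self, fn):
        # A transformation only records a new lineage step; nothing runs yet.
        return ToyRDD(self.source_partitions, self.transforms + (fn,))

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if not cached."""
        if index not in self.cache:
            data = self.source_partitions[index]
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that wipes one in-memory partition."""
        self.cache.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = rdd.compute(1)
rdd.lose_partition(1)  # the cached copy of partition 1 is gone
after = rdd.compute(1)  # rebuilt by replaying the two map steps
print(before, after)
```

Only the lost partition is recomputed; partition 0 is unaffected. Replication pays its cost up front instead, by storing several copies of the data.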

Security

Spark: Spark’s security is currently in its infancy, offering only authentication support through shared secret (password authentication). However, organizations can run Spark on HDFS to take advantage of HDFS ACLs and file-level permissions.
Hadoop MapReduce: Hadoop MapReduce has better security features than Spark. Hadoop supports Kerberos authentication, which is a good security feature but difficult to manage. Hadoop MapReduce can also integrate with Hadoop security projects, like Knox Gateway and Sentry. Third-party vendors also allow organizations to use Active Directory Kerberos and LDAP for authentication. Hadoop’s Distributed File System is compatible with access control lists (ACLs) and a traditional file permissions model.

Cost

Both Hadoop and Spark are open-source projects and are therefore free to use. However, Spark uses large amounts of RAM to run everything in-memory, and RAM is more expensive than hard disks. Hadoop is disk-bound, so it saves the cost of buying expensive RAM, but it requires more machines to spread the disk I/O across.
As far as costs are concerned, organizations need to look at their requirements. If the job is mainly about processing very large volumes of data, Hadoop will be cheaper, since hard disk space costs much less than memory.

Compatibility

Both Hadoop and Spark are compatible with each other. Spark can integrate with all the data sources and file formats that are supported by Hadoop. So, it’s not wrong to say that Spark’s compatibility with data types and data sources is similar to that of Hadoop MapReduce.
Both Hadoop and Spark are scalable. One may think of Spark as a better choice than Hadoop. However, MapReduce turns out to be a good choice for businesses that need huge datasets brought under control by commodity systems. Both frameworks are good in their own sense. Hadoop has its own file system that Spark lacks, and Spark provides a way for real-time analytics that Hadoop does not possess.
Hence, the differences between Apache Spark and Hadoop MapReduce show that Apache Spark is a much more advanced cluster computing engine than MapReduce. Spark can handle many types of workloads (i.e. batch, interactive, iterative, streaming, graph), while MapReduce is limited to batch processing.
