Apache Spark

The growth of data has given rise to several open-source projects that have matured into world-class frameworks. Apache Spark is one such open-source cluster-computing framework; it originated in 2009 at UC Berkeley's AMPLab. The framework has gained popularity among developers and data scientists because of its speed. Let's look at the framework in more detail and see how it compares with MapReduce.

What is Apache Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Spark is a framework that lets you process large amounts of data, much like MapReduce. Its core components are listed below (a short PySpark sketch follows the list):

  • Spark Core – The foundation of the project. Provides basic functionality such as task dispatching, scheduling, I/O, and the core APIs.
  • Spark SQL – For SQL and structured data processing
  • MLlib – Machine learning algorithms
  • GraphX – Graph processing
  • Spark Streaming – Stream processing and analysis
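
To make the pieces concrete, here is a minimal sketch in PySpark (assuming a local Spark installation with the pyspark package; the data and app name are made up for illustration) that touches Spark Core through an RDD and Spark SQL through a DataFrame:

```python
# Minimal sketch: Spark Core (RDDs) and Spark SQL (DataFrames) side by side.
# Assumes `pyspark` is installed; runs in local mode with illustrative data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("components-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext  # the Spark Core entry point

# Spark Core: a word count expressed as RDD transformations
words = sc.parallelize(["spark", "mapreduce", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # [('spark', 2), ('mapreduce', 1)] (order may vary)

# Spark SQL: the same aggregation over a DataFrame
df = spark.createDataFrame([(w,) for w in ["spark", "mapreduce", "spark"]], ["word"])
df.groupBy("word").count().show()

spark.stop()
```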

Apache Spark – Cluster Overview [Source: https://spark.apache.org/docs/latest/cluster-overview.html]

The driver program can connect to a variety of cluster managers, including Hadoop YARN, Apache Mesos, and Spark's own standalone manager. The cluster manager then allocates tasks to the worker nodes in the cluster.
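
As a sketch of how an application selects its cluster manager: the master URL passed to the SparkSession (or to spark-submit) decides which manager the driver connects to. The host names in the commented lines below are hypothetical placeholders:

```python
# The master URL selects the cluster manager; only local[*] will actually
# run on a single machine. Host names in the comments are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")                      # local threads, no cluster manager
    # .master("spark://spark-master:7077")   # Spark's standalone manager
    # .master("yarn")                        # Hadoop YARN (needs HADOOP_CONF_DIR)
    # .master("mesos://mesos-master:5050")   # Apache Mesos
    .getOrCreate()
)
print(spark.sparkContext.master)
spark.stop()
```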

The major advantage of Apache Spark is its memory-based execution. It tries to complete all data processing in memory and touches the disk only when memory is not available. This is in sharp contrast to its counterpart, MapReduce, which is a disk-based data processing framework. Thanks to this difference, Apache Spark can run certain workloads up to 100x faster than MapReduce. The higher memory requirements may make the infrastructure costlier, but for large datasets the speed often offsets that cost.
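
This memory-first behavior is visible in Spark's persistence API. A minimal sketch, assuming local mode and toy data: the MEMORY_AND_DISK storage level asks Spark to keep the dataset in RAM and spill to disk only if it does not fit:

```python
# persist() with MEMORY_AND_DISK caches in RAM and spills to disk only
# when memory runs out. The dataset here is toy data for illustration.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()

df = spark.range(1_000_000)               # a million synthetic rows
df.persist(StorageLevel.MEMORY_AND_DISK)  # memory first, disk as overflow

print(df.count())  # first action computes and caches the data
print(df.count())  # second action is served from the cache

df.unpersist()
spark.stop()
```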

Spark can use components of the Hadoop ecosystem, such as HDFS and HBase, as storage backends for its Resilient Distributed Datasets (RDDs). RDDs are fault-tolerant by design: each one records the lineage of transformations that produced it, so a lost partition can be recomputed rather than restored from a replica.
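
For illustration, here is a sketch of building an RDD on top of HDFS; the hdfs:// path and host are hypothetical, so running it requires a reachable HDFS cluster. The lineage that makes the RDD fault-tolerant is what toDebugString() prints:

```python
# Hedged sketch: an RDD backed by a file in HDFS. The path is hypothetical.
# If a partition is lost, Spark recomputes it from the recorded lineage
# (textFile -> filter) instead of relying on replicas of the RDD itself.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://namenode:8020/data/events.log")  # hypothetical path
errors = lines.filter(lambda line: "ERROR" in line)

print(errors.toDebugString())  # the lineage Spark would replay on failure
print(errors.count())

spark.stop()
```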

Another advantage is Spark's native bindings for popular languages: Python, Java, Scala, and R.

You may want to read this article, which compares Hadoop (essentially MapReduce) with Spark on points such as compatibility, performance, ease of use, and cost.
