Industries use Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce) and enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. The main concern is maintaining speed when processing large datasets, both in the waiting time between queries and in the waiting time to run a program.

Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.

Contrary to a common belief, Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.

Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management, it uses Hadoop for storage purposes only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends it to support more types of computation efficiently, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
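To make the MapReduce model mentioned above concrete, here is a minimal word-count sketch in plain Python. It is only a conceptual illustration of the map, shuffle and reduce phases on a single machine. It does not use Hadoop or Spark, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases, as a MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input data, for illustration only.
counts = word_count(["spark extends mapreduce", "spark runs in memory"])
print(counts["spark"])
```

In a real cluster each phase runs in parallel across many machines, and classic MapReduce writes the output of each job to disk before the next job starts. That disk round trip is the cost Spark tries to avoid by keeping data in memory.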

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark began in 2009 as a research project in UC Berkeley's AMPLab, started by Matei Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.

Features of Apache Spark

Apache Spark has the following features.

Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. This is possible because it reduces the number of read/write operations to disk and stores intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
Advanced Analytics − Spark supports more than ‘map’ and ‘reduce’. It also supports SQL queries, streaming data, machine learning (ML) and graph algorithms.
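The speed claim rests on keeping intermediate data in memory instead of re-reading it from disk. The sketch below models that idea in plain Python. LazyDataset is a hypothetical toy class written for this example, not Spark's RDD, and read_from_disk only simulates an expensive read by counting how often it is called. In Spark itself the comparable step is calling cache() or persist() on a dataset.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical, not Spark's RDD)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the records
        self._cached = None       # filled on first use after cache()
        self._use_cache = False

    def map(self, fn):
        # Nothing runs here; the work happens when collect() is called.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Ask the dataset to keep its records in memory after first use.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


disk_reads = 0


def read_from_disk():
    """Simulates an expensive read from external storage."""
    global disk_reads
    disk_reads += 1
    return list(range(10))


def run_iterations(dataset, passes):
    """Runs several passes over the same intermediate dataset."""
    return [sum(dataset.map(lambda x: x * i).collect()) for i in range(passes)]


source = LazyDataset(read_from_disk)
squares = source.map(lambda x: x * x)   # intermediate result

run_iterations(squares, 3)
reads_without_cache = disk_reads        # source is re-read on every pass

disk_reads = 0
run_iterations(squares.cache(), 3)
reads_with_cache = disk_reads           # source is read only once
```

Without caching, every pass recomputes the intermediate dataset from the source. With caching, only the first pass pays for the read, which is why iterative algorithms benefit the most.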

Spark Built on Hadoop

The following diagram shows three ways in which Spark can be built with Hadoop components.

[Figure: three ways Spark can be built with Hadoop components]

There are three ways of Spark deployment as explained below.

Standalone − In a Spark Standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN − In a Hadoop YARN deployment, Spark simply runs on YARN without any pre-installation or root access. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs, in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.

Components of Spark

The following illustration depicts the different components of Spark.

[Figure: components of Spark]

Apache Spark Core

Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.

Spark Streaming

Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
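The mini-batch idea can be sketched in a few lines of plain Python. This is a conceptual model only, not the Spark Streaming API. The helper names (mini_batches, count_words, process_stream) are hypothetical, and the sketch cuts batches by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Cut an iterable of records into consecutive lists of at most batch_size."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    """The per-batch transformation: count the words in one mini-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def process_stream(stream, batch_size):
    """Apply the same transformation to every mini-batch, in arrival order."""
    return [count_words(batch) for batch in mini_batches(stream, batch_size)]


# Hypothetical log lines standing in for a live stream.
events = ["error disk", "ok", "error net", "ok", "error disk"]
results = process_stream(events, batch_size=2)
```

Because each mini-batch is processed with ordinary batch logic, the same code can serve batch and streaming workloads, at the cost of a latency at least as long as one batch.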

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework on top of Spark that benefits from Spark's distributed, memory-based architecture. According to benchmarks run by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).

GraphX

GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction. It also provides an optimized runtime for this abstraction.
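The Pregel abstraction expresses a graph algorithm as a series of supersteps: each vertex reads the messages sent to it, updates its own value, and sends new messages to its neighbors. The sketch below applies that pattern to single-source shortest paths in plain Python. It is a single-machine illustration under simple assumptions (non-negative edge weights, a graph stored as a dictionary), not the GraphX API, and the function name pregel_shortest_paths is hypothetical.

```python
import math


def pregel_shortest_paths(edges, source):
    """Pregel-style shortest paths.

    edges maps each vertex to a list of (neighbor, weight) pairs.
    Returns (distance per vertex, number of supersteps executed).
    """
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(neighbor for neighbor, _ in targets)

    distance = {vertex: math.inf for vertex in vertices}
    inbox = {source: [0.0]}          # initial message to the source vertex
    supersteps = 0

    while inbox:                     # stop when no vertex receives a message
        supersteps += 1
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improved, so it notifies its neighbors.
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox

    return distance, supersteps


# Hypothetical three-vertex graph, for illustration only.
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
distances, steps = pregel_shortest_paths(graph, "a")
```

In a distributed runtime the vertices are partitioned across machines and the messages travel over the network between supersteps. The loop ends when no messages remain, which is the Pregel notion of all vertices voting to halt.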

This brings us to the end of the blog. This Tecklearn ‘Overview of Apache Spark Framework’ helps you with commonly asked questions if you are looking for a job as an Apache Spark and Scala or Big Data developer. If you wish to learn Apache Spark and Scala and build a career in the Big Data Hadoop domain, check out our interactive Apache Spark and Scala Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/apache-spark-and-scala-certification/

Apache Spark and Scala Training

About the Course

Tecklearn Spark training lets you master real-time data processing using Spark streaming, Spark SQL, Spark RDD and Spark Machine Learning libraries (Spark MLlib). This Spark certification training helps you master the essential skills of the Apache Spark open-source framework and Scala programming language, including Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Shell Scripting Spark. You will also understand the role of Spark in overcoming the limitations of MapReduce. Upon completion of this online training, you will hold a solid understanding and hands-on experience with Apache Spark.

Why Should you take Apache Spark and Scala Training?

The average salary for an Apache Spark developer ranges from approximately $93,486 per year for a Developer to $128,313 per year for a Data Engineer (Indeed.com).
Wells Fargo, Microsoft, Capital One, Apple, JPMorgan Chase and many other MNCs worldwide use Apache Spark across industries.
Global Spark market revenue will grow to $4.2 billion by 2022 with a CAGR of 67% (Marketanalysis.com).

What you will Learn in this Course?

Introduction to Scala for Apache Spark

What is Scala
Why Scala for Spark
Scala in other Frameworks
Scala REPL
Basic Scala Operations
Variable Types in Scala
Control Structures in Scala
Loop, Functions and Procedures
Collections in Scala
Array Buffer, Map, Tuples, Lists

Functional Programming and OOPs Concepts in Scala

Functional Programming
Higher Order Functions
Anonymous Functions
Class in Scala
Getters and Setters
Custom Getters and Setters
Constructors in Scala
Singletons
Extending a Class using Method Overriding

Introduction to Spark

Introduction to Spark
How Spark overcomes the drawbacks of MapReduce
Concept of In Memory MapReduce
Interactive operations on MapReduce
Understanding Spark Stack
HDFS Revision and Spark Hadoop YARN
Overview of Spark and Why it is better than Hadoop
Deployment of Spark without Hadoop
Cloudera distribution and Spark history server

Basics of Spark

Spark Installation guide
Spark configuration and memory management
Driver Memory Versus Executor Memory
Working with Spark Shell
Resilient distributed datasets (RDD)
Functional programming in Spark and Understanding Architecture of Spark

Playing with Spark RDDs

Challenges in Existing Computing Methods
Probable Solution and How RDD Solves the Problem
What is RDD, Its Operations, Transformations & Actions
Data Loading and Saving Through RDDs
Key-Value Pair RDDs
Other Pair RDDs and Two Pair RDDs
RDD Lineage
RDD Persistence
Using RDD Concepts Write a Wordcount Program
Concept of RDD Partitioning and How It Helps Achieve Parallelization
Passing Functions to Spark

Writing and Deploying Spark Applications

Creating a Spark application using Scala or Java
Deploying a Spark application
Scala built application
Creating application using SBT
Deploying application using Maven
Web user interface of Spark application
A real-world example of Spark and configuring of Spark

Parallel Processing

Concept of Spark parallel processing
Overview of Spark partitions
File Based partitioning of RDDs
Concept of HDFS and data locality
Technique of parallel operations
Comparing coalesce and Repartition and RDD actions

Machine Learning using Spark MLlib

Why Machine Learning
What is Machine Learning
Applications of Machine Learning
Face Detection: USE CASE
Machine Learning Techniques
Introduction to MLlib
Features of MLlib and MLlib Tools
Various ML algorithms supported by MLlib

Integrating Apache Flume and Apache Kafka

Why Kafka, what is Kafka and Kafka architecture
Kafka workflow and Configuring Kafka cluster
Basic operations and Kafka monitoring tools
Integrating Apache Flume and Apache Kafka

Apache Spark Streaming

Why Streaming is Necessary
What is Spark Streaming
Spark Streaming Features
Spark Streaming Workflow
Streaming Context and DStreams
Transformations on DStreams
Describe Windowed Operators and Why it is Useful
Important Windowed Operators
Slice, Window and ReduceByWindow Operators
Stateful Operators

Improving Spark Performance

Learning about accumulators
The common performance issues and troubleshooting the performance problems

DataFrames and Spark SQL

Need for Spark SQL
What is Spark SQL
Spark SQL Architecture
SQL Context in Spark SQL
User Defined Functions
Data Frames and Datasets
Interoperating with RDDs
JSON and Parquet File Formats
Loading Data through Different Sources

Scheduling and Partitioning in Apache Spark

Concept of Scheduling and Partitioning in Spark
Hash partition and range partition
Scheduling applications
Static partitioning and dynamic sharing
Concept of Fair scheduling
Map partition with index and Zip
High Availability
Single-node Recovery with Local File System and High Order Functions

Got a question for us? Please mention it in the comments section and we will get back to you.
