
Top Spark RDD Interview Questions

What is an RDD?

Answer:

RDD is the acronym for Resilient Distributed Datasets: a fault-tolerant collection of operational elements that run in parallel.

What are the key features of Apache Spark?

Answer:

  • Spark integrates with Hadoop and can process files stored in HDFS.
  • It has an independent language (Scala) interpreter and hence comes with an interactive language shell.
  • It consists of RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.
  • It supports multiple analytic tools for interactive query analysis, real-time analysis and graph processing.

Additionally, some of the salient features of Spark include:

Lightning-fast processing: When it comes to Big Data processing, speed always matters, and Spark runs workloads on Hadoop clusters much faster than MapReduce. Spark makes this possible by reducing the number of read/write operations to disk, storing intermediate processing data in memory instead.

Support for sophisticated analytics: In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. This allows users to combine all these capabilities in a single workflow.
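A minimal sketch of the in-memory processing described above, assuming a spark-shell session where sc is the pre-created SparkContext (the file path is illustrative):

```scala
// Load a text file into an RDD (the path is illustrative).
val logs = sc.textFile("hdfs:///data/app-logs.txt")

// Mark the filtered RDD for in-memory caching; nothing runs yet.
val errors = logs.filter(line => line.contains("ERROR")).cache()

// The first action materialises the RDD and stores it in memory...
println(errors.count())

// ...so this second pass reads from memory instead of re-reading the file.
println(errors.filter(_.contains("timeout")).count())
```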

What operations does an RDD support?

Answer:

  • Transformations
  • Actions

What do you understand by Transformations in Spark?

Answer:

“Transformations” are functions applied to an RDD that result in a new RDD. They do not execute until an action occurs. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and produces another RDD, while filter() creates a new RDD by selecting only those elements of the current RDD that satisfy the given predicate.
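A minimal sketch of this lazy behaviour, assuming a spark-shell session with sc available:

```scala
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations: each returns a new RDD and triggers no computation.
val doubled = numbers.map(_ * 2)
val bigOnes = doubled.filter(_ > 4)

// Only this action forces the transformations above to actually run.
println(bigOnes.collect().mkString(", "))  // 6, 8, 10
```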

What is the function of the Spark Engine?

Answer:

The Spark Engine is responsible for scheduling, distributing and monitoring data applications across the cluster.

What do you understand by RDD?

Answer:

RDD stands for Resilient Distributed Dataset: a collection of fault-tolerant operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed in nature.

What are the functions of Spark Core?

Answer:

“SparkCore” performs an array of critical functions such as memory management, job monitoring, fault tolerance, job scheduling and interaction with storage systems.

It is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic input and output functionality. RDDs make Spark Core fault tolerant: an RDD is a collection of items distributed across many nodes that can be manipulated in parallel, and Spark Core provides many APIs for building and manipulating these collections.
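A minimal sketch of building and manipulating such a collection through the core RDD API, assuming a spark-shell session with sc available:

```scala
// Distribute a local collection across the cluster as an RDD.
val words = sc.parallelize(Seq("spark", "core", "rdd", "spark"))

// Manipulate the items in parallel; the source RDD itself is immutable.
val lengths = words.map(word => (word, word.length))

println(lengths.collect().mkString(", "))  // (spark,5), (core,4), ...
```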

How is fault tolerance achieved in Spark?

Answer: Spark does not support data replication in memory. In the event of any data loss, the data is rebuilt using the “RDD Lineage”, a process that reconstructs lost data partitions.

What is a Spark Driver?

Answer:

The “Spark Driver” is the program that runs on the master node and declares transformations and actions on RDDs of data. The driver also delivers the RDD graphs to the “Master”, where the standalone cluster manager runs.
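A minimal sketch of a stand-alone driver program (the application name and master URL are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    // The driver declares the configuration and creates the SparkContext.
    val conf = new SparkConf()
      .setAppName("driver-sketch")
      .setMaster("spark://master-host:7077") // standalone cluster manager
    val sc = new SparkContext(conf)

    // Transformations and actions declared here are shipped to executors.
    val data = sc.parallelize(1 to 100)
    println(data.sum())

    sc.stop()
  }
}
```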

What are Accumulators?

Answer: “Accumulators” are Spark’s offline debuggers. Similar to “Hadoop Counters”, accumulators keep a count of the number of “events” in a program.

Accumulators are variables that can only be added to through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections.
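A minimal sketch of counting events with an accumulator, assuming a spark-shell session with sc available (the accumulator name and data are illustrative):

```scala
// A long accumulator, registered with the SparkContext under a name.
val badRecords = sc.longAccumulator("bad-records")

val records = sc.parallelize(Seq("ok", "ok", "corrupt", "ok", "corrupt"))

// Tasks running on the executors add to the accumulator as a side effect.
records.foreach { rec =>
  if (rec == "corrupt") badRecords.add(1)
}

// The driver reads the aggregated count back after the action completes.
println(s"Corrupt records: ${badRecords.value}")  // 2
```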

Which file systems does Spark support?

Answer:

  • Hadoop Distributed File System (HDFS)
  • Local file system
  • Amazon S3
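A minimal sketch of reading from each of these, assuming a spark-shell session with sc available (paths and bucket names are illustrative, and S3 access additionally requires the Hadoop S3 connector and credentials):

```scala
// Hadoop Distributed File System (HDFS)
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/input.txt")

// Local file system (the file must be accessible on every worker node).
val fromLocal = sc.textFile("file:///tmp/input.txt")

// Amazon S3 (assumes the hadoop-aws connector and credentials are set up).
val fromS3 = sc.textFile("s3a://my-bucket/input.txt")
```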

Can Spark be used to access and analyse data stored in Cassandra databases?

Answer: Yes, it is possible if you use the Spark Cassandra Connector.
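A minimal sketch using the DataStax Spark Cassandra Connector (the host, keyspace and table names are illustrative, and the connector library must be on the application’s classpath):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Point Spark at the Cassandra cluster via the connector's config key.
val conf = new SparkConf()
  .setAppName("cassandra-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as an RDD and analyse it with normal RDD ops.
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())
```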

What is YARN?

Answer: “YARN” (Yet Another Resource Negotiator) is a large-scale, distributed operating system for big data applications. Spark can run on YARN, which provides a central resource management platform for delivering scalable operations across the cluster.

What are Pair RDDs?

Answer: Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that aggregates data based on each key and a join() method that combines different RDDs based on the elements having the same key.
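A minimal sketch of both methods, assuming a spark-shell session with sc available:

```scala
val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val prices = sc.parallelize(Seq(("apples", 1.2), ("pears", 0.9)))

// reduceByKey aggregates the values for each key in parallel.
val totals = sales.reduceByKey(_ + _) // ("apples", 8), ("pears", 2)

// join combines two pair RDDs on their matching keys.
val joined = totals.join(prices)      // ("apples", (8, 1.2)), ...

println(joined.collect().mkString(", "))
```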

Which cluster managers does Spark support?

Answer: The Spark framework supports three kinds of cluster managers:

  • Standalone
  • Apache Mesos
  • YARN
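A minimal sketch of selecting each one through the master URL in SparkConf (host names and ports are illustrative):

```scala
import org.apache.spark.SparkConf

// Standalone cluster manager
val standalone = new SparkConf().setMaster("spark://master-host:7077")

// Apache Mesos
val mesos = new SparkConf().setMaster("mesos://mesos-host:5050")

// YARN (the cluster is located via the Hadoop configuration on the classpath)
val yarn = new SparkConf().setMaster("yarn")
```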

What is a Partition?

Answer: A “partition” is a smaller, logical division of data, similar to a “split” in MapReduce. Partitioning is the process of deriving logical units of data in order to speed up data processing.
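A minimal sketch of inspecting and changing an RDD’s partitioning, assuming a spark-shell session with sc available:

```scala
// Explicitly ask for four partitions when creating the RDD.
val data = sc.parallelize(1 to 1000, numSlices = 4)
println(data.getNumPartitions) // 4

// Reshuffle the data into more partitions to increase parallelism.
val widened = data.repartition(8)
println(widened.getNumPartitions) // 8
```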

What is RDD Lineage?

Answer: Spark does not support data replication in memory, so if any data is lost, it is rebuilt using the RDD lineage, a process that reconstructs lost data partitions. The best part is that an RDD always remembers how it was built from other datasets.
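A minimal sketch of inspecting that recorded lineage, assuming a spark-shell session with sc available:

```scala
val base    = sc.parallelize(1 to 10)
val derived = base.map(_ * 2).filter(_ > 5)

// toDebugString prints the chain of parent RDDs used for reconstruction.
println(derived.toDebugString)
```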

What is a Worker Node?

Answer: A “worker node” refers to any node that can run application code in a cluster.

What are Executors?

Answer: When a “SparkContext” connects to a cluster manager, it acquires “Executors” on the cluster’s nodes. Executors are Spark processes that run computations and store the data on the worker nodes. The final tasks from the SparkContext are transferred to the executors for execution.
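A minimal sketch of sizing those executor processes through standard configuration keys (the values are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("executor-sketch")
  .set("spark.executor.memory", "2g")   // memory per executor process
  .set("spark.executor.cores", "2")     // cores per executor
  .set("spark.executor.instances", "4") // number of executors (e.g. on YARN)
```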