Spark performs several optimizations, such as “pipelining” map transformations together to merge them, and converts the execution graph into a set of stages. Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.
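To see what “pipelining” buys you, consider a language-level analogy in plain Python (this is not Spark’s API, just an illustration): fusing consecutive map steps composes their functions, so each element flows through all of them in one pass instead of producing an intermediate collection per step.

```python
# Conceptual analogy, not Spark code: composing consecutive map functions
# so each element is processed in a single pass.
def pipeline(*funcs):
    """Compose single-argument functions, applied left to right."""
    def composed(x):
        for f in funcs:
            x = f(x)
        return x
    return composed

# Two logical map steps fused into one pass over the data:
fused = pipeline(lambda x: x + 1, lambda x: x * 2)
result = [fused(x) for x in [1, 2, 3]]
print(result)  # [4, 6, 8]
```

This is the essence of what Spark does when it merges adjacent map transformations into a single stage.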
Scheduling tasks on executors
Given a physical execution plan, a Spark driver must coordinate the scheduling of individual tasks on executors. When executors are started they register themselves with the driver, so it has a complete view of the application’s executors at all times. Each executor represents a process capable of running tasks and storing RDD data.
Launching a Program
No matter which cluster manager you use, Spark provides a single script, called spark-submit, that you can use to submit your program to it. Through various options, spark-submit can connect to different cluster managers and control how many resources your application gets. For some cluster managers, spark-submit can run the driver within the cluster (e.g., on a YARN worker node), while for others, it can run it only on your local machine. We’ll cover spark-submit in more detail in the next section.
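As a quick preview, a submission to a YARN cluster might look like the following sketch. The application JAR, main class, and resource values here are placeholders for illustration, not from the text:

```shell
# Illustrative spark-submit invocation (paths, class name, and resource
# values are placeholders). --master selects the cluster manager, and
# --deploy-mode cluster asks for the driver to run inside the cluster
# on managers that support it.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --executor-memory 2g \
  --num-executors 4 \
  my-app.jar
```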
3. The driver program contacts the cluster manager to ask for resources to launch executors.
4. The cluster manager launches executors on behalf of the driver program.