What is Airflow?
Airflow is an open-source platform used to schedule and monitor workflows. It provides a web interface where you can trigger, stop, and monitor DAG runs, and see the logs and status of each task.
A DAG (Directed Acyclic Graph) is a set of tasks, organized by their dependencies, that need to be executed. Airflow provides integrations with many different platforms.
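As a reference, here is a minimal sketch of a DAG with two dependent tasks; the DAG id, schedule, and commands are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG: "transform" runs only after "extract" succeeds.
with DAG(
    dag_id="example_minimal_dag",      # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")

    extract >> transform  # the dependency that makes this a graph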
Airflow can be installed in a Kubernetes cluster, where the different components needed for Airflow (scheduler, webserver, workers) run as independent pods.
What is Spark?
Apache Spark is a framework for large-scale data processing in which you can write your code in Java, Scala, Python, or R.
Spark also provides additional tools such as Spark SQL for structured data and SQL queries, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
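For example, a minimal PySpark job using the DataFrame and Spark SQL APIs could look like this (the application name and sample data are placeholders):

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the app name is a placeholder.
spark = SparkSession.builder.appName("example-app").getOrCreate()

# Build a small DataFrame and query it with Spark SQL.
df = spark.createDataFrame([("a", 1), ("b", 2), ("b", 3)], ["key", "value"])
df.createOrReplaceTempView("events")
spark.sql("SELECT key, SUM(value) AS total FROM events GROUP BY key").show()

spark.stop()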
Spark Cluster
Spark applications run on a cluster as independent sets of processes, coordinated by the SparkContext, which can connect to different types of cluster managers.
Spark uses a master/worker architecture, where the worker nodes are in charge of executing the tasks and the cluster manager is in charge of allocating the resources needed to process the work. Here's how a Spark application is executed (a short PySpark sketch follows the list):
Spark Context: connects to a Spark execution environment.
Cluster Manager:
- Schedules the Spark application.
- Allocates resources to the driver program to run tasks.
- Splits the work into tasks and distributes them across the worker nodes.
Worker Node: executes the tasks assigned to it.
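As an illustration, this is roughly how a driver program creates a SparkContext that points at a cluster manager; the application name and master URL below are placeholders (a standalone master here, but it could also be yarn or a k8s:// URL):

from pyspark import SparkConf, SparkContext

# The master URL tells the SparkContext which cluster manager to connect to.
conf = (
    SparkConf()
    .setAppName("example-app")                # placeholder application name
    .setMaster("spark://spark-master:7077")   # placeholder standalone master URL
)
sc = SparkContext(conf=conf)

# The cluster manager allocates executors on the worker nodes; the job below
# is split into tasks that run on those executors.
print(sc.parallelize(range(1000)).sum())

sc.stop()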
There are different cluster manager types for running a Spark cluster. You can use the built-in standalone cluster manager, which is useful for creating a small cluster when you only have a Spark workload.
When you need to create a bigger cluster, it's better to use a more complete cluster manager that also solves problems like scheduling and monitoring the applications.
The following cluster manager types are available:
- Hadoop YARN – a resource manager to run applications.
- Apache Mesos – a general cluster manager to run frameworks.
- Kubernetes – used to deploy and manage containerized applications.
Running Spark on Kubernetes
Since we already have a Kubernetes cluster for Airflow, it makes sense to run everything in the same cluster. This is how Spark on Kubernetes works (a configuration sketch follows the list):
- The application is submitted with spark-submit from outside or inside the cluster.
- The request goes to the API Server (Kubernetes master).
- A pod is created for the Spark driver.
- The Spark driver requests executor pods to run the tasks.
- The executor pods start and run the application code.
- When the application finishes, the executor pods are terminated, but the driver pod persists the logs so you can track the results.
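To make this flow concrete, these are the kinds of Spark properties involved; they are normally passed as --conf flags to spark-submit (or, later in this article, inside the conf of a Livy request). The image, namespace, and service account names are placeholders:

# Placeholder values: adjust the image, namespace, and service account to your cluster.
spark_k8s_conf = {
    # Executor (and driver) pods are created from this image; see the Docker Image section.
    "spark.kubernetes.container.image": "my-registry/spark-app:latest",
    # Namespace where the driver and executor pods are created.
    "spark.kubernetes.namespace": "spark",
    # Service account the driver uses to request executor pods from the API server.
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
}
# The spark-submit master URL would point at the Kubernetes API server,
# for example k8s://https://<api-server-host>:443 (cluster-specific).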
Docker Image
To run a task in Kubernetes, you need to provide a Docker image that will be executed in the executor pods. The Spark distribution contains a Dockerfile to build a base image, or you can create your own custom image.
When you run the application, you can set the number of executors you would like to use.
If dynamic allocation is enabled, you can let Kubernetes start executors on demand.
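As an illustration, and assuming a Spark 3.x cluster, a fixed number of executors or dynamic allocation can be requested with properties like these (the numbers are arbitrary):

# Option 1: a fixed number of executors.
static_conf = {
    "spark.executor.instances": "3",
}

# Option 2: scale executors up and down on demand.
# On Kubernetes there is no external shuffle service, so shuffle tracking is
# enabled so that executors holding shuffle data are not removed too early.
dynamic_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
}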
What is Apache Livy?
Apache Livy is a REST service for submitting Spark jobs to Spark clusters (a sample REST call follows the list below). With Apache Livy you can:
- Enjoy easy submission of Spark jobs.
- Configure security via authentication.
- Share cache and RDDs between Spark jobs.
- Manage multiple Spark contexts.
- Utilize a web interface to track jobs.
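As an illustration of the REST API, a batch job can be submitted and polled with plain HTTP calls like the following (the Livy URL, application file, and arguments are placeholders):

import time

import requests

LIVY_URL = "http://apache-livy:8998"  # placeholder: the Livy service host and port

# Submit a batch job; "file" points to the application inside the Spark image.
batch = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "local:///opt/spark/examples/src/main/python/pi.py",  # placeholder path
        "args": ["100"],
        "conf": {"spark.executor.instances": "2"},
    },
).json()

# Poll the batch state until the job finishes.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()["state"]
    print("batch state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)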
Architecture
From Airflow, tasks can be executed in the Spark Cluster using Apache Livy.
Deployment in EKS
Airflow
In this article you can find the instructions to deploy Airflow in EKS, using this repo. You will need to use the EFS CSI driver for the persistent volume, as it supports read-write access from multiple nodes at the same time.
Spark
The Spark cluster runs in the same Kubernetes cluster and shares the volume to store intermediate results. The Spark nodes are created on demand, as independent pods in the cluster.
Installing Apache Livy
In order to install Apache Livy, you will need to use this repo and complete the steps described there.
Livy Operator
To support the Livy operator in Airflow, you will need the [.c-inline-code]apache-airflow-providers-apache-livy[.c-inline-code] provider package as a dependency, as described here.
To support this, the image [.c-inline-code]rootstrap/eks-airflow:2.1.2[.c-inline-code] (located here) is created via this repo.
As a side note, if you want to add more dependencies, just add them to that Dockerfile.
Then build and push the image, and update the image name and version in the [.c-inline-code]values.yaml[.c-inline-code] file here.
You can do this by setting the values [.c-inline-code]defaultAirflowRepository[.c-inline-code] and [.c-inline-code]defaultAirflowTag[.c-inline-code] to the new image name and tag.
After modifying [.c-inline-code]values.yaml[.c-inline-code], upgrade the Helm chart so that Airflow picks up the new image.
Livy Connection
You can configure an Apache Livy connection in the Connections section of the Airflow web console.
When doing so, add a Connection with these parameters:
- Conn Id: [.c-inline-code]livy_conn_id[.c-inline-code]
- Conn Type: [.c-inline-code]Apache Livy[.c-inline-code]
- Description: [.c-inline-code]Apache Livy REST API[.c-inline-code]
- Host: the ClusterIP of the apache-livy service, which you can get by executing [.c-inline-code]kubectl get services | grep apache-livy | awk '{print $3}'[.c-inline-code]
- Port: [.c-inline-code]8998[.c-inline-code]
From the DAG you can create a task that invokes Apache Livy using the configured connection. You can see an example of this DAG here.
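For reference, a minimal sketch of such a task using the Livy provider's operator could look like this; the DAG id, application file, and polling interval are placeholders, and [.c-inline-code]livy_conn_id[.c-inline-code] matches the connection configured above:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG(
    dag_id="spark_via_livy",          # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_spark_job = LivyOperator(
        task_id="submit_spark_job",
        livy_conn_id="livy_conn_id",  # the connection created in the Airflow UI
        file="local:///opt/spark/examples/src/main/python/pi.py",  # placeholder application
        args=["100"],
        polling_interval=30,          # seconds between Livy status checks
    )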
Conclusion
With this architecture, we achieved the following results:
- Isolation of tasks.
- Horizontal and vertical scalability.
- Faster processing with Spark.
- Traceability.
Thank you for reading and stay tuned for more similar content.