Running Spark on Kubernetes (k8s)

Thulasitharan Govindaraj
3 min read · Aug 17, 2022

Hi Folks,

Let’s see how we can run Spark on K8s.

The Spark base utility image was built from the Docker images in this earlier blog post ( https://thulasitharan-gt96.medium.com/how-to-dynamically-pass-commands-to-docker-images-while-running-using-spark-submit-here-fb4f76406cee ). We use that base image here to create separate driver and executor images.

I will be using minikube to run this example. Installing minikube is easy, and the steps are covered in its own documentation. kubectl is used to interact with minikube and must also be installed.
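For reference, a minimal startup on a fresh machine might look like this (the resource flags are illustrative, not from the original post; size them to your machine):

minikube start --cpus 4 --memory 8192   # give the VM enough room for Spark driver + executors
kubectl get nodes                       # verify kubectl can reach the cluster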

The executor and driver images are essentially the same: both have Spark installed and the required application JAR copied inside.

FROM spark-utils:v1
RUN mkdir -p /opt/spark/data-jars/
COPY jar.jar /opt/spark/data-jars/
ENTRYPOINT ["/opt/spark/kubernetes/dockerfiles/spark/entrypoint.sh"]

The ENTRYPOINT is important, as it is required by Spark. The Spark application uses the entrypoint.sh script, which ships with the Spark distribution, to launch the driver and executor containers by itself.

docker build -t spark-driver:v1 .
docker build -t spark-executor:v1 .
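A convenient minikube trick is to run the builds above against minikube’s own Docker daemon, so the images are visible to the cluster without pushing to a registry (a standard minikube feature, not from the original post):

eval $(minikube docker-env)           # point the docker CLI at minikube's daemon for this shell
docker build -t spark-driver:v1 .     # now built straight into minikube's image store
docker build -t spark-executor:v1 .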

You need the below prerequisites.

  • Minikube
  • kubectl
  • kube proxy
  • spark binaries
  • A service account to be specifically configured for spark with required permissions

Service account for Spark:

kubectl create serviceaccount spark # creation
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default # RBAC assignment
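To sanity-check the RBAC assignment, you can ask the API server whether the new account is allowed to create pods (a standard kubectl check, not from the original post):

kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default
# expected output: yes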

Getting k8s URL:

kubectl cluster-info
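On minikube the output typically contains a line like the one below; the exact IP and port vary per environment (the address shown is only an example):

Kubernetes control plane is running at https://192.168.49.2:8443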

Use the IP and port from the command above as the master URL and run spark-submit:

spark-submit \
  --master k8s://https://IP:PORT \
  --deploy-mode cluster \
  --name test-run-spark \
  --class [class] \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=spark-executor:v1 \
  --conf spark.kubernetes.driver.container.image=spark-driver:v1 \
  --conf spark.kubernetes.executor.container.image=spark-executor:v1 \
  --conf spark.kubernetes.driver.pod.name=test-run-spark \
  --conf spark.kubernetes.executor.podNamePrefix=test-run-spark \
  --conf spark.kubernetes.driver.podTemplateFile=test-driver-spark.yml \
  --conf spark.kubernetes.executor.podTemplateFile=test-executor-spark.yaml \
  --conf spark.kubernetes.driver.volumes.hostPath.some-volume.mount.path=[PathInsidePod] \
  --conf spark.kubernetes.driver.volumes.hostPath.some-volume.options.path=[NodePath] \
  --conf spark.kubernetes.executor.volumes.hostPath.some-volume.mount.path=[PathInsidePod] \
  --conf spark.kubernetes.executor.volumes.hostPath.some-volume.options.path=[NodePath] \
  --num-executors 1 \
  --executor-memory 512m \
  --executor-cores 2 \
  --driver-memory 512m \
  --driver-cores 2 \
  local:///opt/spark/data-jars/jar.jar [Args]

Note that mount.path is the path inside the pod, while options.path is the path on the node (the minikube VM) that backs the hostPath volume.
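After submission, the driver pod is created in the target namespace and it, in turn, requests the executor pods. You can follow the run with plain kubectl (standard commands, not from the original post):

kubectl get pods -n default -w    # watch the driver and executor pods come up
kubectl logs -f test-run-spark    # stream the driver log; this pod name was set via spark.kubernetes.driver.pod.name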

I have provided pod templates because this Spark application writes output to a location, and I want that output to land on my local system from the pod. We can achieve this using volume mounts. A mount name must be given so that the spark-submit parameters can reference the path at which the volume is exposed inside the pod; the Spark configuration keys above must contain that volume mount’s name (some-volume here).

Template file:

apiVersion: v1
kind: Pod
metadata:
  name: test-driver-spark # change for the executor template
spec:
  containers:
    - image: spark-driver:v1 # change for the executor template
      name: test-driver # change for the executor template
      volumeMounts:
        - mountPath: [PathInsidePod] # must match mount.path in the spark-submit conf
          name: some-volume # this name is substituted into the Spark conf
  volumes:
    - name: some-volume # this name is substituted into the Spark conf
      hostPath:
        path: [NodePath] # mount path inside the minikube VM

The template file can also be used to set other pod-level Kubernetes-specific configurations.

Expose the local path to minikube.

minikube mount [LocalPath]:[MinikubePath] # Don't kill this until the job is done
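Before submitting the job, you can confirm the mount is actually visible inside the minikube VM (a quick check using the same placeholders; minikube ssh is a standard subcommand):

minikube ssh -- ls [MinikubePath]   # should list the contents of [LocalPath]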

More info on this : https://jaceklaskowski.github.io/spark-kubernetes-book/demo/spark-and-local-filesystem-in-minikube/

Different kinds of filesystems can also be exposed to Spark inside K8s: https://spark.apache.org/docs/latest/running-on-kubernetes.html

Now the Spark application will be able to read from and write to the local FS at the provided path.

If your minikube environment is not able to pull images from the Docker registry but your local system can, you either need to configure the proxy for minikube specifically, or pull the image locally and load it into minikube using the command below.

minikube image load [imageName]
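For this example that would be (the verification line uses minikube image ls, a standard subcommand):

minikube image load spark-driver:v1
minikube image load spark-executor:v1
minikube image ls | grep spark      # confirm both images are now available to the cluster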
