Running Sparkling Water in Kubernetes

Sparkling Water can run inside a Kubernetes cluster. Kubernetes is supported starting with Spark version 2.4.

Before we start, please check the following:

  1. Make sure we are familiar with running Spark on Kubernetes, as described in the Spark Kubernetes documentation.

  2. Ensure that we have a working Kubernetes cluster and kubectl installed.

  3. Ensure that SPARK_HOME points to the home directory of our Spark distribution, version 2.4.8.

  4. Run kubectl cluster-info to obtain the Kubernetes master URL.

  5. Have an internet connection so that Kubernetes can download the Sparkling Water Docker images.

  6. If we have non-default network policies applied to the namespace where Sparkling Water is supposed to run, make sure that the following ports are exposed: all Spark ports, plus ports 54321 and 54322, which H2O needs in order to communicate.
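
The master URL from step 4 is passed to Spark with a k8s:// prefix, which is what the KUBERNETES_ENDPOINT placeholder in the commands on this page stands for. A minimal sketch of that step, assuming kubectl is configured for the target cluster. The function name spark_master_url is hypothetical, and the kubectl config view query is just one common way to read the API server address, not something this guide prescribes:

```shell
# Hypothetical helper: prefix an API server URL with k8s:// for --master.
spark_master_url() {
  printf 'k8s://%s\n' "$1"
}

# If kubectl is available, read the API server address of the current
# context (assumption: the current context points at the target cluster).
if command -v kubectl >/dev/null 2>&1; then
  api_server=$(kubectl config view --minify \
    -o jsonpath='{.clusters[0].cluster.server}' 2>/dev/null || true)
  if [ -n "$api_server" ]; then
    spark_master_url "$api_server"
  fi
fi
```

The printed value can then be used as the argument of --master.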

The examples below use the default Kubernetes namespace, which we enable for Spark as follows:

kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default

We can also use a different namespace setup for Spark. In that case, do not forget to pass --conf spark.kubernetes.authenticate.driver.serviceAccountName=serviceName to our Spark commands.
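
For a non-default namespace, the setup might look like the sketch below. The namespace spark-jobs, the service account spark, and both function names are hypothetical. spark.kubernetes.namespace is the standard Spark option for selecting the namespace. The KUBECTL variable exists only so that the commands can be previewed with KUBECTL=echo before running them for real:

```shell
KUBECTL=${KUBECTL:-kubectl}

# Create a namespace and a service account, then grant the account the
# edit role (mirrors the clusterrolebinding command used for "default").
setup_spark_namespace() {
  ns=$1
  sa=$2
  $KUBECTL create namespace "$ns"
  $KUBECTL create serviceaccount "$sa" --namespace="$ns"
  $KUBECTL create clusterrolebinding "$ns-$sa" --clusterrole=edit \
    --serviceaccount="$ns:$sa" --namespace="$ns"
}

# Print the extra options that the Spark commands need in this setup.
namespace_conf() {
  printf '%s\n' \
    "--conf spark.kubernetes.namespace=$1" \
    "--conf spark.kubernetes.authenticate.driver.serviceAccountName=$2"
}

namespace_conf spark-jobs spark
```

Running KUBECTL=echo setup_spark_namespace spark-jobs spark prints the three kubectl invocations without touching the cluster.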

Internal Backend

With the internal backend of Sparkling Water, we need to pass the option spark.scheduler.minRegisteredResourcesRatio=1 to our Spark job invocation. This ensures that Spark waits for all requested resources, so that Sparkling Water starts H2O on all requested executors.

Dynamic allocation must be disabled in Spark.
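
Both requirements of the internal backend can be expressed as Spark options. A minimal sketch, where internal_backend_conf is a hypothetical helper. spark.dynamicAllocation.enabled is the standard Spark switch for dynamic allocation; it is off by default, so setting it explicitly only guards against a cluster-wide override:

```shell
# Print the options the internal backend relies on.
# $1 = number of executors (and therefore H2O nodes) to request.
internal_backend_conf() {
  printf '%s\n' \
    "--conf spark.scheduler.minRegisteredResourcesRatio=1" \
    "--conf spark.dynamicAllocation.enabled=false" \
    "--conf spark.executor.instances=$1"
}

internal_backend_conf 3
```

Because none of the values contain spaces, the output can be spliced into a command unquoted, for example $SPARK_HOME/bin/spark-submit $(internal_backend_conf 3) followed by the remaining options.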


Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
 --master "k8s://KUBERNETES_ENDPOINT" \
 --deploy-mode client \
 --conf spark.scheduler.minRegisteredResourcesRatio=1 \
 --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
 --conf spark.executor.instances=3 \
 --conf spark.driver.host=sparkling-water-app \
 --conf spark.kubernetes.driver.pod.name=sparkling-water-app

  4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321
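
If local port 54321 is already in use, a different local port can be forwarded to the same pod port. A small sketch, where forward_flow and the KUBECTL override (handy for previewing the command with KUBECTL=echo) are hypothetical conveniences rather than part of Sparkling Water:

```shell
KUBECTL=${KUBECTL:-kubectl}

# Forward a local port (default 54321) to port 54321 of the driver pod.
# $1 = driver pod name, $2 = optional local port.
forward_flow() {
  pod=$1
  local_port=${2:-54321}
  $KUBECTL port-forward "$pod" "$local_port:54321"
}
```

While forward_flow sparkling-water-app 8080 is running, Flow should be reachable at http://localhost:8080.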

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

Manual Mode of External Backend

The Sparkling Water external backend can also be used in Kubernetes. First, we need to start an external H2O backend on Kubernetes. To do so, follow the steps in the H2O on Kubernetes documentation, with one important exception: the image must be h2oai/sparkling-water-external-backend:3.38.0.2-1-2.4 rather than the base H2O image mentioned in the H2O documentation, because Sparkling Water enhances the H2O image with additional dependencies.

For Sparkling Water to connect to the H2O cluster, we need the address of the leader node of the H2O cluster. If we followed the H2O documentation on how to start an H2O cluster on Kubernetes, the address is h2o-service.default.svc.cluster.local:54321, where the first part is the H2O headless service name and the second part is the name of the namespace.
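
The address follows the usual Kubernetes service DNS pattern, so it can be assembled from its parts. A minimal sketch, assuming the cluster uses the default cluster.local domain; h2o_leader_address is a hypothetical helper name:

```shell
# Build <service>.<namespace>.svc.cluster.local:<port>.
# $1 = headless service name, $2 = namespace (default: default),
# $3 = H2O port (default: 54321).
h2o_leader_address() {
  service=$1
  namespace=${2:-default}
  port=${3:-54321}
  printf '%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

h2o_leader_address h2o-service default
```

The result is the value expected by the spark.ext.h2o.cloud.representative option.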

Once the external H2O backend is running, we can connect to it from Sparkling Water clients as shown in the examples that follow.


Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root

  4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

Automatic Mode of External Backend

In automatic mode, Sparkling Water starts the external H2O backend on Kubernetes automatically. The requirement is that the driver node is configured to communicate with the Kubernetes cluster. The Docker image for the external H2O backend is specified using the spark.ext.h2o.external.k8s.docker.image option.
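
The options that switch Sparkling Water into automatic mode can be collected in one place. A minimal sketch using only the option names that appear in the commands on this page; auto_backend_conf is a hypothetical helper:

```shell
# Print the options for the automatic external backend on Kubernetes.
# $1 = number of H2O nodes, $2 = memory per H2O node, $3 = backend image.
auto_backend_conf() {
  size=$1
  memory=$2
  image=$3
  printf '%s\n' \
    "--conf spark.ext.h2o.backend.cluster.mode=external" \
    "--conf spark.ext.h2o.external.start.mode=auto" \
    "--conf spark.ext.h2o.external.auto.start.backend=kubernetes" \
    "--conf spark.ext.h2o.external.cluster.size=$size" \
    "--conf spark.ext.h2o.external.memory=$memory" \
    "--conf spark.ext.h2o.external.k8s.docker.image=$image"
}

auto_backend_conf 2 2G h2oai/sparkling-water-external-backend:3.38.0.2-1-2.4
```

Keeping the image tag as a single argument makes it harder for the Spark image and the external backend image to drift to different Sparkling Water versions.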


Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.2-1-2.4 \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.2-1-2.4

  4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.2-1-2.4 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.2-1-2.4 \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar