Running Sparkling Water in Kubernetes

Sparkling Water can run inside a Kubernetes cluster. Kubernetes is supported starting with Spark version 2.4.

Before we start, please check the following:

  1. Make sure we are familiar with running Spark on Kubernetes, as described in the Spark Kubernetes documentation.

  2. Ensure that we have a working Kubernetes cluster and kubectl installed.

  3. Ensure that SPARK_HOME points to the home directory of our Spark distribution, version 3.0.3.

  4. Run kubectl cluster-info to obtain the Kubernetes master URL.

  5. Have an internet connection so that Kubernetes can download the Sparkling Water Docker images.

  6. If non-default network policies are applied to the namespace where Sparkling Water is supposed to run, make sure that the following ports are exposed: all Spark ports, and ports 54321 and 54322, which H2O needs in order to communicate.
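Step 4 can be scripted. The sketch below assumes the usual "... is running at <URL>" wording of the kubectl cluster-info output; it extracts the first such URL and prefixes it with k8s:// to form the value passed to --master. The function name and the sample address are illustrative, not part of Sparkling Water.

```python
import re

# kubectl may colorize its output; strip ANSI color codes before parsing.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def spark_master_from_cluster_info(cluster_info: str) -> str:
    """Return a Spark master URL (k8s://...) from `kubectl cluster-info` text."""
    plain = ANSI_ESCAPE.sub("", cluster_info)
    # The first "is running at" line is assumed to describe the API server.
    match = re.search(r"is running at (https?://\S+)", plain)
    if match is None:
        raise ValueError("no API server URL found in cluster-info output")
    return "k8s://" + match.group(1)


# Sample output with a made-up address.
sample = (
    "Kubernetes control plane is running at https://192.168.49.2:8443\n"
    "CoreDNS is running at https://192.168.49.2:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy\n"
)
print(spark_master_from_cluster_info(sample))  # prints k8s://https://192.168.49.2:8443
```

The resulting string takes the place of k8s://KUBERNETES_ENDPOINT in the commands that follow.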

The examples below use the default Kubernetes namespace, which we enable for Spark as follows:

kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default

We can also use a different namespace setup for Spark. In that case, don’t forget to pass --conf spark.kubernetes.authenticate.driver.serviceAccountName=serviceName to our Spark commands.
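As a rough illustration of such a setup, the sketch below builds the role-binding command and the matching Spark option for a custom namespace and service account. The namespace "spark-jobs", the account "spark" and the binding name are hypothetical placeholders; only the option names are taken from the text above.

```python
def namespace_setup(namespace: str, service_account: str, binding_name: str):
    """Return (kubectl argv, spark conf args) for a non-default namespace."""
    kubectl_cmd = [
        "kubectl", "create", "clusterrolebinding", binding_name,
        "--clusterrole=edit",
        # kubectl expects the service account as <namespace>:<name>.
        f"--serviceaccount={namespace}:{service_account}",
        f"--namespace={namespace}",
    ]
    spark_conf = [
        "--conf",
        f"spark.kubernetes.authenticate.driver.serviceAccountName={service_account}",
    ]
    return kubectl_cmd, spark_conf


# Hypothetical names, for illustration only.
kubectl_cmd, spark_conf = namespace_setup("spark-jobs", "spark", "spark-jobs-edit")
print(" ".join(kubectl_cmd))
print(" ".join(spark_conf))
```

The second list is meant to be appended to the spark-submit, spark-shell or pyspark invocations shown in the rest of this page.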

Internal Backend

When using the internal backend of Sparkling Water, we need to pass the option spark.scheduler.minRegisteredResourcesRatio=1 to our Spark job invocation. This makes Spark wait for all requested resources, so that Sparkling Water starts H2O on all requested executors.

Dynamic allocation must be disabled in Spark.
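The two requirements above can be checked before a job is submitted. The following is a minimal sketch, assuming the job configuration is available as a plain dictionary; the function name is made up for this example, and spark.dynamicAllocation.enabled is the standard Spark property that controls dynamic allocation.

```python
def check_internal_backend_conf(conf: dict) -> list:
    """Return a list of problems found in a Spark conf for the internal backend."""
    problems = []

    # Spark must wait for all executors before the job starts.
    ratio = conf.get("spark.scheduler.minRegisteredResourcesRatio")
    if ratio is None or float(ratio) != 1.0:
        problems.append("spark.scheduler.minRegisteredResourcesRatio must be 1")

    # Dynamic allocation is treated as off unless it is explicitly enabled.
    dynamic = str(conf.get("spark.dynamicAllocation.enabled", "false")).lower()
    if dynamic == "true":
        problems.append("spark.dynamicAllocation.enabled must not be true")

    return problems


good = {"spark.scheduler.minRegisteredResourcesRatio": "1",
        "spark.executor.instances": "3"}
bad = {"spark.dynamicAllocation.enabled": "true"}

print(check_internal_backend_conf(good))  # prints []
print(len(check_internal_backend_conf(bad)))  # prints 2
```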

Scala

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
 --master "k8s://KUBERNETES_ENDPOINT" \
 --deploy-mode client \
 --conf spark.scheduler.minRegisteredResourcesRatio=1 \
 --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
 --conf spark.executor.instances=3 \
 --conf spark.driver.host=sparkling-water-app \
 --conf spark.kubernetes.driver.pod.name=sparkling-water-app

  4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

Python

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Python job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
local:///opt/sparkling-water/tests/initTest.py
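When submissions are automated, the same command can be assembled as an argument list instead of a shell string. This is only a sketch that mirrors the options shown above; the helper name and the /opt/spark location are assumptions for the example.

```python
def cluster_submit_command(spark_home, master, image, app_resource,
                           executors=3, app_name="sparkling-water-app"):
    """Build the argv list for a cluster-mode spark-submit (internal backend)."""
    conf = {
        "spark.scheduler.minRegisteredResourcesRatio": "1",
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        "spark.driver.host": app_name,
        "spark.kubernetes.driver.pod.name": app_name,
    }
    argv = [f"{spark_home}/bin/spark-submit",
            "--master", master,
            "--deploy-mode", "cluster"]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_resource)
    return argv


command = cluster_submit_command(
    spark_home="/opt/spark",  # assumption: replace with the real SPARK_HOME
    master="k8s://KUBERNETES_ENDPOINT",
    image="h2oai/sparkling-water-python:3.34.0.4-1-3.0",
    app_resource="local:///opt/sparkling-water/tests/initTest.py",
)
print(" ".join(command))
```

The list can be handed to subprocess.run(command, check=True) once the placeholder endpoint has been replaced with a real one.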

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/pyspark \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app

  4. Inside the shell, run:

from pysparkling import *
hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321
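Once port-forwarding is running, a quick way to confirm that the forwarded port accepts connections is a plain TCP check. The sketch below uses only the Python standard library; the function name is illustrative, and the check verifies that something is listening on the port, not that Flow itself is healthy.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 54321,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Port 54321 is reachable; Flow should be at http://localhost:54321")
    else:
        print("Port 54321 is not reachable; is kubectl port-forward running?")
```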

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
local:///opt/sparkling-water/tests/initTest.py

R

First, make sure that RSparkling is installed on the node from which we want to run it. We can install RSparkling as follows:

# Download, install, and initialize the H2O package for R.
# In this case we are using rel-zizler 4 (3.34.0.4)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zizler/4/R")

# Download, install, and initialize RSparkling
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.34.0.4-1-3.0/R")

To start H2OContext in an interactive shell, run the following code in R or RStudio:

library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
                 image = "h2oai/sparkling-water-r:3.34.0.4-1-3.0",
                 account = "default",
                 executors = 3,
                 conf = list("spark.kubernetes.file.upload.path"="file:///tmp"),
                 version = "3.0.3",
                 ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)

We can also submit an RSparkling batch job. In that case, create a file called batch.R with the content from the code box above and run:

Rscript --default-packages=methods,utils batch.R

Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.

Manual Mode of External Backend

The Sparkling Water external backend can also be used in Kubernetes. First, we need to start an external H2O backend on Kubernetes. To achieve this, follow the steps in the H2O on Kubernetes documentation, with one important exception: the image must be h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0 rather than the base H2O image mentioned in the H2O documentation, because Sparkling Water enhances the H2O image with additional dependencies.

For Sparkling Water to connect to the H2O cluster, we need the address of the leader node of the H2O cluster. If we followed the H2O documentation on how to start an H2O cluster on Kubernetes, the address is h2o-service.default.svc.cluster.local:54321, where the first part is the H2O headless service name and the second part is the name of the namespace.
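For a service or namespace with different names, the address follows the same pattern. A small sketch (the helper name and the "ml-team" namespace are made up for the example):

```python
def h2o_leader_address(service: str = "h2o-service",
                       namespace: str = "default",
                       port: int = 54321) -> str:
    """Compose the in-cluster DNS address of the H2O headless service."""
    return f"{service}.{namespace}.svc.cluster.local:{port}"


print(h2o_leader_address())
# prints h2o-service.default.svc.cluster.local:54321
print(h2o_leader_address(namespace="ml-team"))
# prints h2o-service.ml-team.svc.cluster.local:54321
```

The returned value is what the examples below pass as spark.ext.h2o.cloud.representative.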

After creating the external H2O backend, we can connect to it from Sparkling Water clients as follows:

Scala

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root

  4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

Python

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Python job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.py

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/pyspark \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root

  4. Inside the shell, run:

from pysparkling import *
hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.py
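The manual-mode options repeated in the commands above can be kept in one place and turned into --conf arguments. A minimal sketch, using only the option names and values shown above; the helper names are hypothetical.

```python
def manual_external_backend_conf(representative: str,
                                 cloud_name: str = "root",
                                 memory: str = "2G") -> dict:
    """Options that point Sparkling Water at an already running H2O cluster."""
    return {
        "spark.ext.h2o.backend.cluster.mode": "external",
        "spark.ext.h2o.external.start.mode": "manual",
        "spark.ext.h2o.external.memory": memory,
        "spark.ext.h2o.cloud.representative": representative,
        "spark.ext.h2o.cloud.name": cloud_name,
    }


def as_conf_args(conf: dict) -> list:
    """Flatten a conf dictionary into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


manual_conf = manual_external_backend_conf(
    "h2o-service.default.svc.cluster.local:54321")
for arg in as_conf_args(manual_conf):
    print(arg)
```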

R

First, make sure that RSparkling is installed on the node from which we want to run it. We can install RSparkling as follows:

# Download, install, and initialize the H2O package for R.
# In this case we are using rel-zizler 4 (3.34.0.4)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zizler/4/R")

# Download, install, and initialize RSparkling
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.34.0.4-1-3.0/R")

To start H2OContext in an interactive shell, run the following code in R or RStudio:

library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
                 image = "h2oai/sparkling-water-r:3.34.0.4-1-3.0",
                 account = "default",
                 executors = 2,
                 version = "3.0.3",
                 conf = list(
                         "spark.ext.h2o.backend.cluster.mode"="external",
                         "spark.ext.h2o.external.start.mode"="manual",
                         "spark.ext.h2o.external.memory"="2G",
                         "spark.ext.h2o.cloud.representative"="h2o-service.default.svc.cluster.local:54321",
                         "spark.ext.h2o.cloud.name"="root",
                         "spark.kubernetes.file.upload.path"="file:///tmp"),
                 ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)

We can also submit an RSparkling batch job. In that case, create a file called batch.R with the content from the code box above and run:

Rscript --default-packages=methods,utils batch.R

Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.

Automatic Mode of External Backend

In the automatic mode, Sparkling Water starts the external H2O backend on Kubernetes automatically. This requires that the driver node is configured to communicate with the Kubernetes cluster. The Docker image for the external H2O backend is specified using the spark.ext.h2o.external.k8s.docker.image option.
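The options that select the automatic mode, as used in the examples that follow, can be sketched as a small helper. The function name and the size check are assumptions made for illustration; the option names and default values are the ones used throughout this section.

```python
def auto_external_backend_conf(image: str,
                               cluster_size: int = 2,
                               memory: str = "2G") -> dict:
    """Options that let Sparkling Water start the H2O backend on Kubernetes."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return {
        "spark.ext.h2o.backend.cluster.mode": "external",
        "spark.ext.h2o.external.start.mode": "auto",
        "spark.ext.h2o.external.auto.start.backend": "kubernetes",
        "spark.ext.h2o.external.cluster.size": str(cluster_size),
        "spark.ext.h2o.external.memory": memory,
        "spark.ext.h2o.external.k8s.docker.image": image,
    }


auto_conf = auto_external_backend_conf(
    "h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0")

# Print the options in the form used on the command line.
for key, value in auto_conf.items():
    print(f"--conf {key}={value}")
```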

Scala

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0 \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0

  4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0 \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar

Python

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Python job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0 \
local:///opt/sparkling-water/tests/initTest.py

To start an interactive shell in client mode:

  1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

  2. Start the pod from which we run the shell:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 -- /bin/bash

  3. Inside the container, start the shell:

$SPARK_HOME/bin/pyspark \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0

  4. Inside the shell, run:

from pysparkling import *
hc = H2OContext.getOrCreate()

  5. To access Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.34.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0 \
local:///opt/sparkling-water/tests/initTest.py

R

First, make sure that RSparkling is installed on the node from which we want to run it. We can install RSparkling as follows:

# Download, install, and initialize the H2O package for R.
# In this case we are using rel-zizler 4 (3.34.0.4)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zizler/4/R")

# Download, install, and initialize RSparkling
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.34.0.4-1-3.0/R")

To start H2OContext in an interactive shell, run the following code in R or RStudio:

library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
                 image = "h2oai/sparkling-water-r:3.34.0.4-1-3.0",
                 account = "default",
                 executors = 2,
                 version = "3.0.3",
                 conf = list(
                         "spark.ext.h2o.backend.cluster.mode"="external",
                         "spark.ext.h2o.external.start.mode"="auto",
                         "spark.ext.h2o.external.auto.start.backend"="kubernetes",
                         "spark.ext.h2o.external.memory"="2G",
                         "spark.ext.h2o.external.cluster.size"="2",
                         "spark.ext.h2o.external.k8s.docker.image"="h2oai/sparkling-water-external-backend:3.34.0.4-1-3.0",
                         "spark.kubernetes.file.upload.path"="file:///tmp"),
                 ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)

We can also submit an RSparkling batch job. In that case, create a file called batch.R with the content from the code box above and run:

Rscript --default-packages=methods,utils batch.R

Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.