Running Sparkling Water in Kubernetes
Sparkling Water can run inside a Kubernetes cluster. Kubernetes is supported for Spark version 2.4 and later.
Before we start, please check the following:
Make sure we are familiar with running Spark on Kubernetes, as described in the Spark Kubernetes documentation.
Ensure that we have a working Kubernetes cluster and kubectl installed.
Ensure that SPARK_HOME points to the home directory of our Spark distribution, version 3.0.3.
Run kubectl cluster-info to obtain the Kubernetes master URL.
Make sure an internet connection is available so that Kubernetes can download the Sparkling Water Docker images.
If non-default network policies are applied to the namespace where Sparkling Water is supposed to run, make sure that the following ports are exposed: all Spark ports, plus ports 54321 and 54322, which H2O needs in order to communicate.
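As an illustration of the kubectl cluster-info step above, the following Python sketch turns the command's output into the master URL that replaces k8s://KUBERNETES_ENDPOINT in the commands below. The helper names are hypothetical, and the sketch assumes the output contains a line of the form "Kubernetes control plane is running at <URL>" (or "Kubernetes master is running at <URL>" in older kubectl releases), possibly wrapped in ANSI color codes.

```python
import re
import subprocess

# Matches ANSI color sequences that kubectl may emit on a terminal.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

# Assumed wording of the relevant cluster-info line (it differs between kubectl releases).
ENDPOINT_LINE = re.compile(r"Kubernetes (?:control plane|master) is running at (\S+)")


def master_url_from_cluster_info(output: str) -> str:
    """Return a Spark master URL (k8s://...) parsed from `kubectl cluster-info` text."""
    match = ENDPOINT_LINE.search(ANSI_ESCAPE.sub("", output))
    if match is None:
        raise ValueError("no Kubernetes endpoint found in cluster-info output")
    return "k8s://" + match.group(1)


def current_master_url() -> str:
    """Run `kubectl cluster-info` and derive the master URL from its output."""
    result = subprocess.run(
        ["kubectl", "cluster-info"], capture_output=True, text=True, check=True
    )
    return master_url_from_cluster_info(result.stdout)
```

Calling current_master_url() requires kubectl on the PATH and a reachable cluster; the parsing function can be used on its own with saved output.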
The examples below use the default Kubernetes namespace, which we enable for Spark as follows:
kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default
We can also use a different namespace setup for Spark. In that case, remember to pass
--conf spark.kubernetes.authenticate.driver.serviceAccountName=serviceName
to our Spark commands.
Internal Backend
With the internal backend of Sparkling Water, we need to pass the option spark.scheduler.minRegisteredResourcesRatio=1 to our Spark job invocation. This makes Spark wait for all requested resources, so Sparkling Water starts H2O on all requested executors.
Dynamic allocation must be disabled in Spark.
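The two requirements above can be captured in a small helper that assembles the spark-submit arguments. This is a minimal Python sketch, not part of Sparkling Water: the function name internal_backend_args is hypothetical, and spark.dynamicAllocation.enabled=false is assumed to be the setting used to disable dynamic allocation (it is Spark's standard switch, but the text above does not name it).

```python
from typing import Dict, List, Optional

# Settings the internal backend depends on; they always override user-supplied values.
REQUIRED_CONF = {
    "spark.scheduler.minRegisteredResourcesRatio": "1",
    "spark.dynamicAllocation.enabled": "false",  # assumption: standard Spark switch
}


def internal_backend_args(
    master: str,
    image: str,
    executors: int,
    application: str,
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build a spark-submit argument list for cluster deploy mode."""
    conf = dict(extra_conf or {})
    conf["spark.kubernetes.container.image"] = image
    conf["spark.executor.instances"] = str(executors)
    # Applied last so the required values cannot be overridden by extra_conf.
    conf.update(REQUIRED_CONF)

    args = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(application)
    return args
```

The returned list can be passed to subprocess.run; the required options are applied last, so a conflicting value in extra_conf is replaced rather than silently accepted.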
Scala
Both cluster and client deployment modes of Kubernetes are supported.
To submit a Scala job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar
To start an interactive shell in client mode:
Create a headless service so that Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 -- /bin/bash
Inside the container, start the shell:
$SPARK_HOME/bin/spark-shell \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app
Inside the shell, run:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode:
First, create the headless service as described in step 1 above, then run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar
Python
Both cluster and client deployment modes of Kubernetes are supported.
To submit a Python job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
local:///opt/sparkling-water/tests/initTest.py
To start an interactive shell in client mode:
Create a headless service so that Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 -- /bin/bash
Inside the container, start the shell:
$SPARK_HOME/bin/pyspark \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app
Inside the shell, run:
from pysparkling import *
hc = H2OContext.getOrCreate()
To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode:
First, create the headless service as described in step 1 above, then run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=3 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
local:///opt/sparkling-water/tests/initTest.py
R
First, make sure that RSparkling is installed on the node from which we want to run it. We can install RSparkling as follows:
# Download, install, and initialize the H2O package for R.
# In this case we are using rel-zygmund 4 (3.38.0.4)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zygmund/4/R")
# Download, install, and initialize the RSparkling
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.38.0.4-1-3.0/R")
To start H2OContext in an interactive shell, run the following code in R or RStudio:
library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
image = "h2oai/sparkling-water-r:3.38.0.4-1-3.0",
account = "default",
executors = 3,
conf = list("spark.kubernetes.file.upload.path"="file:///tmp"),
version = "3.0.3",
ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)
We can also submit an RSparkling batch job. In that case, create a file called batch.R with the content of the code above and run:
Rscript --default-packages=methods,utils batch.R
Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.
Manual Mode of External Backend
The Sparkling Water external backend can also be used in Kubernetes. First, we need to start an external H2O backend on Kubernetes. To do so, follow the steps in the H2O on Kubernetes documentation, with one important exception: the image must be h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0 rather than the base H2O image mentioned in the H2O documentation, because Sparkling Water extends the H2O image with additional dependencies.
For Sparkling Water to connect to the H2O cluster, we need the address of the leader node of the H2O cluster. If we followed the H2O documentation on how to start an H2O cluster on Kubernetes, the address is
h2o-service.default.svc.cluster.local:54321
where the first part is the H2O headless service name and the second part is the name of the namespace.
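The address follows the usual Kubernetes DNS pattern for services, so it can be composed from the service name and the namespace. A minimal Python sketch, assuming the default cluster domain cluster.local and the H2O port 54321 shown above; the function name h2o_leader_address is hypothetical:

```python
def h2o_leader_address(
    service: str,
    namespace: str = "default",
    port: int = 54321,
    cluster_domain: str = "cluster.local",  # assumption: default Kubernetes cluster domain
) -> str:
    """Compose <service>.<namespace>.svc.<cluster domain>:<port>."""
    for label in (service, namespace):
        # A service or namespace name is a single DNS label, so it cannot contain dots.
        if not label or "." in label:
            raise ValueError(f"invalid DNS label: {label!r}")
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


# Value for the spark.ext.h2o.cloud.representative option.
representative = h2o_leader_address("h2o-service")
print(representative)  # h2o-service.default.svc.cluster.local:54321
```

If the H2O service runs in another namespace, only the namespace argument changes; the service name must match the headless service created for H2O.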
After creating the external H2O backend, we can connect to it from Sparkling Water clients as follows:
Scala
Both cluster and client deployment modes of Kubernetes are supported.
To submit a Scala job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar
To start an interactive shell in client mode:
Create a headless service so that Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 -- /bin/bash
Inside the container, start the shell:
$SPARK_HOME/bin/spark-shell \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root
Inside the shell, run:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode:
First, create the headless service as described in step 1 above, then run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar
Python
Both cluster and client deployment modes of Kubernetes are supported.
To submit a Python job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.py
To start an interactive shell in client mode:
Create a headless service so that Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 -- /bin/bash
Inside the container, start the shell:
$SPARK_HOME/bin/pyspark \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root
Inside the shell, run:
from pysparkling import *
hc = H2OContext.getOrCreate()
To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode:
First, create the headless service as described in step 1 above, then run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.py
R
First, make sure that RSparkling is installed on the node from which we want to run it. We can install RSparkling as follows:
# Download, install, and initialize the H2O package for R.
# In this case we are using rel-zygmund 4 (3.38.0.4)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zygmund/4/R")
# Download, install, and initialize the RSparkling
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.38.0.4-1-3.0/R")
To start H2OContext in an interactive shell, run the following code in R or RStudio:
library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
image = "h2oai/sparkling-water-r:3.38.0.4-1-3.0",
account = "default",
executors = 2,
version = "3.0.3",
conf = list(
"spark.ext.h2o.backend.cluster.mode"="external",
"spark.ext.h2o.external.start.mode"="manual",
"spark.ext.h2o.external.memory"="2G",
"spark.ext.h2o.cloud.representative"="h2o-service.default.svc.cluster.local:54321",
"spark.ext.h2o.cloud.name"="root",
"spark.kubernetes.file.upload.path"="file:///tmp"),
ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)
We can also submit an RSparkling batch job. In that case, create a file called batch.R with the content of the code above and run:
Rscript --default-packages=methods,utils batch.R
Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.
Automatic Mode of External Backend
In automatic mode, Sparkling Water starts the external H2O backend on Kubernetes automatically. The requirement is that the driver node is configured to communicate with the Kubernetes cluster. The Docker image for the external H2O backend is specified using the spark.ext.h2o.external.k8s.docker.image option.
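The options that switch Sparkling Water to the automatic external backend can be grouped in one place. This Python sketch only collects the settings used in the examples of this section and renders them as --conf flags; the function names are hypothetical, and the default values (cluster size 2, 2G of memory) are simply the ones used in those examples.

```python
from typing import Dict, List


def auto_external_backend_conf(
    image: str, cluster_size: int = 2, memory: str = "2G"
) -> Dict[str, str]:
    """Spark options that make Sparkling Water start external H2O on Kubernetes."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return {
        "spark.ext.h2o.backend.cluster.mode": "external",
        "spark.ext.h2o.external.start.mode": "auto",
        "spark.ext.h2o.external.auto.start.backend": "kubernetes",
        "spark.ext.h2o.external.cluster.size": str(cluster_size),
        "spark.ext.h2o.external.memory": memory,
        "spark.ext.h2o.external.k8s.docker.image": image,
    }


def as_conf_flags(conf: Dict[str, str]) -> List[str]:
    """Render a configuration mapping as spark-submit --conf arguments."""
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags
```

Keeping the options in a mapping makes it easy to reuse the same values for spark-submit, spark-shell, and pyspark invocations.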
Scala
Both cluster and client deployment modes of Kubernetes are supported.
To submit a Scala job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0 \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar
To start an interactive shell in client mode:
Create a headless service so that Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 -- /bin/bash
Inside the container, start the shell:
$SPARK_HOME/bin/spark-shell \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0
Inside the shell, run:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode:
First, create the headless service as described in step 1 above, then run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0 \
--class ai.h2o.sparkling.KubernetesTest \
local:///opt/sparkling-water/tests/kubernetesTest.jar
Python
Both cluster and client deployment modes of Kubernetes are supported.
To submit a Python job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0 \
local:///opt/sparkling-water/tests/initTest.py
To start an interactive shell in client mode:
Create a headless service so that Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 -- /bin/bash
Inside the container, start the shell:
$SPARK_HOME/bin/pyspark \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0
Inside the shell, run:
from pysparkling import *
hc = H2OContext.getOrCreate()
To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode:
First, create the headless service as described in step 1 above, then run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 -- \
$SPARK_HOME/bin/spark-submit \
--master "k8s://KUBERNETES_ENDPOINT" \
--deploy-mode client \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:3.38.0.4-1-3.0 \
--conf spark.executor.instances=2 \
--conf spark.driver.host=sparkling-water-app \
--conf spark.kubernetes.driver.pod.name=sparkling-water-app \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=auto \
--conf spark.ext.h2o.external.auto.start.backend=kubernetes \
--conf spark.ext.h2o.external.cluster.size=2 \
--conf spark.ext.h2o.external.memory=2G \
--conf spark.ext.h2o.external.k8s.docker.image=h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0 \
local:///opt/sparkling-water/tests/initTest.py
R
First, make sure that RSparkling is installed on the node from which we want to run it. We can install RSparkling as follows:
# Download, install, and initialize the H2O package for R.
# In this case we are using rel-zygmund 4 (3.38.0.4)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zygmund/4/R")
# Download, install, and initialize the RSparkling
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.38.0.4-1-3.0/R")
To start H2OContext in an interactive shell, run the following code in R or RStudio:
library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
image = "h2oai/sparkling-water-r:3.38.0.4-1-3.0",
account = "default",
executors = 2,
version = "3.0.3",
conf = list(
"spark.ext.h2o.backend.cluster.mode"="external",
"spark.ext.h2o.external.start.mode"="auto",
"spark.ext.h2o.external.auto.start.backend"="kubernetes",
"spark.ext.h2o.external.memory"="2G",
"spark.ext.h2o.external.cluster.size"="2",
"spark.ext.h2o.external.k8s.docker.image"="h2oai/sparkling-water-external-backend:3.38.0.4-1-3.0",
"spark.kubernetes.file.upload.path"="file:///tmp"),
ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)
We can also submit an RSparkling batch job. In that case, create a file called batch.R with the content of the code above and run:
Rscript --default-packages=methods,utils batch.R
Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.