Installing and Starting

This section describes how to download and run Sparkling Water in different environments. Refer to the PySparkling and RSparkling sections for instructions on installing and running PySparkling and RSparkling.

Download and Run Locally

This section describes how to quickly get started with Sparkling Water on your personal computer (in Spark’s local cluster mode).

  1. Download and install Spark (if not already installed) from the Spark Downloads page.

    • Choose Spark release: 3.0.3

    • Choose a package type: Pre-built for Hadoop 2.7 and later

  2. Point SPARK_HOME to the existing installation of Spark and export variable MASTER.

export SPARK_HOME="/path/to/spark/installation"
# To launch a local Spark cluster.
export MASTER="local[*]"
  3. From your terminal, run:

cd ~/Downloads
unzip sparkling-water-3.44.0.2-1-3.0.zip
cd sparkling-water-3.44.0.2-1-3.0
bin/sparkling-shell
  4. Create an H2O cloud inside the Spark cluster:

import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._
  5. Begin using Sparkling Water by following this demo, which imports airlines and weather data and runs predictions on delays.

Note: When copying code into the Scala Sparkling shell, use the :paste mode of the Scala REPL. Otherwise, you might hit a compiler error.
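
Before launching the shell, it can help to confirm that the variables from step 2 are set. The sketch below is one way to do that; check_spark_env is a hypothetical helper name, not part of Sparkling Water, and it assumes only that a Spark installation ships bin/spark-submit.

```shell
# Hypothetical pre-flight check (not part of Sparkling Water).
# Runs in a subshell, so a failure does not close your terminal session.
check_spark_env() (
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    exit 1
  fi
  # A Spark installation contains bin/spark-submit.
  if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
    echo "No executable bin/spark-submit under $SPARK_HOME" >&2
    exit 1
  fi
  if [ -z "${MASTER:-}" ]; then
    echo "MASTER is not set" >&2
    exit 1
  fi
  echo "SPARK_HOME=$SPARK_HOME MASTER=$MASTER"
)

# Usage: check_spark_env && bin/sparkling-shell
```

The check only inspects the environment; it does not start Spark or verify the Spark version.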

Run on Hadoop

This section describes how to launch Sparkling Water on Hadoop using YARN.

  1. Download Spark (if not already installed) from the Spark Downloads page.

    • Choose Spark release: 3.0.3

    • Choose a package type: Pre-built for Hadoop 2.7 and later

  2. Point SPARK_HOME to the existing installation of Spark.

export SPARK_HOME='/path/to/spark/installation'
  3. Set the HADOOP_CONF_DIR and Spark MASTER environment variables.

export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER="yarn"
  4. Download Sparkling Water and use sparkling-shell to launch the Sparkling shell on YARN.

wget http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.44.0.2-1-3.0/sparkling-water-3.44.0.2-1-3.0.zip
unzip sparkling-water-3.44.0.2-1-3.0.zip
cd sparkling-water-3.44.0.2-1-3.0/
bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn --deploy-mode client
  5. Create an H2O cluster inside the Spark cluster:

import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._

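
The download link in step 4 repeats the release string in several places. The sketch below builds that link from a single version argument. sparkling_water_url is a hypothetical helper, and the URL layout (including the reading of the trailing "3.0" as the Spark version) is inferred from the one link shown above, so treat it as an assumption for any other release.

```shell
# Hypothetical helper: compose the download URL from a release string.
# The URL pattern is inferred from the single link in this section.
sparkling_water_url() (
  version="$1"                     # e.g. 3.44.0.2-1-3.0
  spark_version="${version##*-}"   # text after the last '-', e.g. 3.0
  echo "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-${spark_version}/${version}/sparkling-water-${version}.zip"
)

# Usage: wget "$(sparkling_water_url 3.44.0.2-1-3.0)"
```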

Run on a Standalone Spark Cluster

This section describes how to launch H2O on a standalone Spark cluster.

  1. Download Spark (if not already installed) from the Spark Downloads page.

    • Choose Spark release: 3.0.3

    • Choose a package type: Pre-built for Hadoop 2.7 and later

  2. Point SPARK_HOME to the existing installation of Spark.

export SPARK_HOME='/path/to/spark/installation'
  3. From your terminal, start the standalone cluster, export MASTER so that it points to the cluster, and launch the shell:

cd ~/Downloads
unzip sparkling-water-3.44.0.2-1-3.0.zip
cd sparkling-water-3.44.0.2-1-3.0
bin/launch-spark-cloud.sh
export MASTER="spark://localhost:7077"
bin/sparkling-shell
  4. Create an H2O cloud inside the Spark cluster:

import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._

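
The three setups so far differ mainly in the value of MASTER: local[*], yarn, and spark://localhost:7077. As a rough sketch, the hypothetical helper below (classify_master, not part of Spark or Sparkling Water) reports which of these three modes a given value selects. It deliberately ignores any other master URL schemes that Spark may accept.

```shell
# Hypothetical helper: map a MASTER value to the mode used in this document.
classify_master() (
  case "$1" in
    local|local\[*\]) echo "local" ;;        # local, local[*], local[4], ...
    yarn)             echo "yarn" ;;
    spark://*:*)      echo "standalone" ;;   # spark://host:port
    *)                echo "unrecognized"; exit 1 ;;
  esac
)

# Usage: classify_master "$MASTER"
```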

External Backend

The Sparkling Water external backend supports a connection to external H2O clusters (standalone/Hadoop). The H2O cluster needs to be started with the corresponding H2O driver, which can be downloaded as described below.

  1. Download and unpack the Sparkling Water distribution.

  2. Download the corresponding H2O driver for your Hadoop distribution (e.g., hdp2.2, cdh5.4) or the standalone driver:

export H2O_DRIVER_JAR=$(/path/to/sparkling-water-3.44.0.2-1-3.0/bin/get-h2o-driver.sh hdp2.2)
  3. Set the path to sparkling-water-assembly-extensions-3.44.0.2-1-3.0-all.jar, which is bundled in the Sparkling Water archive:

SW_EXTENSIONS_ASSEMBLY=/path/to/sparkling-water-3.44.0.2-1-3.0/jars/sparkling-water-assembly-extensions-3.44.0.2-1-3.0-all.jar
  4. Start an H2O cluster on Hadoop:

hadoop jar $H2O_DRIVER_JAR -libjars $SW_EXTENSIONS_ASSEMBLY -sw_ext_backend -jobname test -nodes 3 -mapperXmx 6g
  5. In your Sparkling Water application, create H2OContext:

Scala

import ai.h2o.sparkling._
val conf = new H2OConf().setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(conf)

Python

from pysparkling import *
conf = H2OConf().setExternalClusterMode().useManualClusterStart().setCloudName("test")
hc = H2OContext.getOrCreate(conf)

Note: The following is a list of supported Hadoop distributions: cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 cdp7.1 cdp7.2 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 mapr6.2 iop4.2 emr6.10
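
Because get-h2o-driver.sh takes the distribution name as its argument, a typo there is easy to make. The sketch below checks a name against the list in the note above before the driver is requested. is_supported_distribution and SUPPORTED_DISTRIBUTIONS are hypothetical names, and the list is copied from this document, so it may not match other releases.

```shell
# Distribution names copied from the note above.
SUPPORTED_DISTRIBUTIONS="cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 cdp7.1 cdp7.2 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 mapr6.2 iop4.2 emr6.10"

# Hypothetical helper: succeed only if $1 is an exact entry in the list.
is_supported_distribution() (
  for d in $SUPPORTED_DISTRIBUTIONS; do
    [ "$d" = "$1" ] && exit 0
  done
  exit 1
)

# Usage: is_supported_distribution hdp2.2 || echo "unknown distribution" >&2
```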

For more information, see the Sparkling Water Backends documentation.


Use from Maven

This section provides a Gradle-style specification for Maven artifacts.

See the h2o-droplets GitHub repository for a working example.

repositories {
  mavenCentral()
}

dependencies {
  // Older Gradle builds used "compile"; it was removed in Gradle 7.
  implementation "ai.h2o:sparkling-water-package_2.12:3.44.0.2-1-3.0"
}

See Maven Central for artifact details.


Sparkling Water as a Spark Package

This section describes how to start Spark with Sparkling Water enabled via Spark package.

  1. Ensure that Spark is installed and that the MASTER and SPARK_HOME environment variables are properly set.

  2. Start Spark and point it to the Maven coordinates of Sparkling Water:

$SPARK_HOME/bin/spark-shell --packages ai.h2o:sparkling-water-package_2.12:3.44.0.2-1-3.0
  3. Create an H2O cloud inside the Spark cluster:

import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._
