Download Sparkling Water 2.1.4

Integration info

H2O version: 3.10.4.5 ueno (documentation)
Spark version: 2.1.0 (documentation)

Get started with Sparkling Water in a few easy steps

1. Download Spark (if not already installed) from the Spark Downloads Page.

Choose a Spark release: 2.1.0
Choose a package type: Pre-built for Hadoop 2.4 and later

2. Point SPARK_HOME to the existing Spark installation and export the MASTER variable:

export SPARK_HOME="/path/to/spark/installation"
# Run Spark locally with as many worker threads as there are logical cores.
export MASTER="local[*]"

3. From your terminal, run:

cd ~/Downloads
unzip sparkling-water-2.1.4.zip
cd sparkling-water-2.1.4
bin/sparkling-shell --conf "spark.executor.memory=1g"

4. Create an H₂O cloud inside the Spark cluster:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sparkSession)
import h2oContext._

5. Follow this demo, which imports airlines and weather data and runs predictions on delays.
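
Step 2 above depends on SPARK_HOME and MASTER being exported before the shell starts. As a minimal sketch, the check below reports missing settings; `check_spark_env` is a hypothetical helper written for illustration and is not part of Sparkling Water. It assumes only that a Spark installation contains `bin/spark-submit`.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the SPARK_HOME / MASTER settings."""
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isfile(os.path.join(spark_home, "bin", "spark-submit")):
        # A Spark installation is expected to ship bin/spark-submit.
        problems.append("SPARK_HOME does not contain bin/spark-submit")

    if not env.get("MASTER"):
        problems.append("MASTER is not set")

    return problems
```

Calling `check_spark_env()` with no arguments inspects the current process environment; an empty list means both variables look usable.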

Launch Sparkling Water on Hadoop using Yarn.

1. Download Spark (if not already installed) from the Spark Downloads Page.

Choose a Spark release: 2.1.0
Choose a package type: Pre-built for Hadoop 2.4 and later

2. Point SPARK_HOME to an existing installation of Spark:

export SPARK_HOME='/path/to/spark/installation'

3. Set the HADOOP_CONF_DIR and Spark MASTER environment variables:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER="yarn-client"

4. Download Sparkling Water and launch Sparkling Shell on YARN:

wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.1/4/sparkling-water-2.1.4.zip
unzip sparkling-water-2.1.4.zip
cd sparkling-water-2.1.4/
bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client

5. Create an H₂O cloud inside the Spark cluster:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sparkSession)
import h2oContext._
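
The download URL in step 4 above appears to encode the release as `rel-<major>.<minor>/<patch>`. The sketch below rebuilds that URL from a version string. The pattern is inferred from the single URL on this page and is an assumption, not a documented scheme, and `sparkling_water_zip_url` is a hypothetical helper name.

```python
def sparkling_water_zip_url(version):
    """Build the release URL for a version string such as "2.1.4".

    Assumption: other releases follow the same layout as the 2.1.4 URL.
    """
    major, minor, patch = version.split(".")
    base = "http://h2o-release.s3.amazonaws.com/sparkling-water"
    return "{0}/rel-{1}.{2}/{3}/sparkling-water-{4}.zip".format(
        base, major, minor, patch, version)
```

For "2.1.4" this reproduces the URL passed to wget above; for any other version, confirm that the file exists before relying on it.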

Launch H2O on a Standalone Spark Cluster

1. Download Spark (if not already installed) from the Spark Downloads Page.

Choose a Spark release: 2.1.0
Choose a package type: Pre-built for Hadoop 2.4 and later

2. Point SPARK_HOME to an existing installation of Spark:

export SPARK_HOME='/path/to/spark/installation'

3. From your terminal, run:

cd ~/Downloads
unzip sparkling-water-2.1.4.zip
cd sparkling-water-2.1.4
bin/launch-spark-cloud.sh
export MASTER="spark://localhost:7077"
bin/sparkling-shell

4. Create an H₂O cloud inside the Spark cluster:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sparkSession)
import h2oContext._
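
The launch modes described so far differ mainly in the value of MASTER: `local[*]`, `yarn-client`, and `spark://localhost:7077`. As a rough sketch, the hypothetical helper below (not part of Spark or Sparkling Water) classifies a master URL so a launch script can tell which mode it is about to use. It covers only the forms shown on this page plus plain `local`, `yarn`, and `yarn-cluster`.

```python
def master_kind(master):
    """Classify a Spark master URL as 'local', 'standalone', 'yarn' or 'other'."""
    if master == "local" or master.startswith("local["):
        # local, local[K] and local[*] all run Spark inside one JVM.
        return "local"
    if master.startswith("spark://"):
        # spark://HOST:PORT points at a standalone Spark master.
        return "standalone"
    if master in ("yarn", "yarn-client", "yarn-cluster"):
        return "yarn"
    return "other"
```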

Kluster

Kluster mode lets Sparkling Water connect to an external H2O cluster (standalone or Hadoop). The external cluster must be started with the corresponding extended H2O build, which can be downloaded as described below.

1. Download and unpack the Sparkling Water distribution.

2. Download the corresponding h2odriver for your Hadoop distribution (e.g., hdp2.2, cdh5.4) or the standalone one:

bin/get-extended-h2o.sh standalone

3. Start the H2O cluster, for example in standalone mode:

java -cp h2odriver-extended.jar water.H2OApp -md5skip -name test

4. In your Sparkling Water application, create an H2OContext:

Scala

import org.apache.spark.h2o._
val conf = new H2OConf(spark).setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(spark, conf)

Python

from pysparkling import *
conf = H2OConf(spark).set_external_cluster_mode().use_manual_cluster_start().set_cloud_name("test")
hc = H2OContext.getOrCreate(spark, conf)

List of supported Hadoop distributions: standalone, cdh5.2, cdh5.3, cdh5.4, cdh5.5, cdh5.6, cdh5.7, cdh5.8, hdp2.1, hdp2.2, hdp2.3, hdp2.4, hdp2.5, hdp2.6, mapr4.0, mapr5.0, mapr5.1, iop4.2
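
Because `bin/get-extended-h2o.sh` takes the distribution name as its argument, a wrapper script may want to validate that name against the list above before calling it. The following is a minimal sketch; `get_extended_h2o_command` is a hypothetical helper, and the set simply copies the distributions listed on this page.

```python
# Distribution names copied from the supported list on this page.
SUPPORTED_DISTRIBUTIONS = frozenset("""
standalone cdh5.2 cdh5.3 cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8
hdp2.1 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6
mapr4.0 mapr5.0 mapr5.1 iop4.2
""".split())


def get_extended_h2o_command(distribution):
    """Return the argv list for bin/get-extended-h2o.sh, or raise ValueError."""
    if distribution not in SUPPORTED_DISTRIBUTIONS:
        raise ValueError("unsupported distribution: {0!r}".format(distribution))
    return ["bin/get-extended-h2o.sh", distribution]
```

The returned list can be passed to `subprocess.check_call` from inside the unpacked Sparkling Water directory.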

For more information, see the Kluster documentation.

Gradle-style specification for Maven artifacts

See the h2o-droplets github repository for a working example.

repositories {
mavenCentral()
}

dependencies {
compile "ai.h2o:sparkling-water-core_2.11:2.1.4"
}

See Maven Central for artifact details.
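
The coordinate `ai.h2o:sparkling-water-core_2.11:2.1.4` packs four pieces of information: the group, the artifact name, the Scala binary version the artifact was built for (the `_2.11` suffix), and the Sparkling Water version. The sketch below splits a coordinate into those parts. `parse_coordinate` and `Coordinate` are hypothetical names used for illustration, and the parser assumes the artifact ID always carries a Scala suffix.

```python
from collections import namedtuple

Coordinate = namedtuple("Coordinate", "group artifact scala_version version")


def parse_coordinate(coordinate):
    """Split 'group:artifact_scalaVersion:version' into its four parts."""
    group, artifact_id, version = coordinate.split(":")
    # The Scala binary version follows the last underscore of the artifact ID.
    artifact, _, scala_version = artifact_id.rpartition("_")
    return Coordinate(group, artifact, scala_version, version)
```

The Scala suffix should match the Scala version of the Spark build in use, so it is worth checking whenever the Spark installation changes.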

RSparkling

Please follow the installation and usage instructions on the RSparkling page.

H2O R Client

Once you have an H2OContext available in RSparkling, any commands available in the R client can be used. For more information, please visit the H2O-R page.

PySparkling

PySparkling is a Python client for Sparkling Water applications. To use it:

1. Ensure that Spark is installed and that the MASTER and SPARK_HOME environment variables are set properly.

2. Download and unpack the Sparkling Water distribution.

3. Run the PySparkling shell:

./bin/pysparkling

4. In your PySparkling application, create an H2OContext:

from pysparkling import *
hc = H2OContext.getOrCreate(spark)

PySparkling installed from the PyPI repository

1. Install PySparkling using pip:

pip install pysparkling_2.1

2. In your Python code, first create a SparkSession. For this step, you need to have the PySpark package installed.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparklingWaterApp").getOrCreate()

3. Start the H2OContext:

from pysparkling import *
hc = H2OContext.getOrCreate(spark)
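
The steps above need both the `pyspark` and `pysparkling` modules to be importable. As a small sketch, the hypothetical helper below (not part of PySparkling) reports which of them cannot be found, without actually importing them, so a script can fail early with a clear message.

```python
import importlib.util


def missing_modules(names=("pyspark", "pysparkling")):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty result from `missing_modules()` means both modules are on the Python path; it does not confirm that their versions match the Spark installation.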

H2O Python Client

Once you have an H2OContext available, any API calls available in the H2O Python client can be used. For more information about the Python client, please visit the H2O-Python page.

Documentation

Demo Example from Git
Scala Developer Documentation (Scaladoc)
Sparkling Water Application Template